Build a single-shot multimodal restock ticket with MNN Omni
In the previous sections, you validated text, vision, and audio inference in isolation. This final section brings those inputs together: instead of testing a single modality at a time, you combine a shelf image and a voice note into one multimodal prompt and generate a single structured restock ticket that can support a store action.

By the end of this section, you will have completed a full end-to-end multimodal workflow on an Armv9 CPU: shelf coverage from an image, tasks and deadlines from audio, and a combined ticket from one inference call.
You will combine:

- A shelf photo of the pet food aisle
- A short spoken restock note

into one multimodal inference using MNN Omni.
This mirrors a practical retail workflow. A store associate captures a shelf photo, records a short voice note, and the system produces a structured ticket that can be acted on immediately.
This section reuses the same assets from the previous examples:
- ~/mnn/assets/Pet_Food_Aisle.jpg
- ~/mnn/assets/restock_note.wav

Using local files keeps the example reproducible and avoids downloader or redirect issues.
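Before building the prompt, you can confirm that both assets exist and are non-empty. This is a minimal sketch using standard shell tools; adjust the paths if you stored the files elsewhere:

```bash
# Confirm both input assets exist and are non-empty before running inference
for f in "$HOME/mnn/assets/Pet_Food_Aisle.jpg" "$HOME/mnn/assets/restock_note.wav"; do
  if [ -s "$f" ]; then
    echo "OK:      $f ($(du -h "$f" | cut -f1))"
  else
    echo "MISSING: $f" >&2
  fi
done
```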
To keep the output readable and easy to validate, request a single-line ticket with semicolon-separated fields.
The ticket should include:

- The shelf location and coverage by level (top, middle, bottom)
- A priority zone
- Two tasks, each with an item, a quantity, and a zone
- A deadline and a substitution policy
- A short operational note and a confidence level
The goal is not perfect retail planning. The goal is to show that one prompt can combine visual context and spoken instructions into a compact result.
Create the prompt file and keep the full prompt on one line:
```bash
cat > ~/mnn/prompt_final_multimodal.txt <<EOF
<img>$HOME/mnn/assets/Pet_Food_Aisle.jpg</img> <audio>$HOME/mnn/assets/restock_note.wav</audio> You are an on-device retail replenishment assistant. Use the IMAGE to estimate facing coverage for the main LEFT shelf by level as high|medium|low and pick a priority zone. Use the AUDIO to extract item requests including quantities, zones, deadline, and substitution policy. Output EXACTLY ONE line using bullet-style segments separated by semicolons: Restock ticket; - Location: pet food aisle left shelf; - Coverage: top=<high|medium|low>, middle=<high|medium|low>, bottom=<high|medium|low>; - Priority zone: <top|middle|bottom>-<left|center|right>; - Task 1: <generic item> | qty <number or NOT_SURE> | zone <top|middle|bottom>-<left|center|right or NOT_SURE>; - Task 2: <generic item> | qty <number or NOT_SURE> | zone <top|middle|bottom>-<left|center|right or NOT_SURE>; - Deadline: <time or NOT_SURE>; - Substitution: <use similar substitute|no substitute|NOT_SURE>; - Notes: <short operational note>; - Confidence: <high|medium|low>. Rules: Qty/Deadline/Substitution must come from audio; if audio contains explicit numbers you must output them.
EOF
```
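Because the instructions above ask for the entire prompt on a single line, it is worth verifying that the heredoc produced exactly one line and that the shell expanded `$HOME` inside the `<img>` and `<audio>` tags. A quick sanity check:

```bash
# Should report 1: the whole prompt must sit on a single line
wc -l ~/mnn/prompt_final_multimodal.txt

# Preview the start of the prompt to confirm the media tags contain absolute paths
head -c 160 ~/mnn/prompt_final_multimodal.txt; echo
```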
Run llm_demo with the model configuration and the multimodal prompt:
```bash
cd ~/mnn/MNN/build
./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/prompt_final_multimodal.txt
```
The output is similar to:
```output
config path is /home/radxa/mnn/Qwen2.5-Omni-7B-MNN/config.json
CPU Group: [ 1 2 3 4 ], 799999 - 1800968
CPU Group: [ 7 8 ], 799897 - 2199795
CPU Group: [ 5 6 ], 799897 - 2299896
CPU Group: [ 9 10 ], 799897 - 2399998
CPU Group: [ 0 11 ], 799897 - 2500100
The device supports: i8sdot:1, fp16:1, i8mm: 1, sve2: 1, sme2: 0
main, 274, cost time: 5781.158203 ms
Prepare for tuning opt Begin
Prepare for tuning opt End
main, 282, cost time: 787.130981 ms
prompt file is /home/radxa/mnn/prompt_final_multimodal.txt
Restock the bottom left large pet food bag, add 10 bags; also restock the middle left canned food, add 24 cans. Finish before 3 PM today. If something is out of stock, use a similar substitute.- Restock ticket: Location: pet food aisle left shelf, Coverage: top=low, middle=low, bottom=high, Priority zone: bottom-left, Task 1: pet food | qty 10 | zone bottom-left, Task 2: canned food | qty 24 | zone middle-left, Deadline: 3 PM today, Substitution: use similar substitute, Notes: Ensure proper storage conditions, Confidence: high.
```
The exact wording can vary. What matters is that image-derived fields stay tied to the shelf photo, and audio-derived fields stay tied to the spoken instruction.
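Because the wording varies between runs, it is easier to validate structure than exact text. The sketch below assumes you captured a run's output to a log file with `tee` (the file name ~/mnn/run_baseline.log is just an illustration) and checks that every required field is present:

```bash
# Assumes the demo output was captured, for example:
#   ./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/prompt_final_multimodal.txt \
#     | tee ~/mnn/run_baseline.log
# Check that the generated ticket contains every required field.
for field in "Location:" "Coverage:" "Priority zone:" "Task 1:" "Task 2:" \
             "Deadline:" "Substitution:" "Notes:" "Confidence:"; do
  if grep -q "$field" ~/mnn/run_baseline.log; then
    echo "FOUND   $field"
  else
    echo "MISSING $field"
  fi
done
```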
To verify that the model is truly multimodal, perform a simple A/B test:
Swap the audio

Keep the image fixed and replace the voice note with a different instruction. Verify that:

- Coverage and Priority zone stay similar
- Task, qty, Deadline, and Substitution change to match the new audio

Swap the image

Keep the audio fixed and replace the shelf image. Verify that:

- Coverage and Priority zone change with the new image
- Task, qty, Deadline, and Substitution remain aligned with the original audio

In both runs, check that missing values are reported as NOT_SURE rather than invented. If the model adds extra text before or after the ticket, tighten the prompt and repeat the run.
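If you want to script the audio-swap half of this test, one approach is to rewrite only the `<audio>` path in a copy of the prompt file and compare the two tickets. This is a sketch: restock_note_v2.wav is a hypothetical second recording you would create yourself, and it reuses the run_baseline.log captured earlier.

```bash
# Create a variant prompt: same image, different (hypothetical) voice note
sed "s|restock_note.wav|restock_note_v2.wav|" \
  ~/mnn/prompt_final_multimodal.txt > ~/mnn/prompt_audio_swap.txt

cd ~/mnn/MNN/build
./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/prompt_audio_swap.txt \
  | tee ~/mnn/run_audio_swap.log

# Compare the tickets side by side: Coverage and Priority zone should stay
# similar, while Task, qty, Deadline, and Substitution should follow the new audio
grep "Restock ticket" ~/mnn/run_baseline.log ~/mnn/run_audio_swap.log
```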
You have completed this Learning Path when you can:
✓ Run text, vision, audio, and combined multimodal examples on Armv9
✓ Generate a final ticket from one image and one voice note
✓ Verify that changing one modality only affects the fields derived from that modality
In this section, you:

- Combined a shelf image and a voice note into one multimodal prompt
- Generated a single structured restock ticket from one inference call
- Verified with an A/B test that each modality drives only its own fields

You've built a complete multimodal inference workflow on Armv9 using only the CPU: text, vision, and audio validated individually, then combined into one operational result.
To extend the workflow, consider capturing your own shelf images and voice notes, adapting the ticket schema to a different operational task, or trying other Omni model variants.