Validate text-only inference with an Omni model on Armv9
In this section, you run a text-only baseline using the Omni model on an Armv9 Linux system. Before adding image and audio inputs, this baseline helps you confirm that the core inference path is working correctly with a simple prompt and predictable output behavior.
By the end of this section, you will be able to:

- Run a text-only prompt through llm_demo with the Omni model configuration
- Interpret the CPU feature flags that llm_demo reports at startup
- Verify that a prompt file is processed line by line
Create a small prompt file in your workspace:
cat > ~/mnn/text_baseline_prompt.txt <<'EOF'
You are an on-device inference assistant. In one short sentence, describe the benefits of multimodal on-device inference.
EOF
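Before running inference, you can confirm the file contents:

cat ~/mnn/text_baseline_prompt.txt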
llm_demo treats each line in the prompt file as a separate prompt, so keep each prompt on a single line.

Run llm_demo from the MNN build directory, passing both the model configuration and the prompt file:
cd ~/mnn/MNN/build
./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/text_baseline_prompt.txt
A successful run should load the model, report the detected CPU features, and return a text response for the prompt.
The output is similar to:
config path is /home/radxa/mnn/Qwen2.5-Omni-7B-MNN/config.json
CPU Group: [ 1 2 3 4 ], 799999 - 1800968
CPU Group: [ 7 8 ], 799897 - 2199795
CPU Group: [ 5 6 ], 799897 - 2299896
CPU Group: [ 9 10 ], 799897 - 2399998
CPU Group: [ 0 11 ], 799897 - 2500100
The device supports: i8sdot:1, fp16:1, i8mm: 1, sve2: 1, sme2: 0
main, 274, cost time: 5683.311035 ms
Prepare for tuning opt Begin
Prepare for tuning opt End
main, 282, cost time: 751.726013 ms
prompt file is /home/radxa/mnn/text_baseline_prompt.txt
The benefits are: - It reduces the need for cloud-based services. - It can be used in areas with limited internet connectivity. - It can save on data transfer costs. - It can be used for real-time processing. - It can improve the privacy of data processing.
Multimodal on-device inference has several benefits, including reducing the need for cloud-based services, allowing usage in areas with poor internet connectivity, saving on data transfer costs, enabling real-time processing, and enhancing privacy during data processing. It's a great way to handle various tasks efficiently, especially when you don't want to rely too much on the cloud.
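To keep a record of runs for later comparison, you can capture the console output to a log file (run.log is a placeholder name):

cd ~/mnn/MNN/build
./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/text_baseline_prompt.txt 2>&1 | tee ~/mnn/run.log
grep "The device supports" ~/mnn/run.log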
When you run llm_demo, you may see a line reporting CPU hardware features:
The device supports: i8sdot:1, fp16:1, i8mm: 1, sve2: 1, sme2: 0
This line reports which Arm architecture features are available on your CPU:

- i8sdot: Int8 dot product instructions (SDOT and UDOT), used to accelerate quantized inference
- fp16: half-precision floating-point arithmetic
- i8mm: Int8 matrix multiply instructions
- sve2: Scalable Vector Extension 2
- sme2: Scalable Matrix Extension 2

A value of 1 means the feature is supported, and 0 means it is not. In this example, every feature except SME2 is available.
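You can cross-check these flags against the kernel's view of the CPU. Note that the kernel uses different feature names; for example, the Int8 dot product feature appears as asimddp:

grep -m1 Features /proc/cpuinfo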
To verify that the prompt file is read line by line, append a second prompt:
cat >> ~/mnn/text_baseline_prompt.txt <<'EOF'
What is the Arm CPU architecture?
EOF
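The file should now contain two lines, one per prompt:

wc -l ~/mnn/text_baseline_prompt.txt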
Run the same command again:
cd ~/mnn/MNN/build
./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/text_baseline_prompt.txt
You should now see two responses, one for each line in the prompt file.
config path is /home/radxa/mnn/Qwen2.5-Omni-7B-MNN/config.json
CPU Group: [ 1 2 3 4 ], 799999 - 1800968
CPU Group: [ 7 8 ], 799897 - 2199795
CPU Group: [ 5 6 ], 799897 - 2299896
CPU Group: [ 9 10 ], 799897 - 2399998
CPU Group: [ 0 11 ], 799897 - 2500100
The device supports: i8sdot:1, fp16:1, i8mm: 1, sve2: 1, sme2: 0
main, 274, cost time: 5700.204102 ms
Prepare for tuning opt Begin
Prepare for tuning opt End
main, 282, cost time: 784.388000 ms
prompt file is /home/radxa/mnn/text_baseline_prompt.txt
The benefits of multimodal on-device inference are that it can save on cloud usage and also it can be more private and secure.
[...]
Arm is a processor architecture. It's used in mobile phones, embedded systems, etc. It has a RISC design, efficient performance, and low power consumption.
Output length and coherence can vary significantly when no generation limit is set. If a response rambles or repeats itself, this is normal model behavior without a max_new_tokens constraint. Focus on confirming that two distinct responses appear, one for each prompt line.
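If you want to bound the response length, you can set a generation limit in the model configuration. This is a minimal sketch, assuming your MNN build honors a max_new_tokens key in config.json (check the MNN LLM documentation for your version; keep your existing keys and add only the new entry, shown here with the other keys omitted for brevity):

{
  "max_new_tokens": 512
}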
To verify interactive mode, run llm_demo without a prompt file:
cd ~/mnn/MNN/build
./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json
Enter a short prompt and confirm that the model returns a reply. Press Ctrl+C to quit.
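Optionally, you can experiment with pinning llm_demo to one of the CPU groups reported in the startup log. For example, to restrict it to the highest-frequency cluster from the sample output (cores 0 and 11; adjust the IDs for your system):

cd ~/mnn/MNN/build
taskset -c 0,11 ./llm_demo ~/mnn/Qwen2.5-Omni-7B-MNN/config.json ~/mnn/text_baseline_prompt.txt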
In this section, you ran a text-only baseline with the Omni model on Armv9. You confirmed that llm_demo loads config.json successfully, processes prompt files line by line, and returns non-empty text output without crashes.
In the next section, you’ll add an image input and validate the vision path of the Omni model on Armv9.