Now that you have validated ONNX Runtime with Python-based timing (for example, the SqueezeNet baseline test), you can move on to a dedicated benchmarking utility called `onnxruntime_perf_test`. This tool is designed for systematic performance evaluation of ONNX models, and it captures more detailed statistics than simple Python timing.
This approach helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances and compare results with other architectures if needed.
You are ready to run benchmarks, which is a key skill for optimizing real-world deployments.
The `onnxruntime_perf_test` tool is included in the ONNX Runtime source code. You can use it to measure the inference performance of ONNX models and to compare different execution providers (such as CPU or GPU). On Arm64 VMs, CPU execution is the focus.
Before building or running `onnxruntime_perf_test`, you need to install a set of development tools and libraries. These packages are required for compiling ONNX Runtime and for handling model serialization via Protocol Buffers:
```bash
sudo apt update
sudo apt install -y build-essential cmake git unzip pkg-config
sudo apt install -y protobuf-compiler libprotobuf-dev libprotoc-dev
```
Then verify protobuf installation:
```bash
protoc --version
```
You should see output similar to:
```output
libprotoc 3.21.12
```
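The build also depends on a working C++ toolchain and a recent version of CMake. As a quick sanity check (exact version numbers will vary with your Ubuntu release), you can confirm both are on your `PATH`:

```bash
# Confirm the compiler and CMake are installed and discoverable
gcc --version | head -n 1
cmake --version | head -n 1
```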
The benchmarking tool `onnxruntime_perf_test` is not available as a pre-built binary for any platform, so you need to build it from source. This process can take up to 40 minutes.
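Because the build is resource-intensive, it is worth confirming up front that the VM has enough cores and memory (at least 8 GB of RAM is recommended, as noted in the troubleshooting tips later in this section):

```bash
# Check available CPU cores and memory before starting a long parallel build
nproc
free -h
```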
Clone the ONNX Runtime repository:
```bash
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
```
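By default this builds the tip of the main branch. If you want reproducible results, you can check out a tagged release instead; the tag below is only an example, so list the available tags and pick the one you want:

```bash
# List the most recent release tags, newest first
git tag --sort=-creatordate | head -n 5

# Example only: check out a specific release tag and sync its submodules
git checkout v1.19.2
git submodule update --init --recursive
```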
Now, build the benchmark tool:
```bash
./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
```
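If the build fails with out-of-memory errors on a smaller VM, you can cap the number of parallel compile jobs. In recent ONNX Runtime versions, `--parallel` accepts an optional job count (check `./build.sh --help` if your version differs):

```bash
# Cap the build at 2 parallel jobs to reduce peak memory usage
./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel 2 --build --update --skip_tests
```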
If the build completes successfully, you should see the executable at:
```output
./build/Linux/Release/onnxruntime_perf_test
```
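You can verify that the binary was produced and print its usage text to see the full set of supported flags (the `-h` flag is assumed here; running the tool with no arguments also prints usage):

```bash
# Verify the benchmark binary exists and is executable
ls -lh ./build/Linux/Release/onnxruntime_perf_test

# Print the tool's usage text (lists all supported flags)
./build/Linux/Release/onnxruntime_perf_test -h
```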
Now that you have built the benchmarking tool, you can run inference benchmarks on the SqueezeNet INT8 model:
```bash
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I ../squeezenet-int8.onnx
```
Breakdown of the flags:

- `-e cpu`: use the CPU execution provider.
- `-r 100`: run 100 inference passes for statistical reliability.
- `-m times`: run in “repeat N times” mode for latency-focused measurement.
- `-s`: print detailed summary statistics, including the latency distribution, after the run.
- `-Z`: disable intra-op thread spinning, which reduces wasted CPU cycles between runs, especially on high-core-count systems like Cobalt 100.
- `-I`: generate tensor inputs directly from the model, skipping pre-generated test data; the final argument, `../squeezenet-int8.onnx`, is the path to your ONNX model file.

If you encounter build errors, check that you have enough memory (at least 8 GB is recommended) and that all dependencies are installed. For missing dependencies, review the installation steps above.
If the benchmark runs successfully, you are ready to analyze and optimize your ONNX model performance on Arm-based Azure infrastructure.
You should see output similar to:
```output
Disabling intra-op thread spinning between runs
Session creation time cost: 0.0102016 s
First inference time cost: 2 ms
Total inference time cost: 0.185739 s
Total inference requests: 100
Average inference time cost: 1.85739 ms
Total inference run time: 0.18581 s
Number of inferences per second: 538.184
Avg CPU usage: 96 %
Peak working set size: 36696064 bytes
Avg CPU usage:96
Peak working set size:36696064
Runs:100
Min Latency: 0.00183404 s
Max Latency: 0.00190312 s
P50 Latency: 0.00185674 s
P90 Latency: 0.00187215 s
P95 Latency: 0.00187393 s
P99 Latency: 0.00190312 s
P999 Latency: 0.00190312 s
```
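If you plan to repeat the benchmark (for example, across thread counts or VM sizes), it helps to capture each run's log and pull out the headline numbers. The sketch below assumes the output format shown above, which can vary between ONNX Runtime versions:

```bash
# Run the benchmark, keep the full log, and extract the headline metrics
LOG=perf_run.log
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z \
    -I ../squeezenet-int8.onnx | tee "$LOG"

# Pull average latency, throughput, and tail latency from the log
grep -E "Average inference time cost|Number of inferences per second|P99 Latency" "$LOG"
```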
Here is a summary of benchmark results collected on an Arm64 D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine.
| Metric | Value |
|---|---|
| Average Inference Time | 1.857 ms |
| Throughput | 538.18 inferences/sec |
| CPU Utilization | 96% |
| Peak Memory Usage | 36.70 MB |
| P50 Latency | 1.857 ms |
| P90 Latency | 1.872 ms |
| P95 Latency | 1.874 ms |
| P99 Latency | 1.903 ms |
| P999 Latency | 1.903 ms |
| Max Latency | 1.903 ms |
| Latency Consistency | Very tight (P50 to P99 within ~0.05 ms) |
These results demonstrate low-latency inference on Arm64 virtual machines, with a consistent average inference time of approximately 1.86 ms. Throughput remains strong and stable, sustaining over 538 inferences per second with the `squeezenet-int8.onnx` model on D4ps_v6 instances. The resource footprint is lightweight: peak memory usage stays below 37 MB and CPU utilization is around 96%, making this setup well suited to efficient edge or cloud inference. Performance is also consistent, with P50, P95, and maximum latency values tightly grouped, showing reliable results on Azure Cobalt 100 Arm-based infrastructure.
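To explore how performance scales across the Cobalt 100 cores, you can sweep the intra-op thread count. The sketch below uses the tool's `-x` (intra-op thread count) option; confirm the flag on your build with `-h` before relying on it:

```bash
# Sweep intra-op thread counts and record throughput for each run
for threads in 1 2 4; do
  echo "=== intra-op threads: $threads ==="
  ./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z \
      -x "$threads" -I ../squeezenet-int8.onnx | grep "Number of inferences per second"
done
```

On a 4-vCPU D4ps_v6 instance, thread counts of 1, 2, and 4 cover the useful range; larger instance types would justify a wider sweep.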
You have now successfully benchmarked inference time of ONNX models on an Azure Cobalt 100 Arm64 virtual machine.