Now that you’ve successfully deployed and executed the MobileNet V2 model on the Corstone-320 FVP, this section walks you through how to interpret the resulting performance data. This includes inference time, operator delegation, and hardware-level metrics from the Ethos-U NPU.
The .pte file produced in this example is mv2_arm_delegate_ethos-u85-128.pte.
In the examples below, /path/to/executorch represents the directory where you cloned your local copy of the ExecuTorch repo. Replace it with your actual path when running commands or reviewing output.
Ahead-of-Time Compiler Start:
__output__--------------------------------------------------------------------------------
__output__Running e2e flow for model 'mv2' with flags '--delegate --quantize --delegate --quantize --intermediates mv2_u85/ --debug --evaluate'
__output__--------------------------------------------------------------------------------
__output__CALL python3 -m examples.arm.aot_arm_compiler --model_name=mv2 --target=ethos-u85-128 --delegate --quantize --delegate --quantize --intermediates mv2_u85/ --debug --evaluate --intermediate=/path/to/executorch/mv2_u85 --output=/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only
.pte File Build Completion:
__output__PTE file saved as /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
__output__pte_data_size: 3809584 /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
__output__pte_file: /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
Ethos-U Delegate Build Start:
__output__+ backends/arm/scripts/build_executor_runner.sh --et_build_root=/path/to/executorch/arm_test --pte=/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte --build_type=Release --target=ethos-u85-128 --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only --extra_build_flags= --ethosu_tools_dir=/path/to/executorch/examples/arm/ethos-u-scratch
__output__--------------------------------------------------------------------------------
__output__Build Arm Baremetal executor_runner for ethos-u85-128 with /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte using Ethos_U85_SYS_DRAM_Mid Sram_Only to '/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128/cmake-out'
__output__--------------------------------------------------------------------------------
Ethos-U Delegate Build Completion:
__output__[100%] Built target arm_executor_runner
By default, run.sh (and the underlying ethos_u_runner) uses an input tensor of shape [1, 3, 224, 224] (batch size 1, 3 RGB channels, 224×224 image). At 1 byte per element, the input size is 1 × 3 × 224 × 224 × 1 byte = 150528 bytes = 147 KiB.
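The input-size arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `tensor_size_bytes` is hypothetical, and the 1 byte per element assumes the quantized 8-bit input used in this example.

```python
from math import prod

def tensor_size_bytes(shape, bytes_per_element=1):
    """Return the number of bytes needed to store a dense tensor."""
    return prod(shape) * bytes_per_element

# Input shape from the text: batch 1, 3 channels, 224x224 pixels.
input_shape = [1, 3, 224, 224]

# Assumption: 1 byte per element (8-bit quantized input).
size = tensor_size_bytes(input_shape)
print(f"{size} bytes = {size / 1024:.0f} KiB")
```

Passing `bytes_per_element=4` gives the size the same tensor would occupy in 32-bit floating point, which is a quick way to see how much memory quantization saves on the input alone.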
__output__Input SRAM bandwidth = 15.49 MB/batch
Batch Inference time gives you a single performance metric for comparing the Ethos-U85 with other Ethos-U NPUs:
__output__Batch Inference time 4.94 ms, 202.34 inferences/s (batch size 1)
Test Batch Performance:
__output__Network summary for out
__output__Accelerator configuration Ethos_U85_128
__output__System configuration Ethos_U85_SYS_DRAM_Mid
__output__Memory mode Sram_Only
__output__Accelerator clock 1000 MHz
__output__Design peak SRAM bandwidth 29.80 GB/s
__output__
__output__Total SRAM used 5178.77 KiB
__output__
__output__CPU operators = 0 (0.0%)
__output__NPU operators = 64 (100.0%)
__output__
__output__Average SRAM bandwidth 7.21 GB/s
__output__Input SRAM bandwidth 15.49 MB/batch
__output__Weight SRAM bandwidth 11.87 MB/batch
__output__Output SRAM bandwidth 6.66 MB/batch
__output__Total SRAM bandwidth 35.65 MB/batch
__output__Total SRAM bandwidth per input 35.65 MB/inference (batch size 1)
__output__
__output__Neural network macs 300836992 MACs/batch
__output__
__output__Info: The numbers below are internal compiler estimates.
__output__For performance numbers the compiled network should be run on an FVP Model or FPGA.
__output__
__output__Network Tops/s 0.12 Tops/s
__output__
__output__NPU cycles 4832315 cycles/batch
__output__SRAM Access cycles 1168037 cycles/batch
__output__DRAM Access cycles 0 cycles/batch
__output__On-chip Flash Access cycles 0 cycles/batch
__output__Off-chip Flash Access cycles 0 cycles/batch
__output__Total cycles 4942076 cycles/batch
__output__
__output__Batch Inference time 4.94 ms, 202.34 inferences/s (batch size 1)
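Several figures in the network summary can be cross-checked against each other. The sketch below recomputes the inference time, throughput, Tops/s, and average SRAM bandwidth from the total cycle count and the accelerator clock. The function name `derive_metrics` is hypothetical, and two conventions are assumptions rather than something the output states: one MAC is counted as two operations, and MB/GB are treated as decimal units.

```python
# Values copied from the network summary above.
TOTAL_CYCLES = 4_942_076        # "Total cycles" per batch
CLOCK_HZ = 1_000 * 1_000_000    # "Accelerator clock 1000 MHz"
MACS_PER_BATCH = 300_836_992    # "Neural network macs" per batch
TOTAL_SRAM_MB = 35.65           # "Total SRAM bandwidth" per batch

def derive_metrics(cycles, clock_hz, macs, sram_mb):
    """Recompute summary metrics from raw cycle and MAC counts."""
    seconds = cycles / clock_hz
    return {
        "inference_ms": seconds * 1e3,
        "inferences_per_s": 1.0 / seconds,
        # Assumption: one MAC counts as two operations (multiply + add).
        "tops": 2 * macs / seconds / 1e12,
        # Assumption: decimal units (1 GB = 1000 MB).
        "avg_sram_gb_per_s": sram_mb / 1e3 / seconds,
    }

metrics = derive_metrics(TOTAL_CYCLES, CLOCK_HZ, MACS_PER_BATCH, TOTAL_SRAM_MB)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the recomputed values round to the same figures the summary reports, which suggests the reported inference time is simply the estimated total cycles divided by the configured clock. As the output itself notes, these are compiler estimates, not measurements.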
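This output indicates which operators run on which processor. The occurrences_in_delegated_graphs column counts operators delegated to the Ethos-U NPU, and the occurrences_in_non_delegated_graphs column counts operators that remain on the CPU: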
__output__Total delegated subgraphs: 1
__output__Number of delegated nodes: 419
__output__Number of non-delegated nodes: 3
__output__
__output__Delegation table:
__output__╒════╤════════════════════════════════════════════════════╤═══════════════════════════════════╤═══════════════════════════════════════╕
__output__│ │ op_type │ occurrences_in_delegated_graphs │ occurrences_in_non_delegated_graphs │
__output__╞════╪════════════════════════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════════╡
__output__│ 0 │ aten_add_tensor │ 10 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 1 │ aten_clone_default │ 1 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 2 │ aten_convolution_default │ 52 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 3 │ aten_hardtanh_default │ 35 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 4 │ aten_linear_default │ 1 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 5 │ aten_mean_dim │ 1 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 6 │ aten_view_copy_default │ 1 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 7 │ cortex_m_dequantize_per_tensor_default │ 0 │ 1 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 8 │ cortex_m_quantize_per_tensor_default │ 0 │ 1 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 9 │ getitem │ 0 │ 1 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 10 │ quantized_decomposed_dequantize_per_tensor_default │ 217 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 11 │ quantized_decomposed_quantize_per_tensor_default │ 101 │ 0 │
__output__├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
__output__│ 12 │ Total │ 419 │ 3 │
__output__╘════╧════════════════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
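To summarize the delegation table as a single number, you can compute the share of graph nodes that were delegated. The sketch below hard-codes the counts from the table above; the dictionary and variable names are illustrative only and are not part of any ExecuTorch API.

```python
# Counts copied from the delegation table above.
delegated = {
    "aten_add_tensor": 10,
    "aten_clone_default": 1,
    "aten_convolution_default": 52,
    "aten_hardtanh_default": 35,
    "aten_linear_default": 1,
    "aten_mean_dim": 1,
    "aten_view_copy_default": 1,
    "quantized_decomposed_dequantize_per_tensor_default": 217,
    "quantized_decomposed_quantize_per_tensor_default": 101,
}
non_delegated = {
    "cortex_m_dequantize_per_tensor_default": 1,
    "cortex_m_quantize_per_tensor_default": 1,
    "getitem": 1,
}

total_delegated = sum(delegated.values())
total_non_delegated = sum(non_delegated.values())
delegated_pct = 100 * total_delegated / (total_delegated + total_non_delegated)

print(f"Delegated nodes:     {total_delegated}")
print(f"Non-delegated nodes: {total_non_delegated}")
print(f"Delegated share:     {delegated_pct:.2f}%")
```

The only non-delegated nodes here are the quantize and dequantize steps at the graph boundary plus a getitem, so the whole network body runs in a single delegated subgraph.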
This output shows the Ethos-U performance counters reported by the Performance Monitoring Unit (PMU):
__output__I [executorch:arm_perf_monitor.cpp:180] Ethos-U PMU report:
__output__I [executorch:arm_perf_monitor.cpp:181] ethosu_pmu_cycle_cntr : 4738932
__output__I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr0 : 1447178
__output__I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr1 : 420661
__output__I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr2 : 0
__output__I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr3 : 0
__output__I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr4 : 130
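If you capture the runner log to a file, the PMU report can be parsed and compared with the compiler estimate. The following is a minimal sketch: `parse_pmu_report` is a hypothetical helper, the regular expression assumes the log line format shown above, and the interpretation of `ethosu_pmu_cntr4` as idle cycles assumes the default event mapping.

```python
import re

# Matches lines such as "... ethosu_pmu_cycle_cntr : 4738932".
PMU_LINE = re.compile(r"(ethosu_pmu_\w+)\s*:\s*(\d+)")

def parse_pmu_report(log_text):
    """Return a dict mapping PMU counter names to integer values."""
    return {name: int(value) for name, value in PMU_LINE.findall(log_text)}

log_text = """\
I [executorch:arm_perf_monitor.cpp:180] Ethos-U PMU report:
I [executorch:arm_perf_monitor.cpp:181] ethosu_pmu_cycle_cntr : 4738932
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr0 : 1447178
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr1 : 420661
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr2 : 0
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr3 : 0
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr4 : 130
"""

counters = parse_pmu_report(log_text)
cycles = counters["ethosu_pmu_cycle_cntr"]

# Assumption: cntr4 tracks idle cycles (default event mapping).
idle_pct = 100 * counters["ethosu_pmu_cntr4"] / cycles

# "Total cycles" from the compiler's network summary (an estimate).
ESTIMATED_CYCLES = 4_942_076
diff_pct = 100 * (cycles - ESTIMATED_CYCLES) / ESTIMATED_CYCLES

print(f"Measured NPU cycles: {cycles}")
print(f"Idle share:          {idle_pct:.4f}%")
print(f"Measured vs. estimated cycles: {diff_pct:+.2f}%")
```

In this run the external-memory counters are zero, which is consistent with the Sram_Only memory mode, and the measured cycle count is within a few percent of the compiler's estimate.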
Table of Ethos-U PMU Counters:
PMU Counter | Default Event Tracked | Description | Interpretation |
---|---|---|---|
ethosu_pmu_cycle_cntr | Total NPU cycles | Counts the number of core clock cycles where the Ethos-U NPU was executing work. | High value = long runtime; use to compute throughput. |
ethosu_pmu_cntr0 | SRAM read data beats received (ETHOSU_PMU_SRAM_RD_DATA_BEAT_RECEIVED) | How many data beats (e.g., 64-bit words) the NPU read from local SRAM. | Indicates input + weight loading efficiency. |
ethosu_pmu_cntr1 | SRAM write data beats written (ETHOSU_PMU_SRAM_WR_DATA_BEAT_WRITTEN) | Number of data beats the NPU wrote back to SRAM (e.g., outputs or intermediate results). | Reflects output bandwidth usage. |
ethosu_pmu_cntr2 | External DRAM read beats (ETHOSU_PMU_EXT_RD_DATA_BEAT_RECEIVED) | Number of data beats read from off-chip memory (e.g., DRAM). Often 0 if Sram_Only is used. | If non-zero, may indicate cache misses or large model size. |
ethosu_pmu_cntr3 | External DRAM write beats (ETHOSU_PMU_EXT_WR_DATA_BEAT_WRITTEN) | Number of write data beats to external memory. | Helps detect offloading or insufficient SRAM. |
ethosu_pmu_cntr4 | Idle cycles (ETHOSU_PMU_NPU_IDLE) | Number of cycles where the NPU had no work scheduled (i.e., idle). | High idle count = possible pipeline stalls or bad scheduling. |
In this Learning Path, you have learned how to deploy a MobileNet V2 model using ExecuTorch on Arm’s Corstone-320 FVP. You’re now ready to apply what you’ve learned to other models and configurations using ExecuTorch.