Interpreting the results

Now that you’ve successfully deployed and executed the MobileNet V2 model on the Corstone-320 FVP, this section walks you through how to interpret the resulting performance data. This includes inference time, operator delegation, and hardware-level metrics from the Ethos-U NPU.

Observe Ahead-of-Time Compilation

  • The following output from run.sh confirms that Ahead-of-Time (AOT) compilation was successful.
  • Specifically, confirm that the original PyTorch model was compiled into an ExecuTorch .pte file.
  • For the MobileNet V2 example, the compiled ExecuTorch file is written as mv2_arm_delegate_ethos-u85-128.pte.
Note

In the examples below, /path/to/executorch represents the directory where you cloned your local copy of the ExecuTorch repository. Replace it with your actual path when running commands or reviewing output.

Ahead-of-Time Compiler Start:

    --------------------------------------------------------------------------------
    Running e2e flow for model 'mv2' with flags '--delegate --quantize --delegate --quantize --intermediates mv2_u85/ --debug --evaluate'
    --------------------------------------------------------------------------------
    CALL python3 -m examples.arm.aot_arm_compiler --model_name=mv2 --target=ethos-u85-128 --delegate --quantize --delegate --quantize --intermediates mv2_u85/ --debug --evaluate --intermediate=/path/to/executorch/mv2_u85 --output=/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only

.pte File Build Completion:

    PTE file saved as /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
    pte_data_size:  3809584 /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
    pte_file: /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
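The logged pte_data_size is close to what you would expect for an int8-quantized MobileNet V2. The sketch below is a rough sanity check; the 3.5 M parameter count is an approximation and is not taken from the log:

```python
# Rough sanity check on the .pte size reported above.
params = 3_500_000        # MobileNet V2 has roughly 3.5 M parameters (approximate)

fp32_bytes = params * 4   # original float32 weights
int8_bytes = params * 1   # weights after int8 quantization

pte_bytes = 3_809_584     # pte_data_size from the log above

print(f"fp32 estimate: {fp32_bytes / 2**20:.1f} MiB")  # ~13.4 MiB
print(f"int8 estimate: {int8_bytes / 2**20:.1f} MiB")  # ~3.3 MiB
print(f"actual .pte:   {pte_bytes / 2**20:.1f} MiB")   # ~3.6 MiB
```

The .pte lands near the int8 estimate, which confirms the weights were quantized; the remainder is graph structure and delegate metadata.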

Ethos-U Delegate Build Start:

    + backends/arm/scripts/build_executor_runner.sh --et_build_root=/path/to/executorch/arm_test --pte=/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte --build_type=Release --target=ethos-u85-128 --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only --extra_build_flags= --ethosu_tools_dir=/path/to/executorch/examples/arm/ethos-u-scratch
    --------------------------------------------------------------------------------
    Build Arm Baremetal executor_runner for ethos-u85-128 with /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte using Ethos_U85_SYS_DRAM_Mid Sram_Only  to '/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128/cmake-out'
    --------------------------------------------------------------------------------

Ethos-U Delegate Build Completion:

    [100%] Built target arm_executor_runner

Observe Test Batch Performance

By default, run.sh (and the underlying arm_executor_runner) uses:

  • A constant input tensor, typically filled with zeros, ones, or other synthetic data.
  • An input shape matching MobileNet V2: [1, 3, 224, 224] (batch size 1, 3 RGB channels, 224×224 image).
  • An input tensor size of 1 × 3 × 224 × 224 × 1 byte = 150528 bytes ≈ 147 KB (1 byte per element because the model is quantized to int8).
        Input   SRAM bandwidth = 15.49 MB/batch
  • Batch Inference time gives you a single performance metric for the Ethos-U85 that you can compare against other Ethos-U NPUs.
        Batch Inference time                 4.94 ms,  202.34 inferences/s (batch size 1)
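The input-size arithmetic above can be checked with a few lines of Python:

```python
import math

shape = (1, 3, 224, 224)   # batch, channels, height, width
bytes_per_element = 1      # int8 after quantization

size_bytes = math.prod(shape) * bytes_per_element
print(size_bytes)                     # 150528
print(f"{size_bytes / 1024:.0f} KB")  # 147 KB
```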

Test Batch Performance:

    Network summary for out
    Accelerator configuration               Ethos_U85_128
    System configuration             Ethos_U85_SYS_DRAM_Mid
    Memory mode                                 Sram_Only
    Accelerator clock                                1000 MHz
    Design peak SRAM bandwidth                      29.80 GB/s

    Total SRAM used                               5178.77 KiB

    CPU operators = 0 (0.0%)
    NPU operators = 64 (100.0%)

    Average SRAM bandwidth                           7.21 GB/s
    Input   SRAM bandwidth                          15.49 MB/batch
    Weight  SRAM bandwidth                          11.87 MB/batch
    Output  SRAM bandwidth                           6.66 MB/batch
    Total   SRAM bandwidth                          35.65 MB/batch
    Total   SRAM bandwidth            per input     35.65 MB/inference (batch size 1)

    Neural network macs                         300836992 MACs/batch

    Info: The numbers below are internal compiler estimates.
    For performance numbers the compiled network should be run on an FVP Model or FPGA.

    Network Tops/s                                   0.12 Tops/s

    NPU cycles                                    4832315 cycles/batch
    SRAM Access cycles                            1168037 cycles/batch
    DRAM Access cycles                                  0 cycles/batch
    On-chip Flash Access cycles                         0 cycles/batch
    Off-chip Flash Access cycles                        0 cycles/batch
    Total cycles                                  4942076 cycles/batch

    Batch Inference time                 4.94 ms,  202.34 inferences/s (batch size 1)
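The reported inference time follows directly from the total cycle count and the 1000 MHz accelerator clock, and the throughput figure in Tops/s can be reproduced the same way (assuming the conventional 2 operations per MAC):

```python
total_cycles = 4_942_076        # Total cycles (cycles/batch) from the summary
clock_hz = 1_000_000_000        # Accelerator clock: 1000 MHz
macs_per_batch = 300_836_992    # Neural network macs from the summary

latency_s = total_cycles / clock_hz
inferences_per_s = 1.0 / latency_s
tops = macs_per_batch * 2 / latency_s / 1e12  # 2 ops (multiply + add) per MAC

print(f"{latency_s * 1e3:.2f} ms")             # 4.94 ms
print(f"{inferences_per_s:.2f} inferences/s")  # 202.34 inferences/s
print(f"{tops:.2f} Tops/s")                    # 0.12 Tops/s
```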

Observe Operator Delegation

This output shows which operators are delegated to which processor:

  • Ethos-U85 NPU: occurrences_in_delegated_graphs
  • Cortex-M85 CPU: occurrences_in_non_delegated_graphs

    Total delegated subgraphs: 1
    Number of delegated nodes: 419
    Number of non-delegated nodes: 3

    Delegation table:
    ╒════╤════════════════════════════════════════════════════╤═══════════════════════════════════╤═══════════════════════════════════════╕
    │    │ op_type                                            │   occurrences_in_delegated_graphs │   occurrences_in_non_delegated_graphs │
    ╞════╪════════════════════════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════════╡
    │  0 │ aten_add_tensor                                    │                                10 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  1 │ aten_clone_default                                 │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  2 │ aten_convolution_default                           │                                52 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  3 │ aten_hardtanh_default                              │                                35 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  4 │ aten_linear_default                                │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  5 │ aten_mean_dim                                      │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  6 │ aten_view_copy_default                             │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  7 │ cortex_m_dequantize_per_tensor_default             │                                 0 │                                     1 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  8 │ cortex_m_quantize_per_tensor_default               │                                 0 │                                     1 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  9 │ getitem                                            │                                 0 │                                     1 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │ 10 │ quantized_decomposed_dequantize_per_tensor_default │                               217 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │ 11 │ quantized_decomposed_quantize_per_tensor_default   │                               101 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │ 12 │ Total                                              │                               419 │                                     3 │
    ╘════╧════════════════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
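A quick way to summarize the delegation table is the fraction of nodes that run on the NPU; the counts below are copied from the table above:

```python
delegated = 419      # Number of delegated nodes (run on the Ethos-U85 NPU)
non_delegated = 3    # Number of non-delegated nodes (run on the Cortex-M85 CPU)

total = delegated + non_delegated
pct = 100 * delegated / total
print(f"{pct:.1f}% of nodes delegated to the NPU")  # 99.3%
```

Here only the boundary quantize/dequantize operators and a getitem remain on the CPU, which is the ideal outcome for this model.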

Observe the Ethos-U Performance Monitoring Unit

This output shows Ethos-U NPU hardware counters reported by the Performance Monitoring Unit (PMU).

    I [executorch:arm_perf_monitor.cpp:180] Ethos-U PMU report:
    I [executorch:arm_perf_monitor.cpp:181] ethosu_pmu_cycle_cntr : 4738932
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr0 : 1447178
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr1 : 420661
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr2 : 0
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr3 : 0
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr4 : 130

Table of Ethos-U PMU Counters:

| PMU Counter | Default Event Tracked | Description | Interpretation |
|---|---|---|---|
| ethosu_pmu_cycle_cntr | Total NPU cycles | Counts the number of core clock cycles where the Ethos-U NPU was executing work. | High value = long runtime; use to compute throughput. |
| ethosu_pmu_cntr0 | SRAM read data beats received (ETHOSU_PMU_SRAM_RD_DATA_BEAT_RECEIVED) | How many data beats (e.g., 64-bit words) the NPU read from local SRAM. | Indicates input and weight loading efficiency. |
| ethosu_pmu_cntr1 | SRAM write data beats written (ETHOSU_PMU_SRAM_WR_DATA_BEAT_WRITTEN) | Number of data beats the NPU wrote back to SRAM (e.g., outputs or intermediate results). | Reflects output bandwidth usage. |
| ethosu_pmu_cntr2 | External DRAM read beats (ETHOSU_PMU_EXT_RD_DATA_BEAT_RECEIVED) | Number of data beats read from off-chip memory (e.g., DRAM); often 0 when Sram_Only is used. | If non-zero, may indicate cache misses or a model too large for SRAM. |
| ethosu_pmu_cntr3 | External DRAM write beats (ETHOSU_PMU_EXT_WR_DATA_BEAT_WRITTEN) | Number of write data beats to external memory. | Helps detect offloading due to insufficient SRAM. |
| ethosu_pmu_cntr4 | Idle cycles (ETHOSU_PMU_NPU_IDLE) | Number of cycles where the NPU had no work scheduled (i.e., idle). | High idle count = possible pipeline stalls or poor scheduling. |
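The raw beat and cycle counters can be converted into traffic and idle figures. The sketch below assumes one data beat is 8 bytes (a 64-bit interface); that beat width is an assumption, so check it against your system's AXI configuration before relying on the absolute numbers:

```python
cycle_cntr = 4_738_932      # ethosu_pmu_cycle_cntr from the PMU report
sram_rd_beats = 1_447_178   # ethosu_pmu_cntr0 (SRAM read beats)
sram_wr_beats = 420_661     # ethosu_pmu_cntr1 (SRAM write beats)
idle_cycles = 130           # ethosu_pmu_cntr4 (idle cycles)

beat_bytes = 8  # ASSUMPTION: 64-bit (8-byte) data beats

rd_mb = sram_rd_beats * beat_bytes / 1e6
wr_mb = sram_wr_beats * beat_bytes / 1e6
idle_pct = 100 * idle_cycles / cycle_cntr

print(f"SRAM read:  {rd_mb:.2f} MB/inference")   # 11.58 MB
print(f"SRAM write: {wr_mb:.2f} MB/inference")   # 3.37 MB
print(f"NPU idle:   {idle_pct:.4f}% of cycles")  # 0.0027%
```

An idle fraction this small means the NPU was kept busy for essentially the entire inference, so there is little to gain from scheduling changes for this model.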

Review

In this Learning Path, you deployed a MobileNet V2 model using ExecuTorch on Arm's Corstone-320 FVP and learned how to interpret its compilation, delegation, and performance output. You're now ready to apply the same workflow to other models and configurations using ExecuTorch.
