Interpreting the results

Now that you’ve successfully deployed and executed the MobileNet V2 model on the Corstone-320 FVP, this section walks you through how to interpret the resulting performance data. This includes inference time, operator delegation, and hardware-level metrics from the Ethos-U NPU.

Observe Ahead-of-Time Compilation

  • The following output from run.sh confirms that Ahead-of-Time (AOT) compilation was successful.
  • Specifically, confirm that the original PyTorch model was compiled into an ExecuTorch .pte file.
  • For the MobileNet V2 example, the compiled ExecuTorch file is written as mv2_arm_delegate_ethos-u85-128.pte.
Note

In the examples below, /path/to/executorch represents the directory where you cloned your local copy of the ExecuTorch repository. Replace it with your actual path when running commands or reviewing output.

Ahead-of-Time Compiler Start:

    --------------------------------------------------------------------------------
    Running e2e flow for model 'mv2' with flags '--delegate --quantize --delegate --quantize --intermediates mv2_u85/ --debug --evaluate'
    --------------------------------------------------------------------------------
    CALL python3 -m examples.arm.aot_arm_compiler --model_name=mv2 --target=ethos-u85-128 --delegate --quantize --delegate --quantize --intermediates mv2_u85/ --debug --evaluate --intermediate=/path/to/executorch/mv2_u85 --output=/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only

.pte File Build Completion:

    PTE file saved as /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
    pte_data_size:  3809584 /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
    pte_file: /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte
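The logged pte_data_size is close to what you would expect for an int8-quantized MobileNet V2. The sketch below is a rough sanity check; the 3.5 M parameter count is an approximation and is not taken from the log:

```python
# Rough sanity check on the .pte size reported above.
params = 3_500_000        # MobileNet V2 has roughly 3.5 M parameters (approximate)

fp32_bytes = params * 4   # original float32 weights
int8_bytes = params * 1   # weights after int8 quantization

pte_bytes = 3_809_584     # pte_data_size from the log above

print(f"fp32 estimate: {fp32_bytes / 2**20:.1f} MiB")  # ~13.4 MiB
print(f"int8 estimate: {int8_bytes / 2**20:.1f} MiB")  # ~3.3 MiB
print(f"actual .pte:   {pte_bytes / 2**20:.1f} MiB")   # ~3.6 MiB
```

The .pte lands near the int8 estimate, which confirms the weights were quantized; the remainder is graph structure and delegate metadata.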

Ethos-U Delegate Build Start:

    + backends/arm/scripts/build_executor_runner.sh --et_build_root=/path/to/executorch/arm_test --pte=/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte --build_type=Release --target=ethos-u85-128 --system_config=Ethos_U85_SYS_DRAM_Mid --memory_mode=Sram_Only --extra_build_flags= --ethosu_tools_dir=/path/to/executorch/examples/arm/ethos-u-scratch
    --------------------------------------------------------------------------------
    Build Arm Baremetal executor_runner for ethos-u85-128 with /path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128.pte using Ethos_U85_SYS_DRAM_Mid Sram_Only  to '/path/to/executorch/mv2_u85/mv2_arm_delegate_ethos-u85-128/cmake-out'
    --------------------------------------------------------------------------------

Ethos-U Delegate Build Completion:

    [100%] Built target arm_executor_runner

Observe Test Batch Performance

By default, run.sh (and the underlying arm_executor_runner) uses:

  • A constant input tensor, typically filled with zeros, ones, or other synthetic data.
  • An input shape matching MobileNet V2: [1, 3, 224, 224] (batch size 1, 3 RGB channels, 224×224 image).
  • An input tensor size of 1 × 3 × 224 × 224 × 1 byte = 150528 bytes ≈ 147 KB (1 byte per element because the model is quantized to int8).
        Input   SRAM bandwidth = 15.49 MB/batch
  • Batch Inference time gives you a single performance metric for the Ethos-U85 that you can compare against other Ethos-U NPUs.
        Batch Inference time                 4.94 ms,  202.34 inferences/s (batch size 1)
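The input-size arithmetic above can be checked with a few lines of Python:

```python
import math

shape = (1, 3, 224, 224)   # batch, channels, height, width
bytes_per_element = 1      # int8 after quantization

size_bytes = math.prod(shape) * bytes_per_element
print(size_bytes)                     # 150528
print(f"{size_bytes / 1024:.0f} KB")  # 147 KB
```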

Test Batch Performance:

    Network summary for out
    Accelerator configuration               Ethos_U85_128
    System configuration             Ethos_U85_SYS_DRAM_Mid
    Memory mode                                 Sram_Only
    Accelerator clock                                1000 MHz
    Design peak SRAM bandwidth                      29.80 GB/s

    Total SRAM used                               5178.77 KiB

    CPU operators = 0 (0.0%)
    NPU operators = 64 (100.0%)

    Average SRAM bandwidth                           7.21 GB/s
    Input   SRAM bandwidth                          15.49 MB/batch
    Weight  SRAM bandwidth                          11.87 MB/batch
    Output  SRAM bandwidth                           6.66 MB/batch
    Total   SRAM bandwidth                          35.65 MB/batch
    Total   SRAM bandwidth            per input     35.65 MB/inference (batch size 1)

    Neural network macs                         300836992 MACs/batch

    Info: The numbers below are internal compiler estimates.
    For performance numbers the compiled network should be run on an FVP Model or FPGA.

    Network Tops/s                                   0.12 Tops/s

    NPU cycles                                    4832315 cycles/batch
    SRAM Access cycles                            1168037 cycles/batch
    DRAM Access cycles                                  0 cycles/batch
    On-chip Flash Access cycles                         0 cycles/batch
    Off-chip Flash Access cycles                        0 cycles/batch
    Total cycles                                  4942076 cycles/batch

    Batch Inference time                 4.94 ms,  202.34 inferences/s (batch size 1)
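The reported inference time follows directly from the total cycle count and the 1000 MHz accelerator clock, and the throughput figure in Tops/s can be reproduced the same way (assuming the conventional 2 operations per MAC):

```python
total_cycles = 4_942_076        # Total cycles (cycles/batch) from the summary
clock_hz = 1_000_000_000        # Accelerator clock: 1000 MHz
macs_per_batch = 300_836_992    # Neural network macs from the summary

latency_s = total_cycles / clock_hz
inferences_per_s = 1.0 / latency_s
tops = macs_per_batch * 2 / latency_s / 1e12  # 2 ops (multiply + add) per MAC

print(f"{latency_s * 1e3:.2f} ms")             # 4.94 ms
print(f"{inferences_per_s:.2f} inferences/s")  # 202.34 inferences/s
print(f"{tops:.2f} Tops/s")                    # 0.12 Tops/s
```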

Observe Operator Delegation

This output shows which operators are delegated to which processor:

  • Ethos-U85 NPU: occurrences_in_delegated_graphs
  • Cortex-M85 CPU: occurrences_in_non_delegated_graphs

    Total delegated subgraphs: 1
    Number of delegated nodes: 419
    Number of non-delegated nodes: 3

    Delegation table:
    ╒════╤════════════════════════════════════════════════════╤═══════════════════════════════════╤═══════════════════════════════════════╕
    │    │ op_type                                            │   occurrences_in_delegated_graphs │   occurrences_in_non_delegated_graphs │
    ╞════╪════════════════════════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════════╡
    │  0 │ aten_add_tensor                                    │                                10 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  1 │ aten_clone_default                                 │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  2 │ aten_convolution_default                           │                                52 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  3 │ aten_hardtanh_default                              │                                35 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  4 │ aten_linear_default                                │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  5 │ aten_mean_dim                                      │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  6 │ aten_view_copy_default                             │                                 1 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  7 │ cortex_m_dequantize_per_tensor_default             │                                 0 │                                     1 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  8 │ cortex_m_quantize_per_tensor_default               │                                 0 │                                     1 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │  9 │ getitem                                            │                                 0 │                                     1 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │ 10 │ quantized_decomposed_dequantize_per_tensor_default │                               217 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │ 11 │ quantized_decomposed_quantize_per_tensor_default   │                               101 │                                     0 │
    ├────┼────────────────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
    │ 12 │ Total                                              │                               419 │                                     3 │
    ╘════╧════════════════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
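A quick way to summarize the delegation table is the fraction of nodes that run on the NPU; the counts below are copied from the table above:

```python
delegated = 419      # Number of delegated nodes (run on the Ethos-U85 NPU)
non_delegated = 3    # Number of non-delegated nodes (run on the Cortex-M85 CPU)

total = delegated + non_delegated
pct = 100 * delegated / total
print(f"{pct:.1f}% of nodes delegated to the NPU")  # 99.3%
```

Here only the boundary quantize/dequantize operators and a getitem remain on the CPU, which is the ideal outcome for this model.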

Observe the Ethos-U Performance Monitoring Unit

This output shows Ethos-U NPU hardware counters reported by the Performance Monitoring Unit (PMU).

    I [executorch:arm_perf_monitor.cpp:180] Ethos-U PMU report:
    I [executorch:arm_perf_monitor.cpp:181] ethosu_pmu_cycle_cntr : 4738932
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr0 : 1447178
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr1 : 420661
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr2 : 0
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr3 : 0
    I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr4 : 130

Table of Ethos-U PMU Counters:

| PMU Counter | Default Event Tracked | Description | Interpretation |
|---|---|---|---|
| ethosu_pmu_cycle_cntr | Total NPU cycles | Counts the number of core clock cycles where the Ethos-U NPU was executing work. | High value = long runtime; use to compute throughput. |
| ethosu_pmu_cntr0 | SRAM read data beats received (ETHOSU_PMU_SRAM_RD_DATA_BEAT_RECEIVED) | How many data beats (e.g., 64-bit words) the NPU read from local SRAM. | Indicates input and weight loading efficiency. |
| ethosu_pmu_cntr1 | SRAM write data beats written (ETHOSU_PMU_SRAM_WR_DATA_BEAT_WRITTEN) | Number of data beats the NPU wrote back to SRAM (e.g., outputs or intermediate results). | Reflects output bandwidth usage. |
| ethosu_pmu_cntr2 | External DRAM read beats (ETHOSU_PMU_EXT_RD_DATA_BEAT_RECEIVED) | Number of data beats read from off-chip memory (e.g., DRAM); often 0 when Sram_Only is used. | If non-zero, may indicate cache misses or a model too large for SRAM. |
| ethosu_pmu_cntr3 | External DRAM write beats (ETHOSU_PMU_EXT_WR_DATA_BEAT_WRITTEN) | Number of write data beats to external memory. | Helps detect offloading due to insufficient SRAM. |
| ethosu_pmu_cntr4 | Idle cycles (ETHOSU_PMU_NPU_IDLE) | Number of cycles where the NPU had no work scheduled (i.e., idle). | High idle count = possible pipeline stalls or poor scheduling. |
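The raw beat and cycle counters can be converted into traffic and idle figures. The sketch below assumes one data beat is 8 bytes (a 64-bit interface); that beat width is an assumption, so check it against your system's AXI configuration before relying on the absolute numbers:

```python
cycle_cntr = 4_738_932      # ethosu_pmu_cycle_cntr from the PMU report
sram_rd_beats = 1_447_178   # ethosu_pmu_cntr0 (SRAM read beats)
sram_wr_beats = 420_661     # ethosu_pmu_cntr1 (SRAM write beats)
idle_cycles = 130           # ethosu_pmu_cntr4 (idle cycles)

beat_bytes = 8  # ASSUMPTION: 64-bit (8-byte) data beats

rd_mb = sram_rd_beats * beat_bytes / 1e6
wr_mb = sram_wr_beats * beat_bytes / 1e6
idle_pct = 100 * idle_cycles / cycle_cntr

print(f"SRAM read:  {rd_mb:.2f} MB/inference")   # 11.58 MB
print(f"SRAM write: {wr_mb:.2f} MB/inference")   # 3.37 MB
print(f"NPU idle:   {idle_pct:.4f}% of cycles")  # 0.0027%
```

An idle fraction this small means the NPU was kept busy for essentially the entire inference, so there is little to gain from scheduling changes for this model.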

Review

In this Learning Path, you deployed a MobileNet V2 model using ExecuTorch on Arm's Corstone-320 FVP and learned how to interpret its compilation, delegation, and performance output. You're now ready to apply the same workflow to other models and configurations using ExecuTorch.
