Run model and generate the ETDump

Benchmark a KleidiAI micro-kernel in ExecuTorch

Copy artifacts to your Arm64 target

From your x86_64 host (where you cross-compiled), create the target directory on the Arm device, then copy the runner and exported models into it:

ssh <arm_user>@<arm_host> "mkdir -p ~/bench"
scp $WORKSPACE/build-arm64/executor_runner <arm_user>@<arm_host>:~/bench/
scp -r model/ <arm_user>@<arm_host>:~/bench/
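
A truncated or stale copy can produce confusing load errors later, so it may be worth confirming that the files arrived intact. The sketch below is one possible way to do that, assuming GNU `sha256sum`, `find`, `sort`, and `xargs` are available on both machines. The helper names and the `MANIFEST.sha256` file name are hypothetical and are not part of ExecuTorch.

```shell
# Sketch: record checksums on the host, then verify them on the target.
# MANIFEST.sha256 is a hypothetical file name chosen for this example.

# Write a checksum manifest covering every file under the given directory.
write_manifest() {
  ( cd "$1" && find . -type f ! -name MANIFEST.sha256 -print0 \
      | sort -z | xargs -0 -r sha256sum > MANIFEST.sha256 )
}

# Re-check the files in the given directory against its manifest.
# Returns non-zero if any file is missing or has changed.
verify_manifest() {
  ( cd "$1" && sha256sum -c MANIFEST.sha256 )
}
```

For example, run `write_manifest model` on the host before the `scp -r` step, then run `verify_manifest ~/bench/model` on the Arm device after defining the same function there.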

Run a model and emit ETDump

Use one of the models you exported earlier (e.g., FP32 linear: linear_model_pf32_gemm.pte). The flags below tell executor_runner which model to load (-model_path), where to write the ETDump (-etdump_path), how many times to execute the model (-num_executions), and how many CPU threads to use (-cpu_threads).

cd ~/bench
./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1

To benchmark other configurations, change -cpu_threads (the number of execution threads) and -num_executions (the number of times the model is invoked).
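
If you want to compare several thread counts, a small loop can save retyping the command. The following is a minimal sketch that reuses only the flags from the command above. The `RUNNER` variable, the `run_thread_sweep` helper, and the `_t<N>.etdump` naming scheme are assumptions made for this example.

```shell
# Sketch: run the same model at several thread counts, one ETDump per run.
# RUNNER and the "_t<N>.etdump" naming are assumptions for this example.
RUNNER="${RUNNER:-./executor_runner}"

run_thread_sweep() {
  model="$1"   # path to the .pte file
  runs="$2"    # value passed to -num_executions
  shift 2      # remaining arguments are thread counts
  for threads in "$@"; do
    "$RUNNER" \
      -etdump_path "${model%.pte}_t${threads}.etdump" \
      -model_path "$model" \
      -num_executions="$runs" \
      -cpu_threads "$threads" || return 1
  done
}
```

For example, `run_thread_sweep model/linear_model_pf32_gemm.pte 10 1 2 4` would run the model ten times each at 1, 2, and 4 threads. Writing a separate ETDump per thread count keeps the runs from overwriting one another.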

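You should see logs like the following. The model and ETDump file names in the log reflect the paths you passed on the command line, so they may differ from this sample:

D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace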
D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0
D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED
I 00:00:00.035506 executorch:executor_runner.cpp:157] Resetting threadpool with num threads = 1
I 00:00:00.048120 executorch:threadpool.cpp:48] Resetting threadpool to 1 threads.
I 00:00:00.051509 executorch:executor_runner.cpp:218] Model file model/linear_model_f32.pte is loaded.
I 00:00:00.051531 executorch:executor_runner.cpp:227] Using method forward
I 00:00:00.051541 executorch:executor_runner.cpp:278] Setting up planned buffer 0, size 2112.
D 00:00:00.051630 executorch:method.cpp:793] Loading method: forward.
....

D 00:00:00.091432 executorch:XNNExecutor.cpp:236] Resizing output tensor to a new shape
I 00:00:00.091459 executorch:executor_runner.cpp:340] Model executed successfully 1 time(s) in 2.904883 ms.
I 00:00:00.091477 executorch:executor_runner.cpp:349] 1 outputs:
OutputX 0: tensor(sizes=[1, 256], [
  0.0106399, 0.0951964, 1.04854, -0.290168, -0.278126, -0.355151, 0.0583736, -0.431953, -0.0773305, -0.32844,
  ...,
  0.553568, -0.0339369, 0.562088, -1.21021, -0.769254, 0.677771, -0.264338, 1.05453, 0.724467, 0.53182,
])
I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'.
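
When you run more than one execution, it can be handy to pull the timing line out of the log. The sketch below assumes the "Model executed successfully N time(s) in T ms." format shown in the sample above, and assumes that the reported time covers all N executions. The `summarize_latency` helper is a hypothetical name.

```shell
# Sketch: print the run count, total time, and per-execution average
# from runner logs read on standard input.
# Assumes the "Model executed successfully N time(s) in T ms." line format.
summarize_latency() {
  sed -n 's/.*Model executed successfully \([0-9][0-9]*\) time(s) in \([0-9.][0-9.]*\) ms\..*/\1 \2/p' \
    | awk '{ printf "runs=%d total_ms=%s avg_ms=%.6f\n", $1, $2, $2 / $1 }'
}
```

One way to use it is to append `2>&1 | summarize_latency` to the executor_runner command, so that log lines are captured whether they go to standard output or standard error.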

If execution succeeds, the runner writes the ETDump to the path given by -etdump_path, which in this example is next to your model. In the next section, you will load the .etdump and analyze which operators dispatched to KleidiAI and how each micro-kernel performed.
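
Before moving on, you may want to confirm that the file exists and is not empty. This is a minimal sketch using standard shell tests. The `check_etdump` helper is a hypothetical name, and the check says nothing about whether the file contents are valid.

```shell
# Sketch: confirm that an ETDump file exists and is non-empty.
# This does not validate the file contents.
check_etdump() {
  if [ -s "$1" ]; then
    echo "ETDump ready: $1 ($(wc -c < "$1" | tr -d ' ') bytes)"
  else
    echo "ETDump missing or empty: $1" >&2
    return 1
  fi
}
```

For example, `check_etdump model/linear_model_pf32_gemm.etdump` checks the file written by the command earlier in this section.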
