Run model and generate the ETDump
From your x86_64 host (where you cross-compiled), copy the runner and exported models to the Arm device:
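# Create the target directory on the Arm device if it does not exist yet
ssh <arm_user>@<arm_host> "mkdir -p ~/bench"
scp "$WORKSPACE/build-arm64/executor_runner" <arm_user>@<arm_host>:~/bench/
scp -r model/ <arm_user>@<arm_host>:~/bench/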
On the Arm device, run one of the models you exported earlier, for example the FP32 linear model linear_model_pf32_gemm.pte. The flags below tell executor_runner which model to load, where to write the ETDump, how many times to execute the model, and how many CPU threads to use.
cd ~/bench
./executor_runner -etdump_path model/linear_model_pf32_gemm.etdump -model_path model/linear_model_pf32_gemm.pte -num_executions=1 -cpu_threads 1
Raise -num_executions to invoke the model more times in one run, or change -cpu_threads to see how latency scales with the number of threads.
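If you want to compare several thread counts, a small driver script can assemble the same command line repeatedly. The following is a minimal sketch, not part of the ExecuTorch tooling: it uses only the flags shown in the command above, and the helper name `build_runner_command`, the thread counts, and the `_t<N>.etdump` naming scheme are assumptions made for illustration. It runs the binary only when `./executor_runner` is present and otherwise prints the commands it would run.

```python
import subprocess
from pathlib import Path

# Runner binary copied to ~/bench earlier; run this script from that directory.
RUNNER = "./executor_runner"


def build_runner_command(model_stem, cpu_threads, num_executions):
    """Return the executor_runner argument list for one configuration."""
    # Hypothetical naming scheme: one ETDump per thread count, so that
    # successive runs do not overwrite each other's output.
    etdump_path = f"model/{model_stem}_t{cpu_threads}.etdump"
    return [
        RUNNER,
        "-etdump_path", etdump_path,
        "-model_path", f"model/{model_stem}.pte",
        f"-num_executions={num_executions}",
        "-cpu_threads", str(cpu_threads),
    ]


if __name__ == "__main__":
    # Example thread counts; choose values that make sense for your device.
    for threads in (1, 2, 4):
        cmd = build_runner_command("linear_model_pf32_gemm", threads, 10)
        if Path(RUNNER).is_file():
            subprocess.run(cmd, check=True)
        else:
            print("runner not found, would run:", " ".join(cmd))
```

Writing a separate ETDump per configuration keeps each profile available for later comparison.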
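For the single-run command above, you should see logs like the following. The model and ETDump paths in your log will match the ones you passed on the command line (the sample below was captured with a model named linear_model_f32):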
D 00:00:00.015988 executorch:XNNPACKBackend.cpp:57] Creating XNN workspace
D 00:00:00.018719 executorch:XNNPACKBackend.cpp:69] Created XNN workspace: 0xaff21c2323e0
D 00:00:00.027595 executorch:operator_registry.cpp:96] Successfully registered all kernels from shared library: NOT_SUPPORTED
I 00:00:00.035506 executorch:executor_runner.cpp:157] Resetting threadpool with num threads = 1
I 00:00:00.048120 executorch:threadpool.cpp:48] Resetting threadpool to 1 threads.
I 00:00:00.051509 executorch:executor_runner.cpp:218] Model file model/linear_model_f32.pte is loaded.
I 00:00:00.051531 executorch:executor_runner.cpp:227] Using method forward
I 00:00:00.051541 executorch:executor_runner.cpp:278] Setting up planned buffer 0, size 2112.
D 00:00:00.051630 executorch:method.cpp:793] Loading method: forward.
....
D 00:00:00.091432 executorch:XNNExecutor.cpp:236] Resizing output tensor to a new shape
I 00:00:00.091459 executorch:executor_runner.cpp:340] Model executed successfully 1 time(s) in 2.904883 ms.
I 00:00:00.091477 executorch:executor_runner.cpp:349] 1 outputs:
OutputX 0: tensor(sizes=[1, 256], [
0.0106399, 0.0951964, 1.04854, -0.290168, -0.278126, -0.355151, 0.0583736, -0.431953, -0.0773305, -0.32844,
...,
0.553568, -0.0339369, 0.562088, -1.21021, -0.769254, 0.677771, -0.264338, 1.05453, 0.724467, 0.53182,
])
I 00:00:00.093912 executorch:executor_runner.cpp:125] ETDump written to file 'model/linear_model_f32.etdump'.
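When you collect logs from several runs, it can help to pull the timing out of the summary line automatically. The sketch below matches the "Model executed successfully" line exactly as it appears in the sample log. The function name `parse_runner_summary` is hypothetical, and the per-execution mean assumes that the reported time covers all executions together, which you should confirm for your ExecuTorch version.

```python
import re

# Matches the summary line printed by executor_runner in the sample log.
_SUMMARY = re.compile(
    r"Model executed successfully (\d+) time\(s\) in ([0-9.]+) ms"
)


def parse_runner_summary(log_text):
    """Return (executions, total_ms, mean_ms), or None if no summary line exists."""
    match = _SUMMARY.search(log_text)
    if match is None:
        return None
    executions = int(match.group(1))
    total_ms = float(match.group(2))
    # Assumption: the reported time is the total across all executions.
    mean_ms = total_ms / executions if executions else float("nan")
    return executions, total_ms, mean_ms


sample = (
    "I 00:00:00.091459 executorch:executor_runner.cpp:340] "
    "Model executed successfully 1 time(s) in 2.904883 ms."
)
print(parse_runner_summary(sample))
```

A single execution includes warm-up effects, so a mean over a larger -num_executions value is usually a steadier number to compare across configurations.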
If execution succeeds, the runner writes an ETDump file to the path you passed with -etdump_path, which in this example is next to your model. In the next section, you will load the .etdump and analyze which operators dispatched to KleidiAI and how each micro-kernel performed.
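Before moving on, you may want to confirm that the ETDump was actually written. This is a minimal sanity check, assuming the path used in the command above. The helper name `etdump_ready` is hypothetical, and the check only tests that the file exists and is not empty; it does not validate its contents.

```python
from pathlib import Path


def etdump_ready(etdump_path):
    """Return True if the ETDump file exists and is not empty."""
    path = Path(etdump_path)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    # Path taken from the -etdump_path flag in the example command.
    path = "model/linear_model_pf32_gemm.etdump"
    print(path, "ready" if etdump_ready(path) else "missing or empty")
```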