About this Learning Path

Skill level:	Advanced
Reading time:	30 min
Last updated:	18 Dec 2025

Author:	Qixiang Xu, Arm
Arm IP:	Cortex-A
Tags:	ML Linux Python ExecuTorch XNNPACK KleidiAI

Who is this for?

This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 platforms supporting SME/SME2 instructions.

What will you learn?

Upon completion of this Learning Path, you will be able to:

Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions.
Build and export ExecuTorch models that KleidiAI can accelerate using SME/SME2 instructions.
Use the executor_runner tool to run kernel workloads and collect ETDump profiling data.
Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior.

Prerequisites

Before starting, you will need the following:

An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space
An Arm64 target system that supports SME or SME2 (see the Learning Path "Devices with native SME2 support")

Benchmark a KleidiAI micro-kernel in ExecuTorch

Introduction

Set up your environment

Cross-Compile ExecuTorch for the AArch64 platform

Accelerate ExecuTorch operators with KleidiAI micro-kernels

Create and quantize linear layer benchmark model

Create and quantize convolution layer benchmark model

Create matrix multiply layer benchmark model

Run model and generate the ETDump

Analyze ETRecord and ETDump

Next Steps
