About this Learning Path

Who is this for?

This is an advanced topic for developers, performance engineers, and ML framework contributors who want to benchmark and optimize KleidiAI micro-kernels within ExecuTorch to accelerate model inference on Arm64 platforms supporting SME/SME2 instructions.

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Cross-compile ExecuTorch for Arm64 with XNNPACK and KleidiAI enabled, including SME/SME2 instructions
  • Build and export ExecuTorch models that can be accelerated by KleidiAI using SME/SME2 instructions
  • Use the executor_runner tool to run kernel workloads and collect ETDump profiling data
  • Inspect and analyze ETRecord and ETDump files using the ExecuTorch Inspector API to understand kernel-level performance behavior

Prerequisites

Before starting, you will need the following:

  • An x86_64 Linux host machine running Ubuntu, with at least 15 GB of free disk space
  • An Arm64 target system with support for SME or SME2 (see the Learning Path "Devices with native SME2 support")