About this Learning Path

Who is this for?

This is an advanced topic for developers and performance engineers who deploy ExecuTorch models on Arm devices and want to understand and reduce inference latency.

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Understand how SME2 acceleration changes the performance profile of ExecuTorch models by reducing compute-bound bottlenecks
  • Interpret operator-level and operator-category breakdowns (for example, convolution, GEMM, data movement, and other operators)
  • Identify which operators benefit most from SME2 acceleration and which operators become the new performance bottlenecks
  • Apply a model-agnostic profiling workflow that you can reuse across different models and deployments
  • Make evidence-based optimization decisions by comparing execution profiles with SME2 enabled and disabled (a minimal sketch of such a comparison follows this list)
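To make the last point concrete, here is a minimal Python sketch of the enabled-versus-disabled comparison. It assumes you have already exported per-operator timings from two profiling runs into two CSV files (the file names `profile_sme2_off.csv` and `profile_sme2_on.csv`, the `operator` and `time_ms` columns, and the category keywords are all illustrative assumptions, not part of any ExecuTorch API):

```python
import csv
from collections import defaultdict

# Hypothetical mapping from operator names to coarse categories.
# Real operator names depend on the model and the delegate in use.
def category(op_name: str) -> str:
    name = op_name.lower()
    if "conv" in name:
        return "convolution"
    if any(k in name for k in ("matmul", "linear", "gemm", "addmm")):
        return "GEMM"
    if any(k in name for k in ("copy", "permute", "transpose", "cat", "view")):
        return "data movement"
    return "other"

def load_profile(path: str) -> dict[str, float]:
    """Sum per-operator times (ms) into per-category totals."""
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[category(row["operator"])] += float(row["time_ms"])
    return totals

baseline = load_profile("profile_sme2_off.csv")    # hypothetical export
accelerated = load_profile("profile_sme2_on.csv")  # hypothetical export

print(f"{'category':<15}{'off (ms)':>10}{'on (ms)':>10}{'speedup':>9}")
for cat in sorted(baseline, key=baseline.get, reverse=True):
    off, on = baseline[cat], accelerated.get(cat, 0.0)
    speedup = off / on if on else float("inf")
    print(f"{cat:<15}{off:>10.2f}{on:>10.2f}{speedup:>8.2f}x")
```

Reading the two time columns side by side shows which operator categories SME2 shrank and which categories now dominate the remaining latency, which is exactly the evidence the optimization decisions in this Learning Path are based on.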

Prerequisites

Before starting, you will need the following:

  • An Apple Silicon macOS host with Python 3.9 or later and CMake 3.29 or later (a quick way to verify both versions is shown after this list)
  • Basic familiarity with ExecuTorch or PyTorch
  • Optionally, an Android device with Armv9 and SME2 support for on-device testing; if you use one, configure its power management settings to ensure consistent performance measurements
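If you want to confirm the host requirements before starting, a short check like the following is enough. It uses only the Python standard library and parses the first line of `cmake --version` output (for example, `cmake version 3.29.2`):

```python
import subprocess
import sys

# Require Python 3.9 or later.
assert sys.version_info >= (3, 9), \
    f"Python 3.9+ required, found {sys.version.split()[0]}"

# `cmake --version` prints e.g. "cmake version 3.29.2" on its first line.
out = subprocess.run(["cmake", "--version"], capture_output=True,
                     text=True, check=True)
major, minor = (int(p) for p in out.stdout.split()[2].split(".")[:2])
assert (major, minor) >= (3, 29), \
    f"CMake 3.29+ required, found {major}.{minor}"

print("Host meets the Python and CMake requirements.")
```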