About this Learning Path

Who is this for?

This is an advanced topic for developers and performance engineers who deploy ExecuTorch models on Arm devices and want to understand and reduce inference latency.

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Understand how SME2 acceleration changes the performance profile of ExecuTorch models by reducing compute-bound bottlenecks
  • Interpret operator-level and operator-category breakdowns (for example, convolution, GEMM, data movement, and other operators)
  • Identify which operators benefit most from SME2 acceleration and which operators become the new performance bottlenecks
  • Apply a model-agnostic profiling workflow that you can reuse across different models and deployments
  • Make evidence-based optimization decisions by comparing execution profiles with SME2 enabled and disabled (a minimal sketch of such a comparison follows this list)
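To make the last point concrete, here is a minimal Python sketch of the enabled-versus-disabled comparison. It assumes you have already exported per-operator timings from two profiling runs into two CSV files (the file names `profile_sme2_off.csv` and `profile_sme2_on.csv`, the `operator` and `time_ms` columns, and the category keywords are all illustrative assumptions, not part of any ExecuTorch API):

```python
import csv
from collections import defaultdict

# Hypothetical mapping from operator names to coarse categories.
# Real operator names depend on the model and the delegate in use.
def category(op_name: str) -> str:
    name = op_name.lower()
    if "conv" in name:
        return "convolution"
    if any(k in name for k in ("matmul", "linear", "gemm", "addmm")):
        return "GEMM"
    if any(k in name for k in ("copy", "permute", "transpose", "cat", "view")):
        return "data movement"
    return "other"

def load_profile(path: str) -> dict[str, float]:
    """Sum per-operator times (ms) into per-category totals."""
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[category(row["operator"])] += float(row["time_ms"])
    return totals

baseline = load_profile("profile_sme2_off.csv")    # hypothetical export
accelerated = load_profile("profile_sme2_on.csv")  # hypothetical export

print(f"{'category':<15}{'off (ms)':>10}{'on (ms)':>10}{'speedup':>9}")
for cat in sorted(baseline, key=baseline.get, reverse=True):
    off, on = baseline[cat], accelerated.get(cat, 0.0)
    speedup = off / on if on else float("inf")
    print(f"{cat:<15}{off:>10.2f}{on:>10.2f}{speedup:>8.2f}x")
```

Reading the two time columns side by side shows which operator categories SME2 shrank and which categories now dominate the remaining latency, which is exactly the evidence the optimization decisions in this Learning Path are based on.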

Prerequisites

Before starting, you will need the following:

  • An Apple Silicon macOS host with Python 3.9 or later and CMake 3.29 or later (a quick way to verify both versions is shown after this list)
  • Basic familiarity with ExecuTorch or PyTorch
  • Optionally, an Android device with Armv9 and SME2 support for on-device testing; if you use one, configure its power management settings to ensure consistent performance measurements
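If you want to confirm the host requirements before starting, a short check like the following is enough. It uses only the Python standard library and parses the first line of `cmake --version` output (for example, `cmake version 3.29.2`):

```python
import subprocess
import sys

# Require Python 3.9 or later.
assert sys.version_info >= (3, 9), \
    f"Python 3.9+ required, found {sys.version.split()[0]}"

# `cmake --version` prints e.g. "cmake version 3.29.2" on its first line.
out = subprocess.run(["cmake", "--version"], capture_output=True,
                     text=True, check=True)
major, minor = (int(p) for p in out.stdout.split()[2].split(".")[:2])
assert (major, minor) >= (3, 29), \
    f"CMake 3.29+ required, found {major}.{minor}"

print("Host meets the Python and CMake requirements.")
```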