Set up the ExecuTorch profiling environment

Profile ExecuTorch models with SME2 on Arm

Set up your workspace and build ExecuTorch runners

This section covers the one-time setup required to run the performance analysis pipeline. Once completed, you can reuse this setup for all the models you analyze. The pipeline is model-agnostic: after exporting a model to .pte format, the same runners, scripts, and analysis steps apply regardless of model architecture. Only the model export step is model-specific.

You will complete two tasks. First, set up the development environment and install ExecuTorch from source. Second, build ExecuTorch runner binaries (SME2 on and SME2 off).

Perform this setup once and reuse it across all subsequent analyses.

Organize your profiling workspace

The profiling scripts expect a consistent directory layout. The structure below is created automatically as you run the setup and build scripts.

<executorch_sme2_kit_root>/
├── model_profiling/
│   ├── scripts/        # export + run + analyze entrypoints
│   ├── configs/        # JSON configs (templates + examples)
│   ├── export/         # model export script
│   ├── models/         # model source code (model definitions, registration)
│   ├── out_<model>/    # exported artifacts per model (created during export)
│   │   ├── artifacts/     # .pte files and optional .etrecord
│   │   └── runs/          # run outputs for this model (etdump, metrics, reports)
│   ├── tools/          # analysis tools (ETDump conversion, operator categorization)
│   └── pipeline/        # pipeline internals
├── executorch/       # ExecuTorch checkout (created during setup; name can't be changed, CMake requires it)
│   └── cmake-out/      # CMake build outputs (created after building runners)
│       ├── android-arm64-v9a/executor_runner       # Android SME2-on runner (for mobile device testing, if built)
│       ├── android-arm64-v9a-sme2-off/executor_runner  # Android SME2-off runner (for mobile device testing, if built)
│       ├── mac-arm64/executor_runner              # macOS SME2-on runner (developer accessibility)
│       └── mac-arm64-sme2-off/executor_runner      # macOS SME2-off runner (developer accessibility)
└── .venv/            # Python virtual environment

The directory structure is designed for reliability and reproducibility:

executorch/ must be named exactly executorch/ - ExecuTorch’s CMake configuration uses hardcoded relative paths. Renaming this directory breaks the build.
executorch/cmake-out/ links runners to their ExecuTorch version - This makes it easy to correlate profiling results with the specific ExecuTorch checkout used to build the runners.
out_<model>/ groups artifacts by model - Keeping exported models and profiling results together simplifies comparison and ensures reproducibility.

This layout also supports testing multiple ExecuTorch versions in parallel by maintaining separate workspaces.

Set up the environment

This step installs ExecuTorch from source and creates a Python virtual environment. It is required only once per workspace.

bash model_profiling/scripts/setup_repo.sh

The script creates and activates a Python virtual environment under .venv/, clones the ExecuTorch repository if it doesn’t already exist, updates the repository and initializes required submodules, and installs ExecuTorch in editable mode.

Editable installation ensures that changes to the ExecuTorch source tree (for example, switching branches or testing pull requests) are immediately reflected without reinstalling the package. This is essential when testing how new ExecuTorch changes affect model export and performance analysis.

Optional: Manual setup

If you prefer to set up the environment manually, use these commands:

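# Create and activate the virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Clone ExecuTorch on first run; the directory must be named executorch
if [ ! -d executorch ]; then
  git clone https://github.com/pytorch/executorch.git executorch
fi
cd executorch
# Move the checkout to the latest main
git fetch origin main --depth 1
git checkout -B main origin/main

# Initialize the required submodules (including XNNPACK)
git submodule sync
git submodule update --init --recursive

# Install ExecuTorch in editable mode, then return to the workspace root
pip install -e .
cd ..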

After setup, verify that ExecuTorch is importable and submodules are present.

source .venv/bin/activate
python -c "import executorch; print(f'ExecuTorch: {executorch.__file__}')"

Check that the executorch directory and submodules exist:

ls -d executorch/
ls -d executorch/backends/xnnpack/third-party/XNNPACK
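
The same checks can be scripted if you want to run them before every build. The following is a minimal Python sketch, not part of the kit: the function names are hypothetical, and it assumes that a populated directory is enough evidence that a submodule was initialized.

```python
from pathlib import Path

# Paths the setup step is expected to create, relative to the workspace root.
# The list mirrors the manual checks above.
REQUIRED_PATHS = [
    ".venv",
    "executorch",
    "executorch/backends/xnnpack/third-party/XNNPACK",
]


def is_populated_dir(path):
    """True if `path` is a directory containing at least one entry."""
    return path.is_dir() and any(path.iterdir())


def missing_setup_paths(workspace_root="."):
    """Return the required paths that are absent or empty (hypothetical helper)."""
    root = Path(workspace_root)
    return [p for p in REQUIRED_PATHS if not is_populated_dir(root / p)]
```

Calling missing_setup_paths(".") from the workspace root returns an empty list when setup looks complete. Checking for a populated directory rather than mere existence catches an uninitialized submodule, which typically leaves an empty directory behind.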

Build ExecuTorch runners with SME2 support

Next, build the ExecuTorch runner binaries used for profiling. These runners are model-agnostic and can execute any .pte file. Run the build script:

bash model_profiling/scripts/build_runners.sh

The script builds both SME2-enabled and SME2-disabled runners to support direct performance comparison.

This generates runners in executorch/cmake-out/:

Android runners:

executorch/cmake-out/android-arm64-v9a/executor_runner (SME2-on)
executorch/cmake-out/android-arm64-v9a-sme2-off/executor_runner (SME2-off)

Android builds are performed automatically if the ANDROID_NDK environment variable is set.

macOS runners:

executorch/cmake-out/mac-arm64/executor_runner (SME2-on)
executorch/cmake-out/mac-arm64-sme2-off/executor_runner (SME2-off)

macOS runners are included to make the workflow accessible without requiring a mobile device.
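
Because an SME2 comparison needs both variants of a runner, it can help to confirm which platforms have a complete on/off pair before starting a run. The sketch below is illustrative only: the build directory names come from the list above, while the function name and the "android"/"mac" keys are assumptions.

```python
from pathlib import Path

# (SME2-on, SME2-off) build directories under executorch/cmake-out/.
RUNNER_PAIRS = {
    "android": ("android-arm64-v9a", "android-arm64-v9a-sme2-off"),
    "mac": ("mac-arm64", "mac-arm64-sme2-off"),
}


def available_pairs(workspace_root="."):
    """Return {platform: (on_runner, off_runner)} where both binaries exist."""
    cmake_out = Path(workspace_root) / "executorch" / "cmake-out"
    pairs = {}
    for platform, (on_dir, off_dir) in RUNNER_PAIRS.items():
        on_runner = cmake_out / on_dir / "executor_runner"
        off_runner = cmake_out / off_dir / "executor_runner"
        # Only report a platform when both variants were built.
        if on_runner.is_file() and off_runner.is_file():
            pairs[platform] = (on_runner, off_runner)
    return pairs
```

If ANDROID_NDK was not set during the build, only the "mac" entry is expected to appear.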

Why runners are built once: The runners are model-agnostic. They can execute any .pte file, so you build them once and reuse them for all models you analyze. You only need to rebuild if you change CMake build configurations (for example, enabling XNNPACK kernel trace logging for kernel-level analysis).

Version compatibility: Model export and runner builds must use the same ExecuTorch version. A .pte file exported with one ExecuTorch revision might not be compatible with a runner built from another revision. Keeping runners under executorch/cmake-out/ ensures this relationship remains explicit.

The two-run workflow: Complete performance analysis needs two types of runners. Timing-only runners (the default) have no trace logging overhead and provide accurate latency measurements. XNNPACK kernel trace runners enable xnntrace logging for kernel-level insights; because logging affects timing, use them only for kernel analysis.

The build script produces timing-only runners by default. For kernel-level analysis, you need separate trace-enabled runners built with XNNPACK kernel logging flags.

How it works: ExecuTorch ships with a default CMakePresets.json, and you can add custom presets for SME2 performance analysis (SME2-on/off variants, platform-specific configs). The build script merges the custom presets (model_profiling/assets/cmake_presets.json) into ExecuTorch’s default file, then runs cmake --preset commands. This approach keeps ExecuTorch’s defaults intact while adding the configurations specific to performance analysis. No manual CMake flags are needed.
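
To make the merge step concrete, here is a minimal Python sketch of one way such a merge could work. It is not the build script's actual implementation: it assumes the custom file holds its presets in a top-level "configurePresets" array (the key CMake uses for configure presets), and both function names are hypothetical.

```python
import json
from pathlib import Path


def merge_configure_presets(base, custom):
    """Return a copy of `base` with `custom` configure presets merged in by name."""
    merged = dict(base)
    # Index the default presets by name; dicts keep insertion order.
    by_name = {p["name"]: p for p in base.get("configurePresets", [])}
    for preset in custom.get("configurePresets", []):
        # A custom preset replaces a default of the same name, otherwise it is appended.
        by_name[preset["name"]] = preset
    merged["configurePresets"] = list(by_name.values())
    return merged


def merge_preset_files(base_path, custom_path):
    """Merge the custom preset file into the base file in place (assumed behavior)."""
    base = json.loads(Path(base_path).read_text())
    custom = json.loads(Path(custom_path).read_text())
    merged = merge_configure_presets(base, custom)
    Path(base_path).write_text(json.dumps(merged, indent=2) + "\n")
```

Merging by preset name keeps the operation idempotent, so running it again after an ExecuTorch update does not duplicate entries.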

Example presets:

Android (for mobile device testing):

{
  "name": "android-arm64-v9a",
  "displayName": "Android arm64 with XNNPACK (SME2 ON, timing)",
  "generator": "Ninja",
  "binaryDir": "${sourceDir}/cmake-out/android-arm64-v9a",
  "cacheVariables": {
    "EXECUTORCH_BUILD_XNNPACK": "ON",
    "EXECUTORCH_BUILD_DEVTOOLS": "ON",
    "EXECUTORCH_BUILD_EXECUTOR_RUNNER": "ON",
    "EXECUTORCH_ENABLE_LOGGING": "ON",
    "EXECUTORCH_ENABLE_EVENT_TRACER": "ON",
    "EXECUTORCH_BUILD_KERNELS_QUANTIZED": "ON",
    "EXECUTORCH_BUILD_KERNELS_QUANTIZED_AOT": "ON",
    "EXECUTORCH_XNNPACK_ENABLE_KLEIDI": "ON",
    "CMAKE_BUILD_TYPE": "RelWithDebInfo"
  }
}

macOS (developer accessibility):

{
  "name": "mac-arm64",
  "displayName": "Mac arm64 with XNNPACK (SME2 ON, timing)",
  "generator": "Ninja",
  "binaryDir": "${sourceDir}/cmake-out/mac-arm64",
  "cacheVariables": {
    "EXECUTORCH_BUILD_XNNPACK": "ON",
    "EXECUTORCH_BUILD_DEVTOOLS": "ON",
    "EXECUTORCH_BUILD_EXECUTOR_RUNNER": "ON",
    "EXECUTORCH_ENABLE_LOGGING": "ON",
    "EXECUTORCH_ENABLE_EVENT_TRACER": "ON",
    "EXECUTORCH_BUILD_KERNELS_QUANTIZED": "ON",
    "EXECUTORCH_BUILD_KERNELS_QUANTIZED_AOT": "ON",
    "EXECUTORCH_XNNPACK_ENABLE_KLEIDI": "ON",
    "CMAKE_BUILD_TYPE": "RelWithDebInfo"
  }
}

Key settings explained:

EXECUTORCH_ENABLE_EVENT_TRACER: ON - Enables ETDump trace generation (required for performance analysis)
EXECUTORCH_BUILD_DEVTOOLS: ON - Enables performance analysis tools
EXECUTORCH_BUILD_EXECUTOR_RUNNER: ON - Builds the runner binary
EXECUTORCH_XNNPACK_ENABLE_KLEIDI: ON - Enables Arm KleidiAI kernels
EXECUTORCH_BUILD_KERNELS_QUANTIZED: ON - Enables quantized kernel support for INT8 models

Note: SME2 acceleration is enabled by default in XNNPACK when building for Arm architectures, so XNNPACK_ENABLE_ARM_SME2 is not needed in this example.

Understand the pipeline components

The performance analysis pipeline consists of three stages. First, execute the model under defined configurations. Second, collect timing data and ETDump traces. Third, analyze traces into operator-level and category-level summaries.

All stages are driven by JSON configuration files and are independent of model architecture.

Config-driven experiments

Each experiment is defined in a JSON configuration file. These configs specify:

Which runner to use (SME2 on/off, timing-only or trace-enabled)
Which .pte model to execute
Runtime parameters such as CPU thread count and warmup iterations
Logging level: timing-only mode or full ETDump trace mode (trace mode affects timing, so use it only for kernel-level insights)
Analysis comparisons: Which experiment pairs to compare (for example, SME2-on vs SME2-off)

This approach enables repeatable, systematic performance analysis across multiple configurations without modifying pipeline code. Example templates are available in the model_profiling/configs/ directory once workspace setup has completed.
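
As a rough illustration of what such a config might contain, the sketch below parses a small JSON document and checks that every comparison refers to a defined experiment. Every key name here ("model", "experiments", "runner", "threads", "warmup", "comparisons") is hypothetical and chosen only to mirror the list above; refer to the templates in the kit for the real schema.

```python
import json

# Hypothetical config: the key names are illustrative, not the kit's schema.
EXAMPLE = """
{
  "model": "out_<model>/artifacts/<model>.pte",
  "experiments": [
    {"name": "sme2_on",
     "runner": "executorch/cmake-out/mac-arm64/executor_runner",
     "threads": 1, "warmup": 5},
    {"name": "sme2_off",
     "runner": "executorch/cmake-out/mac-arm64-sme2-off/executor_runner",
     "threads": 1, "warmup": 5}
  ],
  "comparisons": [["sme2_on", "sme2_off"]]
}
"""


def validate_config(config):
    """Return a list of problems; an empty list means the config is self-consistent."""
    problems = []
    names = [e["name"] for e in config["experiments"]]
    if len(names) != len(set(names)):
        problems.append("duplicate experiment names")
    for pair in config["comparisons"]:
        for name in pair:
            if name not in names:
                problems.append(f"comparison references unknown experiment: {name}")
    return problems


config = json.loads(EXAMPLE)
print(validate_config(config))
```

Validating a config before a run is cheap and avoids discovering a mistyped experiment name only after the device runs have finished.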

Pipeline scripts

The following scripts execute the pipeline:

model_profiling/scripts/android_pipeline.py - Android performance analysis pipeline
model_profiling/scripts/mac_pipeline.py - macOS performance analysis pipeline

These scripts read the experiment configuration, run all defined experiments, and collect ETDump traces and logs. Analysis then runs automatically through model_profiling/scripts/analyze_results.py.

The analysis script generates operator-level breakdowns, category-level summaries, and comparison reports. It parses structured data (ETDump, CSV, JSON) produced by the runners, regardless of which model generated them. Invoke it manually only if you want to reprocess existing traces.

Output structure

All profiling runs follow a consistent output layout under out_<model>/runs/<platform>/<experiment_name>/:

etdump.etdp - Binary trace file with execution timing
etdump.json - Human-readable trace (converted from .etdp)
executor_runner.log - Runner console output
operator_stats.csv - Per-operator timing breakdown
category_stats.csv - Category-level aggregation
comparison_report.txt - Side-by-side experiment comparison (when applicable)

This structure makes it easy to compare results across models, track experiment history, and generate reports from structured outputs.
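
Because the layout is fixed, a short script can flag runs that did not produce every expected file. This is a minimal sketch with a hypothetical function name. It takes the file names from the list above, treats comparison_report.txt as optional, and assumes experiment directories sit two levels below runs/ (platform, then experiment name).

```python
from pathlib import Path

# Files every experiment directory is expected to contain.
# comparison_report.txt is left out because it exists only when applicable.
EXPECTED_FILES = [
    "etdump.etdp",
    "etdump.json",
    "executor_runner.log",
    "operator_stats.csv",
    "category_stats.csv",
]


def incomplete_runs(runs_dir):
    """Map '<platform>/<experiment_name>' to the expected files it lacks."""
    root = Path(runs_dir)
    result = {}
    for exp_dir in sorted(root.glob("*/*")):
        if not exp_dir.is_dir():
            continue
        missing = [f for f in EXPECTED_FILES if not (exp_dir / f).is_file()]
        if missing:
            result[exp_dir.relative_to(root).as_posix()] = missing
    return result
```

Pass the runs/ directory of a model, for example incomplete_runs("model_profiling/out_<model>/runs") with the placeholder replaced by your model's directory name. An empty dictionary means every experiment directory contains the full set of expected files.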

What you’ve accomplished and what’s next

You’ve completed the one-time setup: a working ExecuTorch development environment, SME2-enabled and SME2-disabled runner binaries, and a reusable, model-agnostic profiling pipeline. This setup applies to any model you analyze going forward.

Next, you’ll export a model, define experiments, and analyze how SME2 changes its performance profile.
