This section walks you through the end-to-end workflow you repeat for each model: onboarding and export to .pte, running the two-run performance analysis (timing-only and trace-enabled), and analyzing operator-level results. Model-specific work usually appears during onboarding and export; after you produce a .pte file, the pipeline is model-agnostic.
For a concrete example of model onboarding, see EfficientSAM in executorch/examples/models/, which demonstrates the full onboarding process, including input/output handling, quantization, and backend delegation.
Model onboarding prepares the model for export and accurate profiling. Typical steps include defining or wrapping a PyTorch model so that it is exportable (it implements the EagerModelBase interface), defining representative example inputs (shapes and dtypes that match your use case), refactoring model code to avoid unsupported operations or dynamic control flow (if needed), and choosing the export dtype and quantization.
executor_runner can use default all-one input tensors if no input is provided, but for accurate performance analysis you should define representative inputs that match your actual use case. The export step uses these example inputs to capture the model graph with correct shapes and dtypes. Export dtype and quantization options are also chosen at this stage (for example, --dtype fp16 in the export command shown later in this section).
Real onboarding work often requires small wrapper layers (normalization/scaling), operator substitutions, and shape stabilization.
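As a concrete illustration, the sketch below wraps a hypothetical backbone so that input normalization is baked into the exported graph and the input shape is pinned; the class name, constants, and shapes are illustrative assumptions, not taken from the reference code.

import torch
import torch.nn as nn

class ExportWrapper(nn.Module):
    """Illustrative wrapper: bakes input normalization into the exported graph."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone.eval()
        # ImageNet-style constants registered as buffers so they travel with the graph.
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Callers pass raw 0-1 images; normalization happens inside the graph.
        return self.backbone((x - self.mean) / self.std)

# Stand-in backbone and a fixed example input that pins the shape for export.
wrapped = ExportWrapper(nn.Conv2d(3, 8, kernel_size=3, padding=1))
example_inputs = (torch.rand(1, 3, 224, 224),)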
To add a new model to the performance analysis workflow:
1. Create a model_profiling/models/<model_name>/ directory.
2. Implement the EagerModelBase interface in model.py (a sketch of such a model.py follows this list):
   - get_eager_model(): returns the PyTorch model
   - get_example_inputs(): returns a tuple of example tensors
3. Register the model in model_profiling/models/__init__.py:
from . import your_model
register_model("your_model", YourModelClass)
The key insight: This registration system lets you add models without touching ExecuTorch source code. The exporter patches the registry at runtime, so your models appear alongside ExecuTorch’s built-in models.
Advanced onboarding example: the EdgeTAM image encoder (see agent skill 08_onboard_edgetam.md for the complete onboarding workflow) demonstrates this process on a more complex model.
Exporting a model to .pte format is model-specific work. The code repository used for this learning path includes a reference exporter at model_profiling/export/export_model.py that demonstrates the export process.
The exporter uses a registry patching system that allows local models (in model_profiling/models/) to be exported without editing ExecuTorch source code. This keeps the pipeline stable while you add new models.
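Under the hood, an XNNPACK export to .pte typically follows the pattern sketched below. This is not the reference exporter itself (which adds dtype handling, artifact naming, and ETRecord generation on top), the model and shapes are stand-ins, and API names should be checked against your ExecuTorch version.

import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Stand-in model and example inputs; replace with your registered model.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()
example_inputs = (torch.rand(1, 3, 224, 224),)

# Capture the graph, lower Conv/GEMM-friendly subgraphs to XNNPACK, and serialize.
exported = torch.export.export(model, example_inputs)
edge = to_edge_transform_and_lower(exported, partitioner=[XnnpackPartitioner()])
et_program = edge.to_executorch()

with open("model_xnnpack_fp32.pte", "wb") as f:
    f.write(et_program.buffer)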
Supported models:
- toy_cnn (no extra dependencies; good for smoke tests)
- mobilenet_v3_small (requires torchvision; otherwise it falls back to toy_cnn)
Example export command:
python model_profiling/export/export_model.py \
--model toy_cnn \
--dtype fp16 \
--outdir model_profiling/out_toy_cnn/artifacts/
Expected outputs:
- model_profiling/out_toy_cnn/artifacts/toy_cnn_xnnpack_fp16.pte
- model_profiling/out_toy_cnn/artifacts/toy_cnn_xnnpack_fp16.pte.etrecord (optional; operator metadata)
Keep exported artifacts under out_<model>/artifacts/.
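Before running experiments, you can optionally smoke-test the exported .pte with ExecuTorch's runtime Python API. The sketch below assumes the toy_cnn artifact path plus an input shape and dtype that may not match the real model, and the API surface can differ across ExecuTorch versions.

from pathlib import Path

import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program(
    Path("model_profiling/out_toy_cnn/artifacts/toy_cnn_xnnpack_fp16.pte")
)
forward = program.load_method("forward")
# Input shape and dtype are assumptions; match whatever the model was exported
# with (an fp16 export usually expects half-precision inputs).
outputs = forward.execute([torch.rand(1, 3, 224, 224).half()])
print([tuple(o.shape) for o in outputs])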
For complete analysis you must run two experiment types: timing-only runs for latency measurement, and trace-enabled runs only when you need kernel-level details.
Experiments are defined with JSON configs. Minimal required fields:
- model: path to the .pte file
- output_root: where to write run artifacts
- experiments[]: a list of runs (SME2 on/off, threads, runs, runner_path)
Use template configs in model_profiling/configs/templates/ and examples in model_profiling/configs/examples/ as starting points.
To inspect a template or example config:
cat model_profiling/configs/templates/mac_template.json
Or use the provided example:
cat model_profiling/configs/examples/mac_mobilenet_fp16.json
Important fields to set:
- experiments[].runner_path: points to the runner binary in executorch/cmake-out/<preset>/executor_runner (timing-only or trace-enabled)
- experiments[].num_runs: number of iterations (default 10 for statistical significance)
- experiments[].num_threads: thread count (usually 1 for single-threaded performance analysis)
JSON configs are version-controlled and reproducible, and they let you run batches of experiments with consistent parameters (see the sketch below for a minimal example).
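The sketch writes a minimal config as a Python dict dumped to JSON. Field names beyond model, output_root, and the experiments fields listed above are assumptions, so treat the provided templates as the source of truth.

import json

config = {
    "model": "model_profiling/out_toy_cnn/artifacts/toy_cnn_xnnpack_fp16.pte",
    "output_root": "model_profiling/out_toy_cnn/runs/mac",
    "experiments": [
        {
            "name": "mac_sme2_on",  # assumed field name
            "runner_path": "executorch/cmake-out/<preset>/executor_runner",
            "num_runs": 10,
            "num_threads": 1,
        },
        # A second entry with SME2 disabled would mirror this one but point to
        # the SME2-off runner build.
    ],
}

with open("my_mac_config.json", "w") as f:
    json.dump(config, f, indent=2)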
The pipeline supports both Android (representative device testing) and macOS (developer accessibility). The same pipeline scripts are used; only runner binaries and environment differ.
For testing on real Arm edge mobile devices, Android runs are essential. This is the only way to validate performance on actual hardware with real thermal constraints, memory bandwidth, and device-specific optimizations.
Critical: Power management settings significantly impact inference latency measurements. Configure your device for a consistent performance mode and keep settings identical across runs; inconsistent power settings cause measurement variance that obscures actual performance differences.
Configure device settings before running:
Option 1: Unconstrained boost mode (for stress testing hardware - removes all CPU frequency caps and scaling policies):
adb shell settings put global low_power 0
adb shell settings put global stay_on_while_plugged_in 15
adb shell svc power stayon true
adb shell settings put global window_animation_scale 0
adb shell settings put global transition_animation_scale 0
adb shell settings put global animator_duration_scale 0
These settings disable low power mode and UI animations, and force the device to stay on while plugged in.
Option 2: App developer mode (for perceived app performance - real user daily experience):
adb shell settings put global stay_on_while_plugged_in 1
adb shell settings put global window_animation_scale 1
adb shell settings put global transition_animation_scale 1
adb shell settings put global animator_duration_scale 1
adb shell settings put global low_power 0
These settings disable low power mode, keep UI animations enabled, and use normal device settings.
Verify settings:
adb shell settings get global stay_on_while_plugged_in
adb shell settings get global low_power
adb shell dumpsys power | grep -i "stay.*awake\|thermal"
Unconstrained boost mode (stay_on_while_plugged_in set to 15) removes all CPU frequency caps and scaling policies for stress testing hardware. App developer mode (value 1) reflects normal device usage and the app performance a real user perceives day to day. Choose based on your measurement goals and keep settings consistent across all runs.
Run the Android pipeline (example):
python model_profiling/scripts/android_pipeline.py --config model_profiling/configs/examples/android_mobilenet_fp16.json
python model_profiling/scripts/analyze_results.py --run-dir model_profiling/out_toy_cnn/runs/android
The pipeline uses adb to push and execute runners on-device, pulls ETDump traces back to the host, and runs automatic analysis.
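This is not the pipeline script itself, but a simplified sketch of the adb round trip it automates; the device directory, file names, and output path are illustrative assumptions.

import subprocess

def adb(*args: str) -> None:
    # Thin wrapper that fails loudly if any adb command errors out.
    subprocess.run(["adb", *args], check=True)

device_dir = "/data/local/tmp/model_profiling"
adb("shell", "mkdir", "-p", device_dir)
adb("push", "executor_runner", f"{device_dir}/executor_runner")
adb("push", "toy_cnn_xnnpack_fp16.pte", f"{device_dir}/model.pte")
adb("shell", f"chmod +x {device_dir}/executor_runner")
adb("shell", f"cd {device_dir} && ./executor_runner --model_path model.pte")
# With a trace-enabled runner, an ETDump file is also produced on-device and
# pulled back to the host (the on-device file name here is an assumption).
adb("pull", f"{device_dir}/model.etdump", "toy_cnn_android_sme2_on_t1.etdump")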
The output is similar to:
out_<model>/runs/android/
android_sme2_on/
<model>_android_sme2_on_t1.etdump # Primary: operator-level trace data
<model>_android_sme2_on_t1.log # Runner stdout/stderr
android_sme2_off/
<model>_android_sme2_off_t1.etdump # Primary: operator-level trace data
<model>_android_sme2_off_t1.log # Runner stdout/stderr
manifest.json # Optional: run metadata (ExecuTorch SHA, config path, model path)
metrics.json # Optional: summary latencies (duplicate of manifest results)
The workflow: Same pipeline, different platform. The config points to Android runners, and the script uses adb to push binaries and run on device. ETDump traces are pulled back to the host for analysis.
Expected outcome: You’ll see similar operator-category breakdowns, but with real device timing. Often, data movement bottlenecks are more pronounced on real hardware due to memory bandwidth constraints.
For kernel-level insights on Android, run again with trace-enabled runners (separate config pointing to trace-enabled runner builds). Trace logging impacts timing, so these runs are only for kernel analysis, not latency measurement.
# Use a separate config that points to trace-enabled Android runners
# Note: Create a trace config by copying an existing config and updating runner_path to point to trace-enabled runners
python model_profiling/scripts/android_pipeline.py --config model_profiling/configs/examples/android_mobilenet_fp16.json
For development and initial testing, run on macOS. This is fast and convenient, but not representative of real device performance.
Timing-only run example:
python model_profiling/scripts/mac_pipeline.py --config model_profiling/configs/examples/mac_mobilenet_fp16.json
Trace-enabled (kernel analysis) run example — use a config that points to trace-enabled runners:
python model_profiling/scripts/mac_pipeline.py --config model_profiling/configs/examples/mac_mobilenet_fp16.json
Validate run artifacts (optional):
python model_profiling/scripts/validate_results.py --results model_profiling/out_toy_cnn/runs/mac
Validation checks for the optional JSON files, but they're not required for analysis; the analysis script works directly with ETDump files.
After a run, the output is similar to:
out_<model>/runs/mac/
mac_sme2_on/
<model>_mac_sme2_on_t1.etdump # Primary: operator-level trace data
<model>_mac_sme2_on_t1.log # Runner stdout/stderr
mac_sme2_off/
<model>_mac_sme2_off_t1.etdump # Primary: operator-level trace data
<model>_mac_sme2_off_t1.log # Runner stdout/stderr
manifest.json # Optional: run metadata (ExecuTorch SHA, config path, model path)
metrics.json # Optional: summary latencies (duplicate of manifest results)
The .etdump files are the primary data source. End-to-end latency numbers and operator-level breakdowns are derived from them. The JSON files (manifest.json, metrics.json) are optional metadata/logs for reproducibility and convenience, but analysis scripts work directly with ETDump.
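If you want to inspect an ETDump directly, outside the provided analysis script, ExecuTorch's devtools Inspector can load it. The sketch below assumes the file paths from the run layout above and that an ETRecord was generated at export time; API details may vary by ExecuTorch version.

from executorch.devtools import Inspector

# The .etrecord is optional but enables richer operator metadata.
inspector = Inspector(
    etdump_path="model_profiling/out_toy_cnn/runs/mac/mac_sme2_on/toy_cnn_mac_sme2_on_t1.etdump",
    etrecord="model_profiling/out_toy_cnn/artifacts/toy_cnn_xnnpack_fp16.pte.etrecord",
)
inspector.print_data_tabular()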
You’ve now completed the full profiling workflow: exporting a model to .pte format, running timing and trace experiments on both platforms, and collecting ETDump traces. You’re ready to analyze the results and understand where SME2 delivers performance gains.
Next, you’ll analyze operator-level performance and identify optimization opportunities.
The pipeline runs analyze_results.py automatically after measurement. To rerun analysis manually:
python model_profiling/scripts/analyze_results.py \
--run-dir model_profiling/out_toy_cnn/runs/mac
The analysis generates model_profiling/out_toy_cnn/runs/mac/analysis_summary.json (operator-category breakdown) and CSV files in the same directory as the ETDump files (if not already generated by the pipeline).
Operator categories: the analysis groups operators into categories such as CONV/GEMM and data movement so you can see where the time goes.
Before SME2: CONV/GEMM operations dominate timing (for example, 60-80% of total time).
Typical outcome after SME2: CONV/GEMM time shrinks and data movement becomes a larger percentage of total time. This is expected; SME2 reveals the next bottleneck, and the visualization makes it obvious where to optimize next.
If data movement dominates after SME2, focus optimizations on transpose elimination to reduce layout changes, layout optimization to choose layouts that minimize copies, and memory access patterns to reduce cache misses.
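To make the category breakdown concrete, here is an illustrative calculation of how per-operator times roll up into category percentages; the operator names, category mapping, and timings are made up and do not come from a real run.

from collections import defaultdict

# Made-up per-operator timings (microseconds), purely for illustration.
op_times_us = {
    "convolution": 1200.0,
    "addmm": 800.0,
    "permute_copy": 450.0,
    "cat": 150.0,
    "native_layer_norm": 200.0,
}

# Assumed mapping from operator name to category.
categories = {
    "convolution": "CONV/GEMM",
    "addmm": "CONV/GEMM",
    "permute_copy": "data movement",
    "cat": "data movement",
    "native_layer_norm": "normalization",
}

totals = defaultdict(float)
for op, t in op_times_us.items():
    totals[categories.get(op, "other")] += t

grand_total = sum(totals.values())
for category, t in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category:>15}: {t:8.1f} us ({100 * t / grand_total:.1f}%)")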
Use trace-enabled runs to confirm which kernel variants were selected (for example, SME2-enabled kernels vs NEON). Remember: trace-enabled runs are evidence-gathering only and shouldn’t be used for direct latency comparisons.