Identify programs for BOLT optimization

Optimize AArch64 binaries with LLVM BOLT

What makes a program a good BOLT candidate?

Hardware performance metrics can help determine whether a program is a good candidate for code layout optimization with BOLT. Developers often analyze these metrics using methodologies such as the Arm Topdown methodology.

You will focus on a small set of Topdown indicators related to instruction delivery and code locality. These indicators describe how efficiently the processor fetches instructions and keeps the execution pipeline busy.

When instruction delivery is inefficient, the workload is referred to as front-end bound, meaning the CPU often waits for instructions instead of executing them. This usually points to instruction fetch or code layout issues, where improving code layout can help.

The L1 instruction cache (L1 I-cache) is the first and fastest cache used to store instructions close to the CPU. When instructions are not found there, the CPU must fetch them from slower levels of the memory hierarchy, which can stall execution. MPKI, short for misses per kilo instructions, counts how many misses occur per 1,000 executed instructions. Normalizing by instruction count makes the metric easier to compare across programs and workloads. A high L1 I-cache MPKI usually indicates poor instruction locality in the binary.

Based on these observations, the BOLT community typically considers a program a good candidate for layout optimization when:

The workload is more than 10% front-end bound
The L1 I-cache MPKI exceeds 30

A high branch misprediction rate or a high I-TLB miss rate can also indicate that code layout optimization may improve performance.

Collecting the metrics

You can collect these metrics using topdown-tool, which implements the Arm Topdown methodology (see installation guide) and builds on the Linux perf profiling tool.

Alternatively, you can compute only the L1 I-cache MPKI metric manually using a basic Linux perf stat command.

    topdown-tool ./out/bsort
__output__      CPU Neoverse V1 metrics
__output__      ├── Stage 1 (Topdown metrics)
__output__      │   └── Topdown Level 1 (Topdown_L1)
__output__      │       └── ┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓
__output__      │           ┃ Metric          ┃ Value ┃ Unit ┃
__output__      │           ┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩
__output__      │           │ Backend Bound   │ 11.77 │ %    │
__output__      │           │ Bad Speculation │ 17.92 │ %    │
__output__      │         » │ Frontend Bound  │ 55.73 │ %    │ «
__output__      │           │ Retiring        │ 14.88 │ %    │
__output__      │           └─────────────────┴───────┴──────┘
__output__      └── Stage 2 (uarch metrics)
__output__          ├── Misses Per Kilo Instructions (MPKI)
__output__          │   └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
__output__          │       ┃ Metric                  ┃ Value  ┃ Unit                          ┃
__output__          │       ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
__output__          │       │ Branch MPKI             │ 16.583 │ misses per 1,000 instructions │
__output__          │     » │ L1I Cache MPKI          │ 60.408 │ misses per 1,000 instructions │ «
__output__          │       └─────────────────────────┴────────┴───────────────────────────────┘
__output__          ...

    perf stat -e instructions,L1-icache-misses:u ./out/bsort
__output__      Performance counter stats for './out/bsort':
__output__
__output__          957828603 instructions
__output__           58003648 L1-icache-misses
__output__
__output__        0.282472631 seconds time elapsed
__output__
__output__        0.282541000 seconds user
__output__        0.000000000 seconds sys
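If you want to post-process the counters rather than read them by eye, the perf stat text output above can be parsed with a few lines of Python. This is a minimal sketch, not part of the original workflow: the function name `parse_perf_stat` and the regular expression are illustrative assumptions, and the parser assumes the default human-readable layout shown above, where each counter line starts with an integer count followed by the event name.

```python
import re

# Matches counter lines such as "957828603 instructions" or
# "58,003,648 L1-icache-misses:u". Timing lines ("0.282 seconds ...") do not
# match because the digits are followed by "." rather than whitespace.
COUNTER_LINE = re.compile(r"^\s*([\d,]+)\s+([A-Za-z][\w\-:.]*)")


def parse_perf_stat(text):
    """Return {event_name: count} for the counter lines in perf stat output.

    Hypothetical helper: it assumes the default text layout and strips
    thousands separators and event modifiers such as ":u".
    """
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            count = int(match.group(1).replace(",", ""))
            event = match.group(2).split(":")[0]
            counters[event] = count
    return counters


# Sample text copied from the perf stat run shown above.
sample_output = """
 Performance counter stats for './out/bsort':

     957828603 instructions
      58003648 L1-icache-misses

   0.282472631 seconds time elapsed
"""

counters = parse_perf_stat(sample_output)
print(counters)
```

Note that perf stat writes its report to standard error by default, so you would need to redirect that stream if you capture the output in a script.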

Interpreting the results

In this example, the program is about 56% front-end bound (55.73%), which indicates that the processor frequently stalls while waiting for instructions. At Stage 2, the microarchitectural metrics report an L1I cache MPKI of about 60, which strongly suggests poor instruction locality. Both values are well above the typical thresholds for good BOLT candidates: 10% front-end bound and 30 MPKI.

The branch MPKI of about 17 also indicates frequent branch mispredictions, which code layout optimization may reduce.
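The comparison against the two thresholds can be sketched in Python as follows. The names `CandidateCheck` and `check_bolt_candidate` are hypothetical, and the sketch reports each criterion separately because the guidance lists the two conditions without saying whether one alone is enough. Treating "both met" as the strongest signal is an assumption here, not a rule from the BOLT community.

```python
from dataclasses import dataclass

# Rule-of-thumb thresholds quoted earlier in this section.
FRONTEND_BOUND_MIN_PCT = 10.0
L1I_MPKI_MIN = 30.0


@dataclass
class CandidateCheck:
    """Outcome of each criterion (hypothetical helper type)."""
    frontend_bound: bool
    l1i_mpki: bool

    @property
    def meets_both(self):
        # Assumption: both criteria together are the clearest signal.
        return self.frontend_bound and self.l1i_mpki


def check_bolt_candidate(frontend_bound_pct, l1i_mpki):
    """Compare measured values against the two rule-of-thumb thresholds."""
    return CandidateCheck(
        frontend_bound=frontend_bound_pct > FRONTEND_BOUND_MIN_PCT,
        l1i_mpki=l1i_mpki > L1I_MPKI_MIN,
    )


# Values taken from the topdown-tool output for ./out/bsort.
result = check_bolt_candidate(frontend_bound_pct=55.73, l1i_mpki=60.408)
print(result.frontend_bound, result.l1i_mpki, result.meets_both)
```

For the example program, both criteria are met, which matches the interpretation above.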

Computing MPKI manually

The topdown-tool collects performance counters using perf and applies formulas to derive higher-level metrics.

To compute the L1I cache MPKI manually from the perf stat output, apply the following formula:

$$\frac{(\text{L1-icache-misses} \times 1000)}{\text{instructions}}$$
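As a worked example, the formula can be applied to the counter values from the perf stat run shown earlier. The helper name `mpki` is an illustrative choice, not part of perf or topdown-tool.

```python
def mpki(misses, instructions):
    """Misses per 1,000 instructions: (misses * 1000) / instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000 / instructions


# Counter values from the perf stat output for ./out/bsort.
l1i_mpki = mpki(misses=58003648, instructions=957828603)
print(f"L1I cache MPKI: {l1i_mpki:.2f}")  # prints "L1I cache MPKI: 60.56"
```

The result is close to, though not identical to, the 60.408 reported by topdown-tool. The two figures come from separate runs of the program, so a small difference is not surprising.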

What you’ve learned and what’s next

You’ve learned how to evaluate whether a program is a good candidate for BOLT optimization by analyzing front-end stalls and L1I cache MPKI. The example program shows clear signs of poor instruction locality: it is about 56% front-end bound and has an L1I cache MPKI of about 60.

In the following sections, you’ll explore different profiling methods to collect the data BOLT needs for optimization, starting with BRBE profiling.
