The example below uses a binary optimized with BRBE profiling. You can apply the same verification steps to binaries optimized using the other BOLT profiling methods.
First, compare the runtime of the original and optimized BubbleSort binaries. A shorter runtime provides an initial indication that BOLT improved the code layout.
time out/bsort
Bubble sorting 10000 elements
280 ms (first=100669 last=2147469841)
out/bsort 0.28s user 0.00s system 99% cpu 0.282 total
time out/bsort.opt.brbe
Bubble sorting 10000 elements
147 ms (first=100669 last=2147469841)
out/bsort.opt.brbe 0.15s user 0.00s system 99% cpu 0.148 total
In this example, the optimized binary runs in about 147 ms, compared with 280 ms for the original binary. This corresponds to roughly a 2× speedup. The improvement is large because the example program intentionally creates poor code locality. Real applications typically show smaller but still meaningful improvements after BOLT optimization.
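As a quick sanity check, you can compute the speedup ratio directly from the two measured runtimes. This is a minimal sketch using awk with the values reported above (280 ms for the original, 147 ms for the optimized binary); substitute your own measurements:

```shell
# Compute the speedup ratio from two measured runtimes (milliseconds).
orig_ms=280   # runtime of the original binary
opt_ms=147    # runtime of the BOLT-optimized binary
awk -v o="$orig_ms" -v n="$opt_ms" 'BEGIN { printf "Speedup: %.2fx\n", o / n }'
# Prints: Speedup: 1.90x
```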
Next, apply the TopDown Methodology again to verify that BOLT improved the code layout. The runtime comparison shows the performance impact, but the TopDown metrics reveal how the optimization affects processor behavior. Run the same tool used earlier when evaluating whether the program was a good BOLT candidate, this time on the optimized binary (here, the BRBE-optimized version):
topdown-tool ./out/bsort.opt.brbe
__output__ CPU Neoverse V1 metrics
__output__ ├── Stage 1 (Topdown metrics)
__output__ │ └── Topdown Level 1 (Topdown_L1)
__output__ │ └── ┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓
__output__ │ ┃ Metric ┃ Value ┃ Unit ┃
__output__ │ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩
__output__ │ │ Backend Bound │ 11.19 │ % │
__output__ │ │ Bad Speculation │ 24.86 │ % │
__output__ │ » │ Frontend Bound │ 36.10 │ % │ «
__output__ │ │ Retiring │ 28.42 │ % │
__output__ │ └─────────────────┴───────┴──────┘
__output__ └── Stage 2 (uarch metrics)
__output__ ├── Misses Per Kilo Instructions (MPKI)
__output__ │ └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
__output__ │ ┃ Metric ┃ Value ┃ Unit ┃
__output__ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
__output__ │ │ Branch MPKI │ 9.799 │ misses per 1,000 instructions │
__output__ │ » │ L1I Cache MPKI │ 0.019 │ misses per 1,000 instructions │ «
__output__ │ └─────────────────────────┴───────┴───────────────────────────────┘
__output__ ...
perf stat -e instructions,L1-icache-misses:u ./out/bsort.opt.brbe
__output__ Performance counter stats for './out/bsort.opt.brbe':
__output__
__output__ 982204165 instructions
__output__ 3807 L1-icache-misses
__output__
__output__ 0.147606245 seconds time elapsed
__output__
__output__ 0.147644000 seconds user
__output__ 0.000000000 seconds sys
Compare these metrics with the earlier results collected from the original binary. After optimization, both frontend bound and L1I MPKI should decrease.
In this example, the optimized program is 36% frontend bound, down from 55%. The L1I cache MPKI drops to nearly 0, which indicates a significant improvement in instruction locality.
This near-zero value is atypical: because the example program intentionally creates poor code locality, BOLT's impact is exaggerated here, and real applications typically see smaller reductions.
The Branch MPKI also decreases, from 16 to about 10, because BOLT can improve branch behavior: it uses profile data to adjust code layout and swaps fall-through and taken paths when beneficial.
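To put that drop in perspective, you can express it as a percentage reduction. This sketch uses awk with the two Branch MPKI values from this example (16 before optimization, 9.799 after):

```shell
# Percentage reduction in Branch MPKI after BOLT optimization.
awk -v before=16 -v after=9.799 \
    'BEGIN { printf "Branch MPKI reduction: %.0f%%\n", (before - after) / before * 100 }'
# Prints: Branch MPKI reduction: 39%
```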
You can also compute the MPKI values manually using perf stat, as described in the Good BOLT Candidates section.
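For example, the L1I cache MPKI is the number of L1I cache misses divided by the number of retired instructions, multiplied by 1,000. This sketch computes it with awk from the perf stat counts shown above; substitute the counts from your own run:

```shell
# Manually compute L1I MPKI from perf stat counts:
# MPKI = (cache misses / instructions) * 1000
misses=3807              # L1-icache-misses reported by perf stat
instructions=982204165   # instructions reported by perf stat
awk -v m="$misses" -v i="$instructions" \
    'BEGIN { printf "L1I MPKI: %.3f\n", m / i * 1000 }'
# Prints: L1I MPKI: 0.004
```

Note that counts vary slightly between runs, so a manually computed MPKI may differ a little from the value topdown-tool reports.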
In this learning path, you learned how to use BOLT to optimize binary code layout using several profiling methods. The optimized binaries improved instruction locality, reduced frontend stalls, and delivered measurable performance gains.