The example below uses a binary optimized with BRBE profiling. You can apply the same verification steps to binaries optimized using the other BOLT profiling methods.
First, compare the runtime of the original and optimized BubbleSort binaries. A shorter runtime provides an initial indication that BOLT improved the code layout.
time out/bsort
Bubble sorting 10000 elements
280 ms (first=100669 last=2147469841)
out/bsort 0.28s user 0.00s system 99% cpu 0.282 total
time out/bsort.opt.brbe
Bubble sorting 10000 elements
147 ms (first=100669 last=2147469841)
out/bsort.opt.brbe 0.15s user 0.00s system 99% cpu 0.148 total
In this example, the optimized binary runs in about 147 ms, compared with 280 ms for the original binary. This corresponds to roughly a 2× speedup. The improvement is large because the example program intentionally creates poor code locality. Real applications typically show smaller but still meaningful improvements after BOLT optimization.
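As a quick sanity check, you can compute the speedup ratio directly from the two measured runtimes. This is a minimal sketch using awk with the values reported above (280 ms for the original, 147 ms for the optimized binary); substitute your own measurements:

```shell
# Compute the speedup ratio from two measured runtimes (milliseconds).
orig_ms=280   # runtime of the original binary
opt_ms=147    # runtime of the BOLT-optimized binary
awk -v o="$orig_ms" -v n="$opt_ms" 'BEGIN { printf "Speedup: %.2fx\n", o / n }'
# Prints: Speedup: 1.90x
```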
Next, apply the TopDown Methodology again to verify that BOLT improved the code layout. The runtime comparison shows the performance impact, but the TopDown metrics reveal how the optimization affects processor behavior. Run the same tool used earlier when evaluating whether the program was a good BOLT candidate, this time on the optimized binary (here, the BRBE-optimized version):
topdown-tool ./out/bsort.opt.brbe
__output__ CPU Neoverse V1 metrics
__output__ ├── Stage 1 (Topdown metrics)
__output__ │ └── Topdown Level 1 (Topdown_L1)
__output__ │ └── ┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓
__output__ │ ┃ Metric ┃ Value ┃ Unit ┃
__output__ │ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩
__output__ │ │ Backend Bound │ 11.19 │ % │
__output__ │ │ Bad Speculation │ 24.86 │ % │
__output__ │ » │ Frontend Bound │ 36.10 │ % │ «
__output__ │ │ Retiring │ 28.42 │ % │
__output__ │ └─────────────────┴───────┴──────┘
__output__ └── Stage 2 (uarch metrics)
__output__ ├── Misses Per Kilo Instructions (MPKI)
__output__ │ └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
__output__ │ ┃ Metric ┃ Value ┃ Unit ┃
__output__ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
__output__ │ │ Branch MPKI │ 9.799 │ misses per 1,000 instructions │
__output__ │ » │ L1I Cache MPKI │ 0.019 │ misses per 1,000 instructions │ «
__output__ │ └─────────────────────────┴───────┴───────────────────────────────┘
__output__ ...
perf stat -e instructions,L1-icache-misses:u ./out/bsort.opt.brbe
__output__ Performance counter stats for './out/bsort.opt.brbe':
__output__
__output__ 982204165 instructions
__output__ 3807 L1-icache-misses
__output__
__output__ 0.147606245 seconds time elapsed
__output__
__output__ 0.147644000 seconds user
__output__ 0.000000000 seconds sys
Compare these metrics with the earlier results collected from the original binary. After optimization, both frontend bound and L1I MPKI should decrease.
In this example, the optimized program is 36% frontend bound, down from 55%. The L1I cache MPKI drops to nearly 0, which indicates a significant improvement in instruction locality.
This near-zero value is atypical: because the example program intentionally creates poor code locality, BOLT's impact is exaggerated here, and real applications typically see smaller reductions.
The Branch MPKI also decreases, from 16 to about 10, because BOLT can improve branch behavior: it uses profile data to adjust code layout and swaps fall-through and taken paths when beneficial.
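To put that drop in perspective, you can express it as a percentage reduction. This sketch uses awk with the two Branch MPKI values from this example (16 before optimization, 9.799 after):

```shell
# Percentage reduction in Branch MPKI after BOLT optimization.
awk -v before=16 -v after=9.799 \
    'BEGIN { printf "Branch MPKI reduction: %.0f%%\n", (before - after) / before * 100 }'
# Prints: Branch MPKI reduction: 39%
```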
You can also compute the MPKI values manually using perf stat, as described in the Good BOLT Candidates section.
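For example, the L1I cache MPKI is the number of L1I cache misses divided by the number of retired instructions, multiplied by 1,000. This sketch computes it with awk from the perf stat counts shown above; substitute the counts from your own run:

```shell
# Manually compute L1I MPKI from perf stat counts:
# MPKI = (cache misses / instructions) * 1000
misses=3807              # L1-icache-misses reported by perf stat
instructions=982204165   # instructions reported by perf stat
awk -v m="$misses" -v i="$instructions" \
    'BEGIN { printf "L1I MPKI: %.3f\n", m / i * 1000 }'
# Prints: L1I MPKI: 0.004
```

Note that counts vary slightly between runs, so a manually computed MPKI may differ a little from the value topdown-tool reports.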
In this learning path, you learned how to use BOLT to optimize binary code layout using several profiling methods. The optimized binaries improved instruction locality, reduced frontend stalls, and delivered measurable performance gains.