A set of frequently asked questions is below.
If you can record ETM on your Arm Linux system you should use it. It will generate branch samples that can’t be recorded with samples or SPE and will provide maximum performance gains. If ETM is not available then use SPE. If neither is available then use samples.
You can use ETM AutoFDO which uses trace strobing to record small slices and will output much smaller files compared to ETM. See Using ETM AutoFDO section for more details. Otherwise you can use SPE or Samples. SPE also generates large files.
You should run the executable in the way that it is usually used. If it is run in a different way then BOLT’s optimizations may hurt instead of help.
You can either find an input that covers all possible input cases. If this is not possible there are 2 other solutions.
If you know which inputs go down which path you can record profiles for multiple cases and then create an optimized executable for each case. Then when you get input you need to pick which executable to run.
Another solution is to collect profiles for a range of inputs, generate a .fdata
file for each one using perf2bolt
and then combine these .fdata
files into one file and that can used to by BOLT.
For example, collect profiles with different inputs:
perf record -e cycles:u -o perf1.data -- ./executable abc
perf record -e cycles:u -o perf2.data -- ./executable def
perf record -e cycles:u -o perf3.data -- ./executable ghi
Convert each case to the .fdata
format:
perf2bolt -p perf1.data -o perf1.fdata -nl ./executable
perf2bolt -p perf2.data -o perf2.fdata -nl ./executable
perf2bolt -p perf3.data -o perf3.fdata -nl ./executable
Combine the .fdata
files into a single file:
merge-fdata *.fdata > combined.fdata
Run BOLT to optimize and create the new executable based on combined.fdata
:
llvm-bolt ./executable -o ./new_executable -data combined.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
The easiest way to check is look at the names of the sections in the new executable because BOLT adds a few new sections with bolt in the name.
objdump -h ./new_executable | grep bolt
If you see the .bolt
sections you know this is the new executable:
12 .bolt.org.text 0000243c 0000000000000a80 0000000000000a80 00000a80 2**3
15 .bolt.org.eh_frame_hdr 00000184 00000000000034c8 00000000000034c8 000034c8 2**2
16 .bolt.org.eh_frame 0000056c 0000000000003650 0000000000003650 00003650 2**3
30 .note.bolt_info 000000b0 0000000000000000 0000000000000000 00404bfc 2**0
Check the original executable to see that these sections didn’t exist there before.
objdump -h ./executable | grep bolt
The output should be empty.
You can check if it takes less time to run. This is fine for applications which run for some time, performing some operations and then exit but if it is a service and is just waiting for new input there are other methods check it has been improved.
Use the time
command to see how long each executable took and see if it has reduced.
For example, run both executables:
time ./executable
time ./new_executable
real 1m54.578s
user 1m51.222s
sys 0m2.700s
real 1m30.844s
user 1m26.917s
sys 0m2.853s
The new_executable
ran for 23.8 seconds less.
Perf stat can be used to measure time and record event counters.
For example, run both executables:
perf stat -e task-clock,instructions,cycles,l1i_cache_refill,branch-misses -- ./executable
perf stat -e task-clock,instructions,cycles,l1i_cache_refill,branch-misses -- ./new_executable
Performance counter stats for './executable':
113326.10 msec task-clock # 0.994 CPUs utilized
617222198064 instructions # 2.09 insn per cycle
294642055705 cycles # 2.600 GHz
9510653199 l1i_cache_refill # 83.923 M/sec
852678168 branch-misses
114.041719245 seconds time elapsed
110.326924000 seconds user
2.998557000 seconds sys
Performance counter stats for './new_executable':
89683.53 msec task-clock # 0.989 CPUs utilized
611936207251 instructions # 2.62 insn per cycle
233172220555 cycles # 2.600 GHz
2335713484 l1i_cache_refill # 26.044 M/sec
473958601 branch-misses
90.675836385 seconds time elapsed
These results also show how long each executable took to complete. BOLT rearranges the code in the executable into hot spots and that should reduce the number of instruction cache misses (l1i_cache_refill) and branch misses.