BOLT with Perf samples

The steps to optimize an executable with BOLT using Perf samples are below.

Collect Perf samples

Run your executable under its normal workload and collect a sample-based performance profile. Perf writes the profile to a perf.data file, which is used later to optimize the executable.

Record samples while running your application. Substitute the actual name of your application for executable:

    perf record -e cycles:u -o perf.data -- ./executable

Perf prints the total number of samples and the size of the perf.data file:

[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.381 MB perf.data (9957 samples) ]
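
When scripting this step, it can help to confirm that enough samples were captured before moving on. The sketch below extracts the sample count from a summary line like the one shown above. The function name `sample_count` is hypothetical, and the sketch assumes the summary keeps the `(N samples)` form.

```shell
#!/bin/sh
# Hypothetical helper: print the sample count from a perf record summary line.
# Assumes the line contains "(N samples)" as in the example output.
sample_count() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\) samples).*/\1/p'
}

summary='[ perf record: Captured and wrote 0.381 MB perf.data (9957 samples) ]'
sample_count "$summary"
```

If the line does not match the assumed form, the helper prints nothing, which a calling script can treat as a failed capture.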

        
    

Convert the profile into BOLT format

perf2bolt converts the profile into a BOLT data format. For the given sample data, perf2bolt finds all instruction pointers in the profile, maps them back to the assembly instructions, and outputs a count of how many times each assembly instruction was sampled.
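
The counting idea can be illustrated without BOLT. The sketch below counts how often each address appears in a list of sampled instruction pointers. The addresses are made up for illustration, the function name `count_samples` is hypothetical, and this is neither how perf2bolt is implemented nor what the perf.fdata format looks like.

```shell
#!/bin/sh
# Illustration only: count occurrences of each sampled address on stdin
# and print "address count", most frequently sampled first.
count_samples() {
    sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Invented addresses; real samples come from perf.data.
printf '%s\n' 0x4006a0 0x4006a4 0x4006a0 0x4006b8 0x4006a0 0x4006a4 | count_samples
```

Addresses that are sampled most often correspond to the hottest instructions, which is the information BOLT needs to decide what to lay out together.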

If your application is named executable, run the command below to convert the profile data:

    perf2bolt -p perf.data -o perf.fdata -nl ./executable

Below is example output from perf2bolt. It has read all samples and created the file perf.fdata.

BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for perf.data
PERF2BOLT: spawning perf job to read events without LBR
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version: c66c15a76dc7b021c29479a54aa1785928e9d1bf
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-INFO: enabling relocation mode
BOLT-INFO: disabling -align-macro-fusion on non-x86 platform
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-INFO: binary build-id is:     21dbca691155f1e57825e6381d727842f3d43039
PERF2BOLT: spawning perf job to read buildid list
PERF2BOLT: matched build-id and file name
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parsing basic events (without LBR)...
PERF2BOLT: waiting for perf mem events collection to finish...
PERF2BOLT: processing basic events (without LBR)...
PERF2BOLT: read 9957 samples
PERF2BOLT: out of range samples recorded in unknown regions: 7 (0.1%)
PERF2BOLT: wrote 321 objects and 0 memory objects to perf.fdata
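
The log reports how many samples were read and how many fell outside the binary. As a sanity check, the sketch below recomputes that percentage from the two log lines. The function name `out_of_range_pct` is hypothetical, and the sketch assumes the log lines keep the wording shown above.

```shell
#!/bin/sh
# Hypothetical helper: read perf2bolt log text on stdin and print the
# percentage of samples recorded outside the binary, to one decimal place.
out_of_range_pct() {
    awk '
        /PERF2BOLT: read [0-9]+ samples/ { total = $3 }
        /out of range samples/           { sub(/^.*: /, ""); oor = $1 }
        END { if (total > 0) printf "%.1f\n", 100 * oor / total }
    '
}

printf '%s\n' \
    'PERF2BOLT: read 9957 samples' \
    'PERF2BOLT: out of range samples recorded in unknown regions: 7 (0.1%)' |
    out_of_range_pct
```

A small percentage, as in this example, means almost all samples were attributed to the binary being optimized.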

        
    

Run BOLT to generate the optimized executable

The final step is to generate a new executable using perf.fdata.

To run BOLT, use the command below, substituting the name of your application:

    llvm-bolt ./executable -o ./new_executable -data perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

The output from llvm-bolt includes execution statistics for the executable before and after optimization. The second block shows the change relative to the first in parentheses:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version: c66c15a76dc7b021c29479a54aa1785928e9d1bf
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-INFO: enabling relocation mode
BOLT-INFO: disabling -align-macro-fusion on non-x86 platform
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: operating with basic samples profiling data (no LBR).
BOLT-INFO: normalizing samples by instruction count.
BOLT-INFO: number of removed linker-inserted veneers: 0
BOLT-INFO: 15 out of 52 functions in the binary (28.8%) have non-empty execution profile
BOLT-INFO: removed 1 empty block
BOLT-INFO: basic block reordering modified layout of 11 functions (73.33% of profiled, 17.46% of total)
BOLT-INFO: 1 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

              806550 : executed forward branches
              211117 : taken forward branches
             1302786 : executed backward branches
             1218161 : taken backward branches
               69927 : executed unconditional branches
               52487 : all function calls
               11166 : indirect calls
                   0 : PLT calls
             9949829 : executed instructions
             2116267 : executed load instructions
                   0 : executed store instructions
                   0 : taken jump table branches
                   0 : taken unknown indirect branches
             2179263 : total branches
             1499205 : taken branches
              680058 : non-taken conditional branches
             1429278 : taken conditional branches
             2109336 : all conditional branches
                   0 : linker-inserted veneer calls

             1094891 : executed forward branches (+35.7%)
               81610 : taken forward branches (-61.3%)
             1014445 : executed backward branches (-22.1%)
              877658 : taken backward branches (-28.0%)
              269990 : executed unconditional branches (+286.1%)
               52487 : all function calls (=)
               11166 : indirect calls (=)
                   0 : PLT calls (=)
            10514250 : executed instructions (+5.7%)
             2116267 : executed load instructions (=)
                   0 : executed store instructions (=)
                   0 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
             2379326 : total branches (+9.2%)
             1229258 : taken branches (-18.0%)
             1150068 : non-taken conditional branches (+69.1%)
              959268 : taken conditional branches (-32.9%)
             2109336 : all conditional branches (=)
                   0 : linker-inserted veneer calls (=)

BOLT-INFO: Starting stub-insertion pass
BOLT-INFO: Inserted 0 stubs in the hot area and 0 stubs in the cold area. Shared 0 times, iterated 1 times.
BOLT-INFO: padding code to 0x600000 to accommodate hot text
BOLT-INFO: setting _end to 0x600fb0
BOLT-INFO: setting __hot_start to 0x400000
BOLT-INFO: setting __hot_end to 0x400d88
BOLT-INFO: patched build-id (flipped last bit)
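
The percentages in the second statistics block are relative changes against the first block. The sketch below recomputes two of them from the numbers above. The function name `pct_change` is hypothetical, and the sketch assumes the values are rounded to one decimal place.

```shell
#!/bin/sh
# Hypothetical helper: print the relative change from "before" to "after"
# as a signed percentage with one decimal place.
pct_change() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before == 0) { print "n/a"; exit }
        printf "%+.1f%%\n", 100 * (after - before) / before
    }'
}

pct_change 1499205 1229258    # taken branches
pct_change 9949829 10514250   # executed instructions
```

In this example the number of taken branches drops even though the number of executed instructions rises slightly.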

        
    

The optimized executable is now available as new_executable.
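
The three steps can be chained in a small script. The following is a minimal sketch, not an official workflow: the `run` and `bolt_pipeline` function names and the `APP`, `OUT`, and `DRY_RUN` variables are hypothetical, and the commands and flags are copied unchanged from this section. With `DRY_RUN=1`, the default here, the commands are only printed, so the script can be tried without perf or BOLT installed.

```shell
#!/bin/sh
# Sketch: chain the three steps from this section.
# APP and OUT are placeholder names; override them in the environment.
APP=${APP:-./executable}
OUT=${OUT:-./new_executable}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode; otherwise run it and stop on failure.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@" || exit 1
    fi
}

bolt_pipeline() {
    run perf record -e cycles:u -o perf.data -- "$APP"
    run perf2bolt -p perf.data -o perf.fdata -nl "$APP"
    run llvm-bolt "$APP" -o "$OUT" -data perf.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -split-eh -dyno-stats
}

bolt_pipeline
```

Setting `DRY_RUN=0` runs the commands for real and stops at the first one that fails.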
