Capture a performance profile with Streamline CLI tools

Profiling with the Streamline CLI tools is a three-step process:

  • Use sl-record to capture the raw sampled data for the profile.
  • Use sl-analyze to pre-process the raw sampled data to create a set of function-attributed counters and metrics.
  • Use sl-format.py to pretty-print the function-attributed metrics in a more human-readable form.

[Figure: Streamline CLI tools workflow]

Procedure

  1. Download and extract the Streamline CLI tools on your Arm server:

        wget https://artifacts.tools.arm.com/arm-performance-studio/2024.2/Arm_Streamline_CLI_Tools_9.2.0_linux_arm64.tgz
        tar -xzf Arm_Streamline_CLI_Tools_9.2.0_linux_arm64.tgz
    

    Follow the instructions in the Install Guide to ensure you have everything set up correctly.

  2. The sl-format.py Python script requires Python 3.8 or later, and depends on several third-party modules. We recommend creating a Python virtual environment containing these modules to run the tools. For example:

        # From Bash
        python3 -m venv sl-venv
        source ./sl-venv/bin/activate

        # From inside the virtual environment
        python3 -m pip install -r ./streamline_cli_tools/bin/requirements.txt
    
    Note

    The instructions in this guide assume you have added the <install>/bin/ directory to your PATH environment variable, and that you run all Python commands from inside the virtual environment.
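    Before capturing, it can help to confirm the two prerequisites above. The following is a minimal sketch, not part of the Streamline tools: python_version_ok is a hypothetical helper name, and the commented usage assumes python3 and sl-record are the commands you intend to run.

```shell
# Return success (0) if a "major.minor" Python version string is 3.8 or later.
# python_version_ok is a hypothetical helper, not part of the Streamline tools.
python_version_ok() {
    local major="${1%%.*}"      # text before the first dot
    local rest="${1#*.}"        # text after the first dot
    local minor="${rest%%.*}"   # minor version, ignoring any patch level
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# Example use, from inside the virtual environment:
#   ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
#   python_version_ok "$ver" || echo "Python $ver is too old for sl-format.py"
#   command -v sl-record >/dev/null || echo "sl-record is not on PATH"
```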

  3. Use sl-record to capture a raw profile of your application and save the data to a directory on the filesystem.

        sl-record -C workflow_topdown_basic -o <output.apc> -A <your app command-line>
    
    • The -C workflow_topdown_basic option selects a predefined group of counters and metrics, and provides a good baseline to start with. Alternatively, you can provide a comma-separated list of specific counters and metrics to capture. To list all of the available counters and metrics for the current machine, use the command sl-record --print counters.

    • The -o option provides the output directory for the capture data. The directory must not already exist because it is created by the tool when profiling starts.

    • The -A option provides the command-line for the user application. This option must be the last option provided to sl-record because all subsequent arguments are passed to the user application.

      • Your application should be a release build, but needs to include symbol information. Build your application with the -g option to include symbol information. Arm recommends that you disable link-time-optimization to make the profile easier to understand.

      • If you are using the workflow_topdown_basic option, ensure that your application workload is at least 20 seconds long, in order to give the core time to capture all of the metrics needed. This time increases linearly as you add more metrics to capture.

    • Optionally, to enable SPE, add the -X workflow_spe option. Enabling SPE significantly increases the amount of data captured and the sl-analyze processing time, so only use this option if you need this data.
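    Because the -o directory must not already exist, repeated captures need a new directory name each time. One possible approach, sketched here with a hypothetical helper named fresh_capture_dir, is to append a counter until an unused name is found:

```shell
# Print a capture directory name that does not exist yet, because sl-record
# creates the -o directory itself. fresh_capture_dir is a hypothetical helper.
fresh_capture_dir() {
    local base="$1"
    local candidate="${base}.apc"
    local n=1
    while [ -e "$candidate" ]; do
        candidate="${base}-${n}.apc"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Example use (the application command line is a placeholder):
#   out="$(fresh_capture_dir my_capture)"
#   sl-record -C workflow_topdown_basic -o "$out" -A <your app command-line>
```

    The helper only prints a name; it leaves directory creation to sl-record.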

    Tip

    Captures are highly customizable, with many different options that allow you to choose how to profile your application. Use the --help option to see the full list of options for customizing your captures.
    
  4. Use sl-analyze to process the raw profile of your application and save the analysis output as several CSV files on the filesystem.

        sl-analyze --collect-images -o <output_dir> <input_dir.apc>
    
    • The --collect-images option instructs the tool to assemble all of the referenced binaries and split debug files required for analysis. The files are copied and stored inside the .apc directory, making them ready for analysis.

    • The -o option provides the output directory for the generated CSV files.

    • The positional argument input_dir.apc is the raw profile directory created by sl-record.

    The function profile CSV files generated by sl-analyze contain all the enabled events and metrics, for all functions that were sampled in the profile:

    Filename                    Description
    functions-<filename>.csv    A flat list of functions, sorted by cost, showing per-function metrics.
    callpaths-<filename>.csv    A hierarchical list of function call paths in the application, showing per-function metrics for each function per call path location.
    <filename>-bt.csv           Results from the analysis of the software-sampled performance counter data, which can include back-traces for each sample.
    <filename>-spe.csv          Results from the analysis of the hardware-sampled Statistical Profiling Extension (SPE) data. SPE data does not include call back-trace information.

  5. Use the sl-format.py script to generate a simpler, pretty-printed XLSX spreadsheet from the CSV files that sl-analyze generated in the previous step. The script formats the metrics columns to make the data more readable, and adds colors to highlight bad values.

        python3 sl-format.py -o <output.xlsx> <input.csv>
    
    • The -o option provides the output file path to save the XLSX file to.
    • The positional argument is the functions-*.csv file created by sl-analyze.

    Refer to the Streamline CLI Tools user guide to learn how you can create and specify custom format definitions that are used to change the pretty-printed data visualization.
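    If an analysis directory holds more than one functions-*.csv file, the formatting step can be repeated in a loop. This is a minimal sketch: xlsx_path_for is a hypothetical helper name, and placing each spreadsheet next to its CSV file is an assumption, not a requirement of the tool.

```shell
# Map a CSV path to an .xlsx path alongside it by swapping the file extension.
# xlsx_path_for is a hypothetical helper, not part of the Streamline tools.
xlsx_path_for() {
    printf '%s\n' "${1%.csv}.xlsx"
}

# Example use, where <output_dir> is the directory passed to sl-analyze -o:
#   for csv in <output_dir>/functions-*.csv; do
#       python3 sl-format.py -o "$(xlsx_path_for "$csv")" "$csv"
#   done
```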

  6. View your report in Excel or another compatible application. In functions reports, problem areas are indicated in red, to help you focus on the main problems.

[Figure: An example functions report]

See our example report to learn more about how to interpret the results.

Capturing a system-wide profile

To capture a system-wide profile, which captures all processes and threads, run sl-record with the -S yes option and omit the -A application-specific option and following arguments.

In systems without the kernel patches, system-wide profiles can capture the top-down metrics. To keep the captures to a usable size, it may be necessary to limit the duration of the profiles to less than 5 minutes.

Capturing top-down metrics without the kernel patches

To capture top-down metrics in a system without the kernel patches, there are three options available:

  • To capture a system-wide profile, which captures all processes and threads, run with the -S yes option and omit the -A application-specific option and following arguments. To keep the captures to a usable size, it may be necessary to limit the duration of the profiles to less than 5 minutes.

  • To reliably capture a single-threaded application profile, add the --inherit no option to the command line. However, in this mode metrics are only captured for the first thread in the application process, and any child threads or processes are ignored.

  • For multi-threaded applications, the tool provides an experimental option, --inherit poll, which uses polling to spot new child threads and inject the instrumentation. This allows metrics to be captured for a multi-threaded application, but has some limitations:

    • Short-lived threads may not be detected by the polling.

    • Attaching perf to new threads without inherit support requires many new file descriptors to be created per thread. This can result in the application failing to open files because the process hits its limit on open file descriptors.
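The three options above map directly onto sl-record options, which a wrapper script could select by name. This is a sketch only: capture_mode_flags and its mode names are hypothetical, while the flags it prints are the ones listed above.

```shell
# Print the sl-record options that this guide gives for each capture mode on a
# system without the kernel patches. capture_mode_flags and the mode names
# (system, single-thread, multi-thread) are hypothetical.
capture_mode_flags() {
    case "$1" in
        system)        printf '%s\n' '-S yes' ;;
        single-thread) printf '%s\n' '--inherit no' ;;
        multi-thread)  printf '%s\n' '--inherit poll' ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Before using --inherit poll, it may be worth checking the shell's current
# limit on open file descriptors, which extra per-thread descriptors count
# against:
#   ulimit -n
```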

Minimizing profiling application impact

The sl-record application requires some portion of the available processor time to capture the data and prepare it for storage. When profiling a system with a high number of CPU cores, Arm recommends that you leave a small number of cores free so that the profiler can run in parallel without impacting the application. You can achieve this in two different ways:

  • Running an application with fewer threads than the number of cores available.

  • Running the application under taskset to limit the number of cores that the application can use. You must only taskset the application, not sl-record, for example:

        sl-record -C … -o … -A taskset <core_mask> <your app command-line>
    
Note

The number of samples made is independent of the number of counters and metrics that you enable. Enabling more counters reduces the effective sample rate per counter, and does not significantly increase the performance impact that capturing has on the running application.
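For the taskset approach described above, the cores given to the application can be derived from the core count. The sketch below uses a hypothetical helper, app_cpu_list, and produces a CPU list for taskset -c rather than the hexadecimal mask form; reserving the highest-numbered cores for the profiler is an assumption made for illustration.

```shell
# Print a CPU list covering all but the last <reserve> cores, for use with
# "taskset -c". app_cpu_list is a hypothetical helper.
app_cpu_list() {
    local total="$1"     # number of cores in the system
    local reserve="$2"   # number of cores to leave free for sl-record
    if [ "$reserve" -ge "$total" ]; then
        echo "reserve must be smaller than the core count" >&2
        return 1
    fi
    printf '0-%d\n' "$((total - reserve - 1))"
}

# Example use, leaving 4 cores free (other options elided as in the guide):
#   sl-record -C … -o … -A taskset -c "$(app_cpu_list "$(nproc)" 4)" <your app command-line>
```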
