Identify application bottlenecks with the CPU Microarchitecture recipe

Optimize application performance using Arm Performix CPU microarchitecture analysis

Log an issue

Fork and edit

Discuss on Discord

Optimize application performance using Arm Performix CPU microarchitecture analysis

Run the CPU Microarchitecture recipe

To identify performance bottlenecks, run the CPU Microarchitecture recipe in Arm Performix. Arm Performix uses microarchitectural sampling to show which instruction pipeline stages dominate program latency, and then highlights ways to improve those bottlenecks.

Start by reviewing the code in main.cpp, the program generates a 1920×1080 bitmap image of the fractal.

    

        
        
#include "Mandelbrot.h"
#include <iostream>

using namespace std;

int main(int argc, char* argv[]){

    const int NUM_THREADS = argc == 2 ? std::stoi(argv[1]) : 1;
    std::cout << "Number of Threads = " << NUM_THREADS << std::endl;

    Mandelbrot::Mandelbrot myplot(1920, 1080, NUM_THREADS);
    myplot.draw("Green-Parallel-512.bmp", Mandelbrot::Mandelbrot::GREEN);

    return 0;
}

When Arm Performix launches the executable on the target machine, it does so from a temporary agent directory. If your code uses a relative path to save the image, the image is written to that temporary folder and might be deleted or you can’t find it.

To prevent this, edit the myplot.draw() line in main.cpp to use the absolute path to your project’s image folder (for example, /home/ubuntu/mandelbrot-example/Green-Parallel-512.bmp), and then rebuild the application.

In the Arm Performix application on your host machine, select the CPU Microarchitecture recipe.

Image Alt Text:Arm Performix CPU Microarchitecture configuration screen CPU Microarchitecture Configuration

Select the target you configured in the setup section. If this is your first run on this target, you might need to select Install Tools to copy the collection tools to the target. After the tools are installed, you will see that the target is ready.

Next, select the Workload type and select Launch a new process.

Enter the absolute path to your executable in the Workload field. For example, /home/ubuntu/mandelbrot-example/builds/mandelbrot-baseline. Make sure to add the number of threads argument.

Note

Use the full path to your executable because the Workload field doesn’t currently support shell-style path expansion.

Before starting the analysis, you can customize the configuration. For instance, you can set a time limit for the workload or choose specific metrics to investigate. You can also adjust the sampling rate (High, Normal, or Low) to balance collection overhead against sampling granularity. Because this Mandelbrot example is a native C++ application, you can ignore the Collect managed code stacks toggle, which is used for Java or .NET workloads.

When your configuration is ready, select Run Recipe to launch the workload and collect the performance data.

View the run results

Arm Performix generates a high-level instruction pipeline view, highlighting where the most time is spent.

Image Alt Text:Arm Performix high-level instruction pipeline results Instruction Pipeline View

In this breakdown, you see frontend and backend Stalls. Within the backend stalls, work is split between integer and floating-point operations.

There is no measured SIMD activity, even though this workload is highly parallelizable.

The Insights panel explains the highest category, which is frontend stalls for this system.

Image Alt Text:Arm Performix insights panel highlighting frontend stalls Insights Panel

To inspect executed instruction types in more detail, use the Instruction Mix recipe in the next step.

What you’ve learned and what’s next

In this section:

You ran the CPU Microarchitecture recipe on the Mandelbrot application.
You identified that the application spends most of its time in Backend Stalls without using SIMD operations.

Next, you’ll run the Instruction Mix recipe to confirm where optimization opportunities exist and implement vectorization.

Back

Optimize application performance using Arm Performix CPU microarchitecture analysis

Introduction

Set up the target environment and compile the application

Identify application bottlenecks with the CPU Microarchitecture recipe

Analyze SIMD utilization with the Instruction Mix recipe

Next Steps

Optimize application performance using Arm Performix CPU microarchitecture analysis

Run the CPU Microarchitecture recipe

View the run results

What you’ve learned and what’s next