Understand your goal

If you intend for your application to be portable across a variety of Arm servers, select a target architecture using the -march= flag, with a value matching the lowest Arm architecture version among the systems you plan to use (for example, -march=armv8-a). This works because the Arm architecture is backward compatible: code built for an earlier architecture version also runs on later ones.

If you are running your C++ application in a memory-constrained environment, such as a containerized environment, you should optimize for size, for example with the -Os flag.

If you are building an application to achieve the highest performance on a specific processor, such as AWS Graviton4 (Arm Neoverse V2), consider specifying that processor with the -mcpu flag.

While these are general guidelines, you should experiment with the various optimization settings.

What is a vectorizable loop?

Use an editor to copy and paste the C++ code below into a file named vectorizable_loop.cpp.

The code initializes a vector of 100 million elements, doubles each element, and stores the result in the same vector. This is repeated 5 times to calculate the average runtime.

    

        
        
#include <vector>
#include <chrono>
#include <iostream>
#include <unistd.h> // for getpid()

void vectorizable_loop(std::vector<int>& data) {
    double total_elapsed_time = 0.0;
    const int iterations = 5;

    for (int iter = 0; iter < iterations; ++iter) {
        auto start = std::chrono::high_resolution_clock::now();
        
        for (size_t i = 0; i < data.size(); ++i) {
            data[i] *= 2;
        }
        
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = end - start;
        total_elapsed_time += elapsed.count();
        std::cout << "Elapsed time for iteration " << iter + 1 << ": " << elapsed.count() << " seconds" << std::endl;
    }

    double average_elapsed_time = total_elapsed_time / iterations;
    std::cout << "Average elapsed time: " << average_elapsed_time << " seconds" << std::endl;
}

int main() {
    const size_t size = 100000000; // 100 million elements
    std::vector<int> data(size, 1);

    std::cout << "Process ID (PID): " << getpid() << std::endl;

    vectorizable_loop(data);

    return 0;
}

    

Use different versions of the g++ compiler

Compare different compiler versions by building vectorizable_loop.cpp using the same arguments with different compiler versions.

Run the commands below to build with the default g++ compiler (13.3) and then with the g++ version 9 installed earlier, using the same flags, and run both binaries:

    

        
        
g++ vectorizable_loop.cpp -o vectorizable_loop_gcc_13
g++-9 vectorizable_loop.cpp -o vectorizable_loop_gcc_9
./vectorizable_loop_gcc_13
./vectorizable_loop_gcc_9

    

In this simple example, the average elapsed time drops by about 20% when moving from version 9 to version 13 (from about 0.363 seconds to about 0.292 seconds).

The output is shown below:

    

        
// gcc v.13
Process ID (PID): 5130
Elapsed time for iteration 1: 0.291529 seconds
Elapsed time for iteration 2: 0.291689 seconds
Elapsed time for iteration 3: 0.291874 seconds
Elapsed time for iteration 4: 0.292063 seconds
Elapsed time for iteration 5: 0.29209 seconds
Average elapsed time: 0.291849 seconds

// gcc v.9
Process ID (PID): 5131
Elapsed time for iteration 1: 0.363335 seconds
Elapsed time for iteration 2: 0.362316 seconds
Elapsed time for iteration 3: 0.361944 seconds
Elapsed time for iteration 4: 0.36257 seconds
Elapsed time for iteration 5: 0.362911 seconds
Average elapsed time: 0.362615 seconds

        
    

This shows that the compiler version alone, with no change to the source code or flags, can affect performance.

Target performance

Another way to observe the compiler's impact is to vary the optimization level.

Compile the application with three different optimization levels:

    

        
        
g++ -O1 vectorizable_loop.cpp -o level_1
g++ -O2 vectorizable_loop.cpp -o level_2
g++ -O3 vectorizable_loop.cpp -o level_3

    

When running the three output binaries, you can see a significant improvement in elapsed time with minimal change in file size and compile time. Note that larger code bases might see larger output binary sizes and longer compilation times.

    

        
        
./level_1
./level_2
./level_3

    

The final line of output from each executable, in the order level_1, level_2, level_3, is shown below:

    

        
Average elapsed time: 0.0526484 seconds
Average elapsed time: 0.0420332 seconds
Average elapsed time: 0.0155661 seconds

        
    

The average elapsed time falls from about 53 ms at -O1 to about 42 ms at -O2 and about 16 ms at -O3, roughly a 3.4x speedup from -O1 to -O3.

Note

To see which individual optimizations are enabled by -O1, -O2, and -O3, you can use the g++ <optimization level> -Q --help=optimizers command.

Understanding the optimizations

The natural next question is which parts of your source code were optimized differently between the builds above. Compilers like GCC can generate full optimization reports that trace the code through each stage of the optimization process. For beginners, these reports can be overwhelming because of the sheer volume of information they contain.

For a more manageable overview, you can enable basic optimization information (opt-info) reports with arguments such as -fopt-info-vec, which reports only vectorization. You can change the suffix of the -fopt-info flag to target other kinds of optimization, for example -fopt-info-loop or -fopt-info-inline, or append -missed (as in -fopt-info-vec-missed) to report optimizations that were not applied.

First, to check whether the vectorizable loop is vectorized at level 1, compile with -O1 and -fopt-info-vec:

    

        
        
g++ -O1 vectorizable_loop.cpp -o level_1 -fopt-info-vec

    

Compiling with the -O1 flag produces no terminal output, which indicates that no loops were vectorized.

Next, run the command below with the -O2 flag:

    

        
        
g++ -O2 vectorizable_loop.cpp -o level_2 -fopt-info-vec

    

At -O2 the loop is vectorized. The output below reports 16-byte (128-bit) vectors, each of which holds four 32-bit int elements:

    

        
vectorizable_loop.cpp:13:30: optimized: loop vectorized using 16 byte vectors
/usr/include/c++/13/bits/stl_algobase.h:930:22: optimized: loop vectorized using 16 byte vectors

        
    

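To see which optimizations were applied at level 2 and at level 3, you can direct the terminal output from all optimization passes (-fopt-info) to a text file with the commands below. By default -fopt-info reports only successful optimizations; to record missed ones as well, use -fopt-info-missed or -fopt-info-all instead: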

    

        
        
g++ -O2 vectorizable_loop.cpp -o level_2 -fopt-info 2>&1 | tee level2.txt
g++ -O3 vectorizable_loop.cpp -o level_3 -fopt-info 2>&1 | tee level3.txt

    

Comparing the two files, for example with the diff command, highlights which optimizations were applied at one level but not at the other, and so where in your source code opportunities to optimize were missed.

Target balanced performance

For balanced performance, AWS recommends using the -mcpu=neoverse-512tvb option. The value neoverse-512tvb instructs GCC to tune for Neoverse cores that implement SVE and have a total vector bandwidth of 512 bits per cycle. It does not mean a vector length of 512 bits.

This option directs GCC to target Neoverse cores capable of executing four 128-bit Advanced SIMD arithmetic instructions per cycle, as well as an equivalent number of SVE arithmetic instructions per cycle (two for 256-bit SVE, four for 128-bit SVE). This tuning is more general than optimizing for a specific core like Neoverse V1, yet more specific than the default tuning options.

There are numerous options available for g++ that impact code performance and size. A basic understanding of these options can help you to pick the best settings for your application.
