You may have already seen a form of autovectorization in the section Understand the restrict keyword in C99.

The example in the previous section is a classic textbook loop that the compiler autovectorizes once restrict is added.

Compile the previously saved files:

    

        
        
            gcc -O2 addvec.c -o addvec
            gcc -O2 addvec_neon.c -o addvec_neon
        
    

Generate the assembly output using:

    

        
        
            objdump -D addvec 
        
    

The assembly output of the addvec() function is shown below:

    

        
        addvec:
                mov     x3, 0
        .L2:
                ldr     s0, [x1, x3, lsl 2]
                ldr     s1, [x2, x3, lsl 2]
                fadd    s0, s0, s1
                str     s0, [x0, x3, lsl 2]
                add     x3, x3, 1
                cmp     x3, 100
                bne     .L2
                ret

        
    

Generate the assembly output for addvec_neon using:

    

        
        
            objdump -D addvec_neon
        
    

The assembly output for the addvec() function from the addvec_neon executable is shown below:

    

        
        addvec:
                mov     x3, 0
        .L6:
                ldr     q0, [x1, x3]
                ldr     q1, [x2, x3]
                fadd    v0.4s, v0.4s, v1.4s
                str     q0, [x0, x3]
                add     x3, x3, 16
                cmp     x3, 400
                bne     .L6
                ret

        
    

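The second example uses the Advanced SIMD (Neon) form of fadd with the operands v0.4s and v1.4s, which adds four 32-bit floating-point elements per instruction. The loop counter therefore advances by 16 bytes per iteration and stops at 400 bytes, instead of counting 100 individual elements.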
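
To illustrate how those instructions map back to C, here is a minimal sketch of a hand-written Neon loop using intrinsics from arm_neon.h. The function name addvec_neon_sketch and the value N = 100 (inferred from the loop bound in the assembly) are assumptions; the actual addvec_neon.c from the previous section may differ. The scalar fallback is there only so that the sketch also builds on non-Arm hosts.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define N 100 /* assumed array length, inferred from the assembly above */

/* Hypothetical hand-written variant; the real addvec_neon.c may differ. */
void addvec_neon_sketch(float *C, const float *A, const float *B) {
#if defined(__ARM_NEON)
    /* N is a multiple of 4 here, so no scalar tail loop is needed. */
    for (size_t i = 0; i < N; i += 4) {
        float32x4_t va = vld1q_f32(&A[i]);  /* load 4 floats, roughly ldr q0 */
        float32x4_t vb = vld1q_f32(&B[i]);  /* load 4 floats, roughly ldr q1 */
        float32x4_t vc = vaddq_f32(va, vb); /* 4 additions, fadd v0.4s */
        vst1q_f32(&C[i], vc);               /* store 4 floats, roughly str q0 */
    }
#else
    /* Portable fallback for hosts without Neon. */
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
#endif
}
```

Each intrinsic processes four floats at once, which is why the vectorized loop needs only a quarter of the iterations of the scalar one.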

Add the restrict keyword to the output argument C in the addvec() function in addvec.c:

    

        
        
            void addvec(float *restrict C, float *A, float *B) {
                for (size_t i = 0; i < N; i++) {
                    C[i] = A[i] + B[i];
                }
            }
        
    

Recompile and check the assembly output again:

    

        
        
            gcc -O2 addvec.c -o addvec
            objdump -D addvec
        
    

The assembly output for the addvec function is now:

    

        
        addvec:
                mov     x3, 0
        .L2:
                ldr     q0, [x1, x3]
                ldr     q1, [x2, x3]
                fadd    v0.4s, v0.4s, v1.4s
                str     q0, [x0, x3]
                add     x3, x3, 16
                cmp     x3, 400
                bne     .L2
                ret

        
    

The compiler has now autovectorized the loop and, apart from the label name, the output matches the hand-written function.
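
To see why the restrict promise matters, consider a call where the output overlaps one of the inputs. The sketch below is illustrative only: the names addvec_plain and overlapping_call and the small size N = 8 are assumptions, not code from the previous section. Without restrict, the compiler must preserve this element-by-element behavior, so it cannot simply load four inputs before storing four outputs.

```c
#include <stddef.h>

#define N 8 /* small hypothetical size for illustration */

/* Same loop as addvec(), but without restrict: C may overlap A or B. */
void addvec_plain(float *C, float *A, float *B) {
    for (size_t i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

/* buf must hold N + 1 floats and ones must hold N floats.
 * C points one element past A, so every store to C[i] changes
 * the value that the next iteration loads as A[i + 1]. */
void overlapping_call(float *buf, float *ones) {
    addvec_plain(buf + 1, buf, ones);
}
```

If buf starts as all zeros and ones holds 1.0 in every element, the scalar loop leaves buf holding 0, 1, 2, and so on up to N, because each iteration reads the result of the previous one. A four-wide load of the original zeros followed by a four-wide store would give a different answer, which is why the compiler needs either a no-overlap guarantee or a runtime check before it vectorizes.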

Strictly speaking, you don’t even need restrict in such a trivial loop, because the compiler autovectorizes it anyway at a high enough optimization level (-O2 for clang, -O3 for gcc). However, using restrict simplifies the generated code and produces SIMD code similar to the hand-written version in addvec_neon.c.

The reason for this is related to how each compiler decides whether to autovectorize a loop.

For each candidate loop, the compiler estimates the possible performance gain using a cost model, which depends on many parameters, including the optimization level in the compilation flags.

The cost model estimates how much the autovectorized code grows in size and whether the performance gain is enough to outweigh that increase. Based on this estimate, the compiler either emits vectorized code or falls back to a ‘safer’ scalar implementation. These heuristics are not fixed, however; they are retuned during compiler development, so the decision can differ between compiler versions.
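
One common source of that code growth is loop versioning: when the pointers are not marked restrict, a compiler can emit a runtime overlap check and keep both a fast and a scalar copy of the loop. The following is a conceptual sketch of that idea written by hand in C, not actual compiler output; the names overlaps and addvec_versioned and the extra length parameter n are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element float ranges starting at p and q overlap. */
static int overlaps(const float *p, const float *q, size_t n) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return a < b + bytes && b < a + bytes;
}

/* Conceptual sketch only: a hand-written imitation of loop versioning. */
void addvec_versioned(float *C, const float *A, const float *B, size_t n) {
    size_t i = 0;
    if (!overlaps(C, A, n) && !overlaps(C, B, n)) {
        /* Fast path: compute four sums, then store them, as a 4-lane add would. */
        for (; i + 4 <= n; i += 4) {
            float t0 = A[i]     + B[i];
            float t1 = A[i + 1] + B[i + 1];
            float t2 = A[i + 2] + B[i + 2];
            float t3 = A[i + 3] + B[i + 3];
            C[i]     = t0;
            C[i + 1] = t1;
            C[i + 2] = t2;
            C[i + 3] = t3;
        }
    }
    /* Scalar path: the whole loop when the ranges overlap, otherwise the tail. */
    for (; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}
```

The function carries two loop bodies plus the check, which is the kind of size increase the cost model has to weigh against the expected speedup. Adding restrict removes the need for the check and the second copy.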

Compiler cost model analysis is beyond the scope of this Learning Path, but the example above demonstrates how a single keyword or compiler flag can trigger autovectorization.

You will see some more advanced examples in the next sections.
