The previous example using the SDOT/UDOT instructions is only one of the Arm-specific optimizations possible.

While it is not possible to demonstrate all of the specialized instructions offered by the Arm architecture, it’s worth looking at another example.

Below is a very simple loop that calculates what is known as a Sum of Absolute Differences (SAD). Such loops are very common in video codecs, where they are used to measure the difference between video frames.

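    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>

    uint32_t sad8(int8_t *A, int8_t *B, size_t N) {
        uint32_t result = 0;
        /* Round N down to a multiple of 16 (hint for the vectorizer). */
        N -= N % 16;
        for (size_t i=0; i < N; i++) {
            result += abs(A[i] - B[i]);
        }
        return result;
    }

    int main() {
        const int N = 128;
        int8_t A[N], B[N];

        /* Illustrative fill values and call (an assumption; any data works). */
        for (int i = 0; i < N; i++) {
            A[i] = (int8_t)i;
            B[i] = (int8_t)(N - 1 - i);
        }
        printf("sad8 = %u\n", sad8(A, B, N));
        return 0;
    }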

The statement N -= N % 16 rounds the size down to a multiple of 16. It acts as a hint to the compiler, so that it does not generate extra code paths for smaller lengths. This is for demonstration purposes only.

Save the above code to a file named sadtest.c and compile it:

The assembly output for sad8() is the following:

            ands    x2, x2, -16
            beq     .L4
            movi    v3.4s, 0
            mov     x3, 0
    .L3:
            ldr     q1, [x0, x3]
            ldr     q2, [x1, x3]
            add     x3, x3, 16
            sabdl2  v0.8h, v1.16b, v2.16b
            sabal   v0.8h, v1.8b, v2.8b
            sadalp  v3.4s, v0.8h
            cmp     x2, x3
            bne     .L3
            addv    s3, v3.4s
            fmov    w0, s3
            ret
    .L4:
            fmov    s3, wzr
            fmov    w0, s3
            ret

You can see that the compiler generates code that uses three specialized instructions that exist only on Arm: SABDL2, SABAL, and SADALP.
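The `ands x2, x2, -16` at the top of the listing corresponds to the `N -= N % 16` statement in the C source. As a small sketch of why the two are equivalent (the function names `round_down_mod` and `round_down_mask` are hypothetical, chosen for illustration only): rounding down to a multiple of 16 is the same as clearing the low four bits of the value.

```c
#include <stddef.h>

/* Round n down to a multiple of 16, written as in the C source. */
size_t round_down_mod(size_t n) {
    return n - n % 16;
}

/* Same result with a bit mask: -16 as a 64-bit value has every bit set
   except the low four, so the AND clears exactly those four bits. */
size_t round_down_mask(size_t n) {
    return n & ~(size_t)15;
}
```

Because ANDS also sets the condition flags, the following `beq` can skip the loop directly when the rounded size is zero.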

The accumulator variable is 32-bit, not 8-bit. A typical SIMD implementation that performs 16 x 8-bit subtractions, then 16 x absolute values and 16 x additions would therefore not be enough: the values would have to be widened to 32-bit before the accumulation.

This would mean that only 4 items at a time would be accumulated. With the use of these instructions, however, the code can be up to 16x faster than the original scalar code, or about 4x faster than the typical SIMD implementation.
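To make the widening steps concrete, here is a portable C model of what the three instructions named above compute on one 16-byte block. It is a sketch for illustration only, not the compiler's output: the function names (`sad_block16_model`, `sad8_model`, `sad8_ref`) are hypothetical, and plain arrays stand in for the vector registers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Model of one loop iteration on a 16-byte block.
   acc stands in for the four 32-bit lanes of the accumulator register. */
static void sad_block16_model(const int8_t *a, const int8_t *b, uint32_t acc[4]) {
    uint16_t wide[8];

    /* SABDL2: absolute difference of the upper 8 bytes, widened to 16 bits. */
    for (int i = 0; i < 8; i++)
        wide[i] = (uint16_t)abs(a[8 + i] - b[8 + i]);

    /* SABAL: absolute difference of the lower 8 bytes, added to the same lanes. */
    for (int i = 0; i < 8; i++)
        wide[i] = (uint16_t)(wide[i] + abs(a[i] - b[i]));

    /* SADALP: add adjacent pairs of 16-bit lanes into the 32-bit lanes. */
    for (int i = 0; i < 4; i++)
        acc[i] += (uint32_t)wide[2 * i] + wide[2 * i + 1];
}

/* Whole-array model: 16 bytes per iteration, lanes summed at the end. */
uint32_t sad8_model(const int8_t *A, const int8_t *B, size_t N) {
    uint32_t acc[4] = {0, 0, 0, 0};
    N -= N % 16;
    for (size_t i = 0; i < N; i += 16)
        sad_block16_model(A + i, B + i, acc);
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* Scalar reference, identical to the original loop. */
uint32_t sad8_ref(const int8_t *A, const int8_t *B, size_t N) {
    uint32_t result = 0;
    N -= N % 16;
    for (size_t i = 0; i < N; i++)
        result += (uint32_t)abs(A[i] - B[i]);
    return result;
}
```

Calling `sad8_model` and `sad8_ref` on the same inputs should give the same sum. The point of the model is that each 16-bit lane holds at most 255 + 255 = 510, so nothing overflows before the pairwise add widens the values to 32 bits.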

For completeness, consider the SVE2 version as well, which does not depend on the size being a multiple of 16.

This version omits the N -= N % 16 statement before the loop.

You can compile it on any Arm system (even one without support for SVE2) just by adding the appropriate -march flag:

Depending on your compiler version, -march=armv9-a might not be available. If this is the case, you can use -march=armv8-a+sve2 instead.

The SVE2 assembly output for sad8() is:

            ands    x2, x2, -16
            beq     .L4
            mov     x3, 0
            mov     x4, x2
            whilelo p0.b, xzr, x2
            uqdecb  x4
            mov     z2.b, #0
            mov     z3.b, #1
            ptrue   p1.b, all
    .L3:
            ld1b    z0.b, p0/z, [x0, x3]
            ld1b    z1.b, p0/z, [x1, x3]
            sel     z1.b, p0, z1.b, z0.b
            whilelo p0.b, x3, x4
            sabd    z0.b, p1/m, z0.b, z1.b
            incb    x3
            udot    z2.s, z0.b, z3.b
            b.any   .L3
            fmov    w0, s2
            ret
    .L4:
            fmov    s2, wzr
            fmov    w0, s2
            ret
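The reason a predicated loop does not need the size to be a multiple of 16 can be sketched in portable C. The model below is an illustration under stated assumptions, not a translation of the listing: `MODEL_VL` is a hypothetical fixed vector length (real SVE hardware chooses its own), and the function name `sad8_predicated_model` is invented for this example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

#define MODEL_VL 16  /* hypothetical vector length in bytes (assumption) */

uint32_t sad8_predicated_model(const int8_t *A, const int8_t *B, size_t N) {
    uint32_t acc[MODEL_VL / 4] = {0};  /* models the 32-bit accumulator lanes */

    /* The loop keeps running while at least one lane is active (b.any). */
    for (size_t base = 0; base < N; base += MODEL_VL) {
        for (size_t lane = 0; lane < MODEL_VL; lane++) {
            /* WHILELO: a lane is active only while its element index is below N. */
            int active = (base + lane) < N;

            /* LD1B with a zeroing predicate: inactive lanes read as 0. */
            int a = active ? A[base + lane] : 0;
            int b = active ? B[base + lane] : 0;

            /* SABD: per-byte absolute difference, kept as an unsigned 8-bit value. */
            uint8_t diff = (uint8_t)abs(a - b);

            /* UDOT with a vector of ones: each group of four bytes
               is summed into one 32-bit lane. */
            acc[lane / 4] += diff;
        }
    }

    /* Sum the lanes to get the scalar result. */
    uint32_t result = 0;
    for (size_t i = 0; i < MODEL_VL / 4; i++)
        result += acc[i];
    return result;
}
```

Inactive lanes contribute zero to the sum, so the final partial vector is handled by the same loop body and no separate scalar tail is needed.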

## Conclusion

You might ask why you should learn about autovectorization if you need specialized knowledge of instructions like SDOT or SADALP in order to benefit.

Autovectorization is a tool. The goal is to minimize the effort required by developers and maximize the performance, while at the same time requiring low maintenance in terms of code size.

It is far easier to maintain hundreds or thousands of functions that are known to generate the fastest code using autovectorization for all platforms, than it is to maintain the same number of functions in multiple versions for each supported architecture and SIMD engine.

As with most tools, the better you know how to use it, the better the results will be.
