In this section you will learn how to take advantage of specific Arm instructions.
The following code calculates the dot product of two integer arrays.
Copy the code and save it to a file named dotprod.c.
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
    int32_t result = 0;
    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}

int main() {
    const int N = 128;
    int32_t A[N], B[N];
    // Initialize the arrays so the result is well-defined
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = N - 1 - i;
    }
    int32_t dot = dotprod(A, B, N);
    printf("dotprod = %d\n", dot);
    return 0;
}
Such code is common in audio and video codecs where integer arithmetic is used instead of floating-point.
Compile the code:
gcc -O2 -fno-inline dotprod.c -o dotprod
Look at the assembly code:
objdump -D dotprod
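The full disassembly is long. To see just the dotprod function, you can filter the output, for example with an awk range pattern (a convenience one-liner; it assumes the GNU binutils output format, where each function ends with a blank line):

objdump -d dotprod | awk '/<dotprod>:/,/^$/'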
The objdump commands are omitted from the remainder of the examples, but you can run objdump each time you recompile to see the assembly output.
The assembly output for the dotprod() function is:
dotprod:
        mov     x6, x0
        cbz     x2, .L4
        mov     x3, 0
        mov     w0, 0
.L3:
        ldr     w5, [x6, x3, lsl 2]
        ldr     w4, [x1, x3, lsl 2]
        add     x3, x3, 1
        madd    w0, w5, w4, w0
        cmp     x2, x3
        bne     .L3
        ret
.L4:
        mov     w0, 0
        ret
You can see that it's a fairly standard scalar implementation, processing one element at a time. The -fno-inline option prevents the compiler from inlining dotprod() into main(). Inlining is generally good for performance, but it makes the autovectorization process harder to demonstrate because there is no easy way to distinguish the caller from the callee.
Next, increase the optimization level to -O3, recompile, and observe the assembly output again:
gcc -O3 -fno-inline dotprod.c -o dotprod
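If you want gcc to confirm which loops it vectorized, you can add the -fopt-info-vec diagnostic option to the same command; gcc then prints a short note for each loop it managed to vectorize:

gcc -O3 -fno-inline -fopt-info-vec dotprod.c -o dotprod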
The assembly for the dotprod() function is now:
dotprod:
        mov     x4, x0
        cbz     x2, .L7
        sub     x0, x2, #1
        cmp     x0, 2
        bls     .L8
        lsr     x0, x2, 2
        mov     x3, 0
        movi    v0.4s, 0
        lsl     x0, x0, 4
.L4:
        ldr     q2, [x4, x3]
        ldr     q1, [x1, x3]
        add     x3, x3, 16
        mla     v0.4s, v2.4s, v1.4s
        cmp     x0, x3
        bne     .L4
        addv    s0, v0.4s
        and     x3, x2, -4
        fmov    w0, s0
        tst     x2, 3
        beq     .L1
.L3:
        ldr     w8, [x4, x3, lsl 2]
        add     x6, x3, 1
        ldr     w7, [x1, x3, lsl 2]
        lsl     x5, x3, 2
        madd    w0, w8, w7, w0
        cmp     x2, x6
        bls     .L1
        add     x6, x5, 4
        add     x3, x3, 2
        ldr     w7, [x4, x6]
        ldr     w6, [x1, x6]
        madd    w0, w7, w6, w0
        cmp     x2, x3
        bls     .L1
        add     x5, x5, 8
        ldr     w2, [x1, x5]
        ldr     w1, [x4, x5]
        madd    w0, w2, w1, w0
.L1:
        ret
.L7:
        mov     w0, 0
        ret
.L8:
        mov     x3, 0
        mov     w0, 0
        b       .L3
The code is larger but you can see that some autovectorization has taken place.
The block at label .L4 contains the main loop, where the mla instruction multiplies and accumulates the dot products, 4 elements at a time.
After this loop, the addv instruction performs a horizontal addition of the 4 lanes of the accumulator vector to produce the sum. The main loop runs for as long as at least 4 elements remain; the leftover elements (N modulo 4) are processed one at a time in the .L3 section of the code.
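Conceptually, the vectorized main loop computes the same thing as the following scalar rewrite with four independent accumulators (an illustrative sketch only; the function name dotprod_4lanes is hypothetical, and it assumes N is a multiple of 4):

int32_t dotprod_4lanes(int32_t *A, int32_t *B, size_t N) {
    int32_t acc[4] = {0, 0, 0, 0};       // one accumulator per vector lane
    for (size_t i = 0; i < N; i += 4) {
        acc[0] += A[i]     * B[i];       // mla performs these four
        acc[1] += A[i + 1] * B[i + 1];   // multiply-accumulates in a
        acc[2] += A[i + 2] * B[i + 2];   // single instruction
        acc[3] += A[i + 3] * B[i + 3];
    }
    return acc[0] + acc[1] + acc[2] + acc[3];  // what addv does at the end
}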
With the new code, you can expect a performance gain of about 4x.
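To verify the speedup on your own hardware, you can replace the main() function with a small timing harness along these lines (a sketch; it assumes a POSIX system that provides clock_gettime, and the array size and repetition count are arbitrary choices):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int32_t dotprod(int32_t *A, int32_t *B, size_t N);  // defined as above

int main() {
    enum { N = 1024, REPS = 100000 };
    static int32_t A[N], B[N];
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = N - 1 - i;
    }

    struct timespec start, end;
    int64_t dot = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int r = 0; r < REPS; r++) {
        // accumulate the results so the calls cannot be optimized away
        dot += dotprod(A, B, N);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("dotprod = %lld, %d calls in %.3f s\n", (long long)dot, REPS, secs);
    return 0;
}

Run the harness with the -O2 and -O3 builds and compare the times.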
You might be wondering if there is a way to hint to the compiler that the sizes are always going to be multiples of 4, and so avoid the last part of the code.
The answer is yes, but it depends on the compiler. In the case of gcc, it is enough to add a statement that ensures the size is a multiple of 4.
Modify the dotprod() function to add the multiple-of-4 hint as shown below:
int32_t dotprod(int32_t *A, int32_t *B, size_t N) {
    int32_t result = 0;
    N -= N % 4;  // hint to the compiler: N is now a multiple of 4
    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}
Compile again with -O3:
gcc -O3 -fno-inline dotprod.c -o dotprod
The assembly output with -O3 is much more compact because it no longer needs to handle the leftover elements:
dotprod:
        ands    x2, x2, -4
        beq     .L4
        movi    v0.4s, 0
        lsl     x3, x2, 2
        mov     x2, 0
.L3:
        ldr     q2, [x1, x2]
        ldr     q1, [x0, x2]
        add     x2, x2, 16
        mla     v0.4s, v2.4s, v1.4s
        cmp     x3, x2
        bne     .L3
        addv    s0, v0.4s
        fmov    w0, s0
        ret
.L4:
        fmov    s0, wzr
        fmov    w0, s0
        ret
Notice that the N -= N % 4 hint compiles down to the single ands x2, x2, -4 instruction at the top of the function, which rounds N down to a multiple of 4 and branches straight to the early return when the result is zero.
Is there anything else the compiler can do?
Modern compilers are very proficient at generating code that uses all of the available instructions, provided they have the right information.
For example, the dotprod() function operates on int32_t elements. What if you could limit the range to 8 bits?
There is an Armv8 ISA extension that provides signed and unsigned dot product instructions (SDOT and UDOT), which compute dot products across the 8-bit elements of two vectors and accumulate the results into the 32-bit elements of the destination vector.
Could the compiler make use of these instructions automatically, or does the code need to be hand-written using intrinsics?
It turns out that some compilers will detect that the number of iterations is a multiple of the number of elements in a SIMD vector.
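For comparison, this is roughly what the hand-written alternative looks like, using the ACLE intrinsic vdotq_s32 from arm_neon.h (a sketch; the function name is hypothetical, and it assumes N is a multiple of 16 and compilation with -march=armv8-a+dotprod):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

int32_t dotprod_intrinsics(const int8_t *A, const int8_t *B, size_t N) {
    int32x4_t acc = vdupq_n_s32(0);
    for (size_t i = 0; i < N; i += 16) {
        int8x16_t a = vld1q_s8(A + i);
        int8x16_t b = vld1q_s8(B + i);
        // SDOT: 16 signed 8-bit products accumulated into 4 32-bit lanes
        acc = vdotq_s32(acc, a, b);
    }
    return vaddvq_s32(acc);  // horizontal addition, like addv
}

The remainder of this section shows that, with the right hints, the compiler can generate equivalent code on its own.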
Modify the dotprod.c code to use the int8_t type for the A and B arrays as shown below:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int32_t dotprod(int8_t *A, int8_t *B, size_t N) {
    int32_t result = 0;
    N -= N % 4;  // hint to the compiler: N is now a multiple of 4
    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}

int main() {
    const int N = 128;
    int8_t A[N], B[N];
    // Initialize the arrays so the result is well-defined
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = N - 1 - i;
    }
    int32_t dot = dotprod(A, B, N);
    printf("dotprod = %d\n", dot);
    return 0;
}
Compile the code:
gcc -O3 -fno-inline -march=armv8-a+dotprod dotprod.c -o dotprod
You need to compile with the architecture flag to use the dot product instructions.
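If you want the source to fail fast when the extension is not enabled, you can test the ACLE feature macro __ARM_FEATURE_DOTPROD at compile time, for example:

#if !defined(__ARM_FEATURE_DOTPROD)
#error "Compile with -march=armv8-a+dotprod (or newer) to enable the dot product instructions"
#endif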
The assembly output will be quite large because SDOT can only be used in the main loop, where the size is a multiple of 16. The compiler will unroll the loop to use Advanced SIMD instructions if the size is greater than 8, and byte-handling instructions if the size is smaller.
You can eliminate the extra tail instructions by changing the N -= N % 4 hint to use a multiple of 8, or even 16, as shown below:
int32_t dotprod(int8_t *A, int8_t *B, size_t N) {
    int32_t result = 0;
    N -= N % 16;  // hint to the compiler: N is now a multiple of 16
    for (size_t i = 0; i < N; i++) {
        result += A[i] * B[i];
    }
    return result;
}
The resulting assembly output only handles sizes that are multiples of 16:
dotprod:
        ands    x2, x2, -16
        beq     .L4
        movi    v0.4s, 0
        mov     x3, 0
.L3:
        ldr     q1, [x1, x3]
        ldr     q2, [x0, x3]
        sdot    v0.4s, v1.16b, v2.16b
        add     x3, x3, 16
        cmp     x2, x3
        bne     .L3
        addv    s0, v0.4s
        fmov    w0, s0
        ret
.L4:
        fmov    s0, wzr
        fmov    w0, s0
        ret
As before, at the end of the loop the addv instruction performs a horizontal addition of the 32-bit integer elements to produce the final dot product sum.
This particular implementation will be up to 4x faster than the previous version using mla.