Let’s try another example, one from gcc restrict pointer examples :
void process_data (const char *in, char *out, size_t size)
{
for (int i = 0; i < size; i++)
out[i] = in[i] + in[i + 1];
}
This example will be easier to demonstrate with SVE2. We found gcc 13 to have a better result than clang; this is the output of gcc-13 -O3 -march=armv9-a
:
process_data:
cbz x2, .L1
add x5, x0, 1
cntb x3
sub x4, x1, x5
sub x3, x3, #1
cmp x4, x3
bls .L6
mov w4, w2
mov x3, 0
whilelo p0.b, wzr, w2
.L4:
ld1b z0.b, p0/z, [x0, x3]
ld1b z1.b, p0/z, [x5, x3]
add z0.b, z0.b, z1.b
st1b z0.b, p0, [x1, x3]
incb x3
whilelo p0.b, w3, w4
b.any .L4
.L1:
ret
.L6:
mov x3, 0
.L3:
ldrb w4, [x5, x3]
ldrb w6, [x0, x3]
add w4, w4, w6
strb w4, [x1, x3]
add x3, x3, 1
cmp x2, x3
bne .L3
ret
Do not worry about each instruction in the assembly here, but notice that gcc has added 2 loops, one that uses the SVE2 while*
instructions to the processing (.L4) and one scalar loop (.L3). The latter is executed in case there is any pointer aliasing (basically, if there is any overlap between the memory pointers). Let’s try adding restrict
to pointer in
:
void process_data (const char *restrict in, char *out, size_t size)
{
for (int i = 0; i < size; i++)
out[i] = in[i] + in[i + 1];
}
This is now the output from gcc-13:
process_data:
cbz x2, .L1
add x5, x0, 1
mov w4, w2
mov x3, 0
whilelo p0.b, wzr, w2
.L3:
ld1b z1.b, p0/z, [x0, x3]
ld1b z0.b, p0/z, [x5, x3]
add z0.b, z0.b, z1.b
st1b z0.b, p0, [x1, x3]
incb x3
whilelo p0.b, w3, w4
b.any .L3
.L1:
ret
This is a huge improvement! The code size is down from 30 lines to 14, less than half of the original size. In both cases, note that the main loop (.L4
in the former case, .L3
in the latter) is exactly the same, but the entry and exit code of the function is very much simplified. The compiler was able to distinguish that the memory pointed by in
does not overlap with memory pointed by out
, it was able to simplify the code by eliminating the scalar loop, and also remove the associated code that checked if it needed to enter it.
Why is this important if the main loop is still the same?
If your function is going to be called once and run over tens of billions of elements, then saving a few instructions before and after the main loop does not really matter.
But, if your function is going to be called on smaller sizes or even billions of times, then saving a few instructions in this function means we are saving a few billions of instructions in total, which means less time spent running on the CPU and less energy wasted.