You can build and run an example to learn more and find out if the compiler is generating LSE instructions.
Below is an example program from cppreference.com .
Use a text editor of your choice to save the example program in a file called atomic.c
on your Arm Linux computer.
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>
atomic_int acnt;
int cnt;
int f(void* thr_data)
{
for(int n = 0; n < 1000; ++n) {
++cnt;
++acnt;
}
return 0;
}
int main(void)
{
thrd_t thr[10];
for(int n = 0; n < 10; ++n)
thrd_create(&thr[n], f, NULL);
for(int n = 0; n < 10; ++n)
thrd_join(thr[n], NULL);
printf("The atomic counter is %u\n", acnt);
printf("The non-atomic counter is %u\n", cnt);
}
The atomic_int C data type is used to indicate that accesses to the acnt variable must be atomic.
The results on different AWS instance types are shown below. You can also try this on any Arm Linux computer.
The AWS A1 instance uses Cortex-A72, without LSE. This can also be done on any Cortex-A53 or Cortex-A72 system.
On Ubuntu 20.04 the default gcc version is 9.4.0. Check this by running:
gcc --version
The output is:
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Now compile the application:
gcc -g atomic.c -o a1 -march=armv8-a -lpthread
objdump -S a1 > a1.dis
Review the disassembly file a1.dis
and check the instructions for incrementing acnt. The sequence is:
Here is a snippet of the disassembly performing this sequence:
994: f947e400 ldr x0, [x0, #4040]
998: 885ffc01 ldaxr w1, [x0]
99c: 0b020021 add w1, w1, w2
9a0: 8803fc01 stlxr w3, w1, [x0]
9a4: 35ffffa3 cbnz w3, 998 <f+0x5c>
The AWS T4g instance with Graviton2 uses Neoverse N1 with LSE. You can also use similar machines with Neoverse N1.
Compile the same application on a T4g instance:
gcc -g atomic.c -o t4g -march=armv8.2-a -lpthread
objdump -S t4g > t4g.dis
Review the file t4g.dis and check the instructions for incrementing acnt. The sequence is:
Here is a snippet of the disassembly performing this sequence:
994: f947e400 ldr x0, [x0, #4040]
998: b8e10002 ldaddal w1, w2, [x0]
Staying on the T4g instance, compile the application with outline-atomics:
gcc -g atomic.c -o t4g.outline -moutline-atomics -lpthread
objdump -S t4g.outline > outline.dis
Review the file outline.dis and see that the instruction to increment acnt is now a branch to something called __aarch64_ldadd4_acq_rel at address 0xb90:
a24: 9400005b bl b90 <__aarch64_ldadd4_acq_rel>
The code for both the load exclusive sequence and the atomic instruction are present as shown in the disassembly snippet below. The section of instructions before the first ret instruction is run on the T4g and the following instructions are run on the A1. This binary will run on both instances with no changes. In exchange for this flexibility there is the overhead to take a branch and run the correct code path.
0000000000000b90 <__aarch64_ldadd4_acq_rel>:
b90: d503245f bti c
b94: d0000090 adrp x16, 12000 <__data_start>
b98: 39404610 ldrb w16, [x16, #17]
b9c: 34000070 cbz w16, ba8 <__aarch64_ldadd4_acq_rel+0x18>
ba0: b8e00020 ldaddal w0, w0, [x1]
ba4: d65f03c0 ret
ba8: 2a0003f0 mov w16, w0
bac: 885ffc20 ldaxr w0, [x1]
bb0: 0b100011 add w17, w0, w16
bb4: 880ffc31 stlxr w15, w17, [x1]
bb8: 35ffffaf cbnz w15, bac <__aarch64_ldadd4_acq_rel+0x1c>
bbc: d65f03c0 ret
As a final check, move back to the A1 instance and compile for armv8.2-a
architecture. The atomic instruction is illegal on the Cortex-A72 and fails.
gcc -g atomic.c -o a1 -march=armv8.2-a -lpthread
./a1
The result is:
Illegal instruction (core dumped)
To check for atomic instructions in applications run objdump on the T4g executable:
objdump -d t4g | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l
The above command will report a count of 1 instruction, the ldaddal
instruction.
To check whether applications contain load exclusives and store exclusives run this command on the A1 executable. It will report a count of 2.
objdump -d a1 | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l
Running on the t4g.outline executable which supports both architectures will report both types of instructions.
Another way to confirm an executable supports both architectures is to run the command:
nm t4g.outline | grep __aarch64_have_lse_atomics | wc -l
If it returns a 1 then it was compiled with outline-atomics.
Large System Extensions introduce atomic instructions to improve performance for Arm systems with many processors. When migrating applications to Neoverse it helps to have an understanding of compilers, compiler options, and libraries. Also, think about the strategy for an application supporting only Neoverse or also including support for processors which don’t include LSE.