Try out Large System Extensions (LSE) using an example C program

Let’s take a look at an example to learn more and find out if the compiler is generating LSE instructions.

Shown below is an example program. Using a file editor of your choice, save this content into a file called atomic.c on your running instance:


            #include <stdio.h>
            #include <threads.h>
            #include <stdatomic.h>

            atomic_int acnt;   /* incremented atomically */
            int cnt;           /* incremented without synchronization */

            int f(void* thr_data)
            {
                for(int n = 0; n < 1000; ++n) {
                    ++cnt;
                    ++acnt;
                }
                return 0;
            }

            int main(void)
            {
                thrd_t thr[10];
                for(int n = 0; n < 10; ++n)
                    thrd_create(&thr[n], f, NULL);
                for(int n = 0; n < 10; ++n)
                    thrd_join(thr[n], NULL);
                printf("The atomic counter is %u\n", acnt);
                printf("The non-atomic counter is %u\n", cnt);
            }

The atomic_int C data type is used to indicate that accesses to the acnt variable must be atomic.

Let’s start on an AWS A1 instance. The A1 uses Cortex-A72 processors, which do not implement LSE. Any other Cortex-A53 or Cortex-A72 system can be used instead.

A1 Instance

On Ubuntu 20.04 the default gcc version is 9.4.0. Check this by running:


            gcc --version

The output is:


        gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
        Copyright (C) 2019 Free Software Foundation, Inc.
        This is free software; see the source for copying conditions.  There is NO
        warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


Now compile the application:


            gcc -g atomic.c -o a1 -march=armv8-a -lpthread
            objdump -S a1 > a1.dis

Review the disassembly file a1.dis and check the instructions for incrementing acnt. The sequence is:

  • Address of acnt is loaded into x0
  • Value of acnt is loaded into w1 using load exclusive
  • Add 1 (held in w2) to the loaded value in w1
  • Store exclusive to write the new value, with the status returned in w3
  • Check whether the store succeeded and, if not, loop back to 0x998 and load again

Here is a snippet of the disassembly performing this sequence:


        994:   f947e400        ldr     x0, [x0, #4040]
        998:   885ffc01        ldaxr   w1, [x0]
        99c:   0b020021        add     w1, w1, w2
        9a0:   8803fc01        stlxr   w3, w1, [x0]
        9a4:   35ffffa3        cbnz    w3, 998 <f+0x5c>


Now let’s move to a T4g instance with Graviton2.

T4g Instance

Compile the same application on a T4g instance. This instance uses Neoverse-N1 with LSE. Similar machines with Neoverse-N1 can also be used.


            gcc -g atomic.c -o t4g -march=armv8.2-a -lpthread
            objdump -S t4g > t4g.dis

Review the file t4g.dis and check the instructions for incrementing acnt. The sequence is:

  • Address of acnt is loaded into x0
  • Value of acnt is updated using a single instruction to add 1 to a word in memory

Here is a snippet of the disassembly performing this sequence:


        994:   f947e400        ldr     x0, [x0, #4040]
        998:   b8e10002        ldaddal w1, w2, [x0]


Staying on the T4g instance, let’s compile the application with outline-atomics:


            gcc -g atomic.c -o t4g.outline -moutline-atomics -lpthread
            objdump -S t4g.outline > outline.dis

Review the file outline.dis and see that the instruction to increment acnt is now a branch to a helper function, __aarch64_ldadd4_acq_rel, at address 0xb90:


         a24:   9400005b        bl      b90 <__aarch64_ldadd4_acq_rel>


Both the atomic instruction and the load exclusive sequence are present, as shown in the disassembly snippet below. The function loads a flag byte and tests it with cbz. On the T4g the flag is set, so the instructions before the first ret run. On the A1 the flag is clear, so the branch is taken and the instructions after the first ret run. The same binary runs on both instances with no changes. The cost of this flexibility is the overhead of the function call and the flag test on every atomic operation.


            0000000000000b90 <__aarch64_ldadd4_acq_rel>:
             b90:   d503245f        bti     c
             b94:   d0000090        adrp    x16, 12000 <__data_start>
             b98:   39404610        ldrb    w16, [x16, #17]
             b9c:   34000070        cbz     w16, ba8 <__aarch64_ldadd4_acq_rel+0x18>
             ba0:   b8e00020        ldaddal w0, w0, [x1]
             ba4:   d65f03c0        ret
             ba8:   2a0003f0        mov     w16, w0
             bac:   885ffc20        ldaxr   w0, [x1]
             bb0:   0b100011        add     w17, w0, w16
             bb4:   880ffc31        stlxr   w15, w17, [x1]
             bb8:   35ffffaf        cbnz    w15, bac <__aarch64_ldadd4_acq_rel+0x1c>
             bbc:   d65f03c0        ret

As a final check, move back to the A1 instance, compile for the armv8.2-a architecture, and run the program. The Cortex-A72 does not implement the ldaddal instruction, so the program fails when it reaches it.


            gcc -g atomic.c -o a1 -march=armv8.2-a -lpthread
            ./a1

The result is:


        Illegal instruction (core dumped)


How can I find out if my application has atomic instructions?

To check an application for atomic instructions, run objdump on the executable and count the matching lines. For the T4g executable:


            objdump -d t4g | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l

The command reports a count of 1, which is the ldaddal instruction.

To check whether an application contains load exclusive and store exclusive instructions, run the following command on the A1 executable built with -march=armv8-a. It reports a count of 2.


            objdump -d a1 | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l

Running both commands on the t4g.outline executable, which supports both architectures, reports both types of instructions.

Another way to confirm an executable supports both architectures is to run the command:


            nm t4g.outline | grep __aarch64_have_lse_atomics | wc -l

If the command prints 1, the executable was compiled with outline-atomics.


Large System Extensions introduce atomic instructions to improve performance for Arm systems with many processors. When migrating applications to Neoverse it helps to have an understanding of compilers, compiler options, and libraries. Also decide whether the application will target only Neoverse processors or must also run on processors that do not implement LSE.