Modern CPUs have vector units that operate in a SIMD fashion. This greatly improves application performance, depending on the vector width.
The Armv7-A Instruction Set Architecture (ISA) introduced Advanced SIMD or Arm NEON instructions. These instructions are supported on the latest Armv8-A and Armv9-A architectures. NEON registers are composed of 32 128-bit registers V0-V31 and support multiple data types: integer, single-precision (SP) floating-point and double-precision (DP) floating-point.
In order to reduce restrictions regarding fixed-length vector sizes, Arm introduced the Scalable Vector Extension (SVE). Arm SVE is vector-length agnostic, allowing vector width from 128 up to 2048 bits. This enables software to scale dynamically to any SVE capable Arm hardware.
SVE is not an extension of NEON but a separate, optional extension of Arm v8-A with a new set of instruction encodings. SVE is used in HPC and general-purpose server software. SVE2 adds capabilities to enable more data-processing domains.
Even though SVE is vector length agnostic, developers are interested in vector length. Sometimes performance may differ between machines and vector length could be a factor impacting the performance. Operating systems and hypervisors may reduce the vector length for applications, so it’s a good idea to confirm the vector length while investigating SVE performance.
Using a text editor of your choice, copy the program below to file named sve.c
:
#include <stdio.h>
#include <arm_sve.h>
#ifndef __ARM_FEATURE_SVE
#warning "Make sure to compile for SVE!"
#endif
int main()
{
printf("SVE vector length is: %ld bytes\n", svcntb());
}
This program prints the vector length. Compile the program using the gcc SVE architecture flag for SVE as shown below. Without the SVE flag the code will not compile.
gcc -march=armv8-a+sve sve.c -o sve
Run the program:
./sve
On AWS Graviton3 processors, based on the Neoverse V2, the output is:
SVE vector length is: 32 bytes
If the hardware doesn’t support SVE the program will crash with an illegal instruction.
SVE is a predicate-centric architecture with:
Take a look at the example code compiled for SVE (left) and for NEON (right):
Notice how small the SVE assembly is in comparison to NEON. This is due to the predicate behaviour which avoids generating assembly for remainder loops (scalar operations performed when the iteration domain is not a multiple of the vector length.
Observe what the SVE assembly instructions are doing:
The first 4 lines initialize the registers R2, R3, R4. R2 corresponds to the array size, R3 to the loop index, and R4 to the vector length.
The instruction whilelo
initializes the predicate register as follows:
The ld1d
instructions perform memory loads: these are predicated instructions. Starting from the index stored in R3 and if the corresponding predicate lanes in P0 are active, load elements from the the array A (R0) in Z0 and B (R1) in Z1.
The next instruction fadd
is not predicated: values in Z0 and Z1 are added and the result is stored in Z0. However, only the predicated values are stored in array B (R1) in memory with st1w
.
Finally, the value of the loop index R3 is incremented by the vector length R4 and the value of the loop index is updated:
If all lanes are inactive, break the loop.
This example illustrates the logic behind SVE: it keeps the code simple and increases vectorization.