In this section, you will write an SME2-optimized matrix multiplication in C using the intrinsics that the compiler provides.
Intrinsics, also known as compiler intrinsics or intrinsic functions, are functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to translate the function to a specific instruction, to perform specific optimizations, or both.
You can learn more about intrinsics in the Wikipedia article on Intrinsic Function.
Using intrinsics allows the programmer to use the specific instructions required to achieve the required performance, while writing all the typically-required standard code, such as loops, in C. This produces performance close to what can be reached with hand-written assembly, whilst being significantly more maintainable and portable.
All Arm-specific intrinsics are specified in the ACLE, the Arm C Language Extensions. ACLE is supported by the main compilers, most notably GCC and Clang.
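As a minimal illustration of the concept (a sketch, assuming an AArch64 target with NEON; the function name add4_f32 is purely illustrative), here is what using intrinsics looks like:

#include <arm_neon.h>

// Add two arrays of 4 floats with the vaddq_f32 intrinsic: the compiler
// maps the loads, the addition, and the store to single instructions.
void add4_f32(const float *a, const float *b, float *out) {
    float32x4_t va = vld1q_f32(a); // load 4 floats into a vector register
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb)); // vector add, then store 4 floats
}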
On the previous page, assembly language provided the programmer with full access to processor features. However, this comes at the cost of increased complexity and maintenance, particularly when managing large codebases with deeply nested function calls. Additionally, the assembly version operates at a very low level and does not fully handle the SME state.
In real-world, large-scale software, the program moves back and forth between streaming and non-streaming mode, and some streaming mode routines call other streaming mode routines, which means that some state, including the ZA storage, needs to be saved and restored. This is defined in the ACLE and supported by the compiler: the programmer just has to annotate the function with some keywords and set up some registers (see function setup_sme in misc.c for an example). See Introduction to streaming and non-streaming mode for further information. The rest of this section references information from the ACLE.
The AArch64 architecture defines a concept called streaming mode, controlled by a processor state bit called PSTATE.SM. At any given point in time, the processor is either in streaming mode (PSTATE.SM==1) or in non-streaming mode (PSTATE.SM==0). There is an instruction called SMSTART to enter streaming mode and an instruction called SMSTOP to return to non-streaming mode.
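In C, you normally do not issue these instructions yourself; the compiler emits them based on function annotations. As a minimal sketch (the function scale_f32 is hypothetical; arm_sme.h and an SME-enabled toolchain are assumed), annotating a function as locally streaming makes the compiler wrap its body with SMSTART and SMSTOP:

#include <arm_sme.h>
#include <stdint.h>

// The compiler emits SMSTART on entry and SMSTOP before returning, so the
// body runs in streaming mode while callers need no annotation at all.
__arm_locally_streaming void scale_f32(float *v, uint64_t n, float s) {
    for (uint64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);   // guard the final partial vector
        svfloat32_t z = svld1(pg, &v[i]);    // load one streaming vector
        svst1(pg, &v[i], svmul_x(pg, z, s)); // scale and store it back
    }
}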
Streaming mode has several effects on C and C++ code. Most notably, it can change the length of SVE vectors and predicates to the streaming vector length (SVL), and it changes which instructions, and therefore which intrinsics, can be executed.
The C and C++ standards define the behavior of programs in terms of an abstract machine. As an extension, the ACLE specification applies the distinction between streaming mode and non-streaming mode to this abstract machine: at any given point in time, the abstract machine is either in streaming mode or in non-streaming mode.
This distinction between processor mode and abstract machine mode is mostly just a specification detail. However, the usual “as if” rule applies: the processor’s actual mode at runtime can be different from the abstract machine’s mode, provided that this does not alter the behavior of the program. One practical consequence of this is that C and C++ code does not specify the exact placement of SMSTART and SMSTOP instructions; the source code simply places limits on where such instructions go. For example, when stepping through a program in a debugger, the processor mode might sometimes be different from the one implied by the source code.
ACLE provides attributes, such as __arm_streaming and __arm_locally_streaming, that specify whether the abstract machine executes statements in streaming or in non-streaming mode.
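For example (a sketch with hypothetical function names; the attributes are the ACLE ones):

#include <arm_sme.h>

// The abstract machine executes the body of this function in streaming
// mode; it must only be called from streaming code.
void streaming_kernel(void) __arm_streaming;

// This function can be called from either mode, so its body must only
// use constructs that are valid in both.
void mode_agnostic_helper(void) __arm_streaming_compatible;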
SME provides an area of storage called ZA, of size SVL.B x SVL.B bytes. It also provides a processor state bit called PSTATE.ZA to control whether ZA is enabled.
In C and C++ code, access to ZA is controlled at function granularity: a function either uses ZA or it does not. Another way to say this is that a function either “has ZA state” or it does not.
If a function does have ZA state, the function can either share that ZA state with the function’s caller or create new ZA state “from scratch”. In the latter case, it is the compiler’s responsibility to free up ZA so that the function can use it; see the description of the lazy saving scheme in AAPCS64 for details about how the compiler does this.
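For example (a sketch with hypothetical function names; the attributes themselves are the ACLE ones used in the code below):

#include <arm_sme.h>

// Creates ZA state "from scratch": the compiler arranges for any live ZA
// contents of the callers to be preserved (the AAPCS64 lazy-saving scheme).
__arm_new("za") void fresh_za_user(void);

// Shares ZA state with its caller: it reads and updates ZA in place.
void shared_za_user(void) __arm_streaming __arm_inout("za");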
Here again, a top-level function named matmul_intr in matmul_intr.c will be used to stitch together the preprocessing and the multiplication:
__arm_new("za") __arm_locally_streaming void matmul_intr(
uint64_t M, uint64_t K, uint64_t N, const float *restrict matLeft,
const float *restrict matRight, float *restrict matLeft_mod,
float *restrict matResult) {
uint64_t SVL = svcntsw();
preprocess_l_intr(M, K, SVL, matLeft, matLeft_mod);
matmul_intr_impl(M, K, N, SVL, matLeft_mod, matRight, matResult);
}
Note the __arm_new("za") and __arm_locally_streaming annotations at line 1, which make the compiler save the ZA storage so that it can be used here without destroying its content if it was still in use by one of the callers.
SVL, the dimension of the ZA storage, is requested from the underlying hardware with the svcntsw() function call at line 5, and passed down to the preprocess_l_intr and matmul_intr_impl functions. svcntsw() is a function provided by the ACLE library.
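As a small sanity-check sketch (assuming arm_sme.h and an SME-enabled toolchain), svcntsw() and its sibling svcntsb() can be called even from non-streaming code to query the streaming vector length:

#include <arm_sme.h>
#include <stdio.h>

// Print the streaming vector length: svcntsw() gives it in 32-bit words
// (the dimension of a 32-bit ZA tile), svcntsb() gives it in bytes.
int main(void) {
    printf("SVL: %llu words, %llu bytes\n",
           (unsigned long long)svcntsw(),
           (unsigned long long)svcntsb());
    return 0;
}

The preprocessing function preprocess_l_intr is shown below: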
void preprocess_l_intr(
uint64_t M, uint64_t K, uint64_t SVL, const float *restrict a,
float *restrict a_mod) __arm_streaming __arm_inout("za") {
const uint64_t M_mod = SVL * (M / SVL + (M % SVL != 0 ? 1 : 0));
// The outer loop, iterating over rows (M dimension)
for (uint64_t row = 0; row < M; row += SVL) {
svbool_t pMDim = svwhilelt_b32(row, M);
// The inner loop, iterating on columns (K dimension).
for (uint64_t col = 0; col < K; col += 2 * SVL) {
svcount_t pKDim = svwhilelt_c32(col, K, 2);
// Load-as-rows
for (uint64_t trow = 0; trow < SVL; trow += 4) {
svcount_t p0 = svpsel_lane_c32(pKDim, pMDim, trow + 0);
svcount_t p1 = svpsel_lane_c32(pKDim, pMDim, trow + 1);
svcount_t p2 = svpsel_lane_c32(pKDim, pMDim, trow + 2);
svcount_t p3 = svpsel_lane_c32(pKDim, pMDim, trow + 3);
const uint64_t tile_UL_corner = (row + trow) * K + col;
svfloat32x2_t zp0 = svld1_x2(p0, &a[tile_UL_corner + 0 * K]);
svfloat32x2_t zp1 = svld1_x2(p1, &a[tile_UL_corner + 1 * K]);
svfloat32x2_t zp2 = svld1_x2(p2, &a[tile_UL_corner + 2 * K]);
svfloat32x2_t zp3 = svld1_x2(p3, &a[tile_UL_corner + 3 * K]);
svfloat32x4_t zq0 = svcreate4(svget2(zp0, 0), svget2(zp1, 0),
svget2(zp2, 0), svget2(zp3, 0));
svfloat32x4_t zq1 = svcreate4(svget2(zp0, 1), svget2(zp1, 1),
svget2(zp2, 1), svget2(zp3, 1));
svwrite_hor_za32_f32_vg4(
/* tile: */ 0, /* slice: */ trow, zq0);
svwrite_hor_za32_f32_vg4(
/* tile: */ 1, /* slice: */ trow, zq1);
}
// Read-as-columns and store
const uint64_t dest_0 = row * K + col * SVL;
const uint64_t dest_1 = dest_0 + SVL * SVL;
for (uint64_t tcol = 0; tcol < SVL; tcol += 4) {
svcount_t p0 = svwhilelt_c32(dest_0 + tcol * SVL, K * M_mod, 4);
svcount_t p1 = svwhilelt_c32(dest_1 + tcol * SVL, K * M_mod, 4);
svfloat32x4_t zq0 =
svread_ver_za32_f32_vg4(/* tile: */ 0, /* slice: */ tcol);
svfloat32x4_t zq1 =
svread_ver_za32_f32_vg4(/* tile: */ 1, /* slice: */ tcol);
svst1(p0, &a_mod[dest_0 + tcol * SVL], zq0);
svst1(p1, &a_mod[dest_1 + tcol * SVL], zq1);
}
}
}
}
Note that preprocess_l_intr has been annotated at line 3 with __arm_streaming, because this function uses streaming instructions, and with __arm_inout("za"), because preprocess_l_intr reuses the ZA storage from its caller.
The matrix preprocessing is performed in a doubly nested loop, over the M (line 7) and K (line 12) dimensions of the input matrix a. The outer loop has an SVL step increment and the inner loop a 2 * SVL step increment, corresponding to the vertical and horizontal dimensions of the ZA storage that will be used (two tiles at a time). The dimensions of a may not be perfect multiples of SVL, however, which is why the predicates pMDim (line 9) and pKDim (line 14) are computed in order to know which rows (respectively columns) are valid.
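As a scalar model of what svwhilelt_b32(row, M) computes (a sketch for intuition, with a hypothetical name, not the real implementation):

#include <stdint.h>

// Lane i of the predicate is active exactly when row + i < M, so a
// partial final block automatically masks off out-of-bounds lanes.
static void whilelt_ref(uint64_t row, uint64_t M, uint64_t SVL,
                        uint8_t *pred /* SVL lanes */) {
    for (uint64_t i = 0; i < SVL; i++)
        pred[i] = (row + i < M) ? 1 : 0;
}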
The core of preprocess_l_intr is made of two parts (a scalar model of the resulting layout is sketched after this explanation).
Lines 17 - 37: load the matrix tile as rows. In this part, loop unrolling has been used at 2 different levels. At the lowest level, 4 rows are loaded at a time (lines 24-27). But this goes much further: as SME2 has multi-vector operations (hence the svld1_x2 intrinsic, which loads 2 * SVL consecutive elements into 2 vector registers), each load also fetches the continuation of the row, which happens to be the corresponding row of the neighboring tile on the right: this means two tiles are processed at once. At lines 29-32, the pairs of vector registers are rearranged into quads of vector registers so they can be stored horizontally into the two tiles’ ZA storage at lines 33-36 with the svwrite_hor_za32_f32_vg4 intrinsic. Of course, as the input matrix may not have dimensions that are perfect multiples of SVL, the p0, p1, p2 and p3 predicates are computed with the svpsel_lane_c32 intrinsic (lines 18-21) so that elements outside of the input matrix are set to 0 when they are loaded at lines 24-27.
Lines 39 - 51: read the matrix tile as columns and store them. Now that the 2 tiles have been loaded horizontally, they are read vertically with the svread_ver_za32_f32_vg4 intrinsic into quads of vector registers (zq0 and zq1) at lines 45-48, and then stored with the svst1 intrinsic to the relevant location in the destination matrix a_mod (lines 49-50). Note again the use of the predicates p0 and p1 (computed at lines 43-44) with svst1 to prevent writing out of the matrix bounds.
Using intrinsics simplifies function development significantly, provided one has a good understanding of the SME2 instruction set. Predicates, which are fundamental to SVE and SME, enable a natural expression of algorithms while handling corner cases efficiently. Notably, there is no explicit condition checking within the loops to account for rows or columns extending beyond matrix bounds.
void matmul_intr_impl(
uint64_t M, uint64_t K, uint64_t N, uint64_t SVL,
const float *restrict matLeft_mod, const float *restrict matRight,
float *restrict matResult) __arm_streaming __arm_inout("za") {
// Build the result matrix tile by tile.
for (uint64_t row = 0; row < M; row += SVL) {
svbool_t pMDim = svwhilelt_b32(row, M);
for (uint64_t col = 0; col < N; col += SVL) {
svbool_t pNDim = svwhilelt_b32(col, N);
// Outer product + accumulation
svzero_za();
const uint64_t matLeft_pos = row * K;
const uint64_t matRight_UL_corner = col;
for (uint64_t k = 0; k < K; k++) {
svfloat32_t zL =
svld1(pMDim, &matLeft_mod[matLeft_pos + k * SVL]);
svfloat32_t zR =
svld1(pNDim, &matRight[matRight_UL_corner + k * N]);
svmopa_za32_m(0, pMDim, pNDim, zL, zR);
}
// Store ZA to matResult.
const uint64_t result_tile_UL_corner = row * N + col;
for (uint64_t trow = 0; trow < SVL && row + trow < M; trow += 4) {
svbool_t p0 = svpsel_lane_b32(pNDim, pMDim, row + trow + 0);
svbool_t p1 = svpsel_lane_b32(pNDim, pMDim, row + trow + 1);
svbool_t p2 = svpsel_lane_b32(pNDim, pMDim, row + trow + 2);
svbool_t p3 = svpsel_lane_b32(pNDim, pMDim, row + trow + 3);
svst1_hor_za32(
/* tile: */ 0, /* slice: */ trow + 0, p0,
&matResult[result_tile_UL_corner + (trow + 0) * N]);
svst1_hor_za32(
/* tile: */ 0, /* slice: */ trow + 1, p1,
&matResult[result_tile_UL_corner + (trow + 1) * N]);
svst1_hor_za32(
/* tile: */ 0, /* slice: */ trow + 2, p2,
&matResult[result_tile_UL_corner + (trow + 2) * N]);
svst1_hor_za32(
/* tile: */ 0, /* slice: */ trow + 3, p3,
&matResult[result_tile_UL_corner + (trow + 3) * N]);
}
}
}
}
Note again that the matmul_intr_impl function has been annotated at line 4 with __arm_streaming, because the function uses streaming instructions, and with __arm_inout("za"), because the function reuses the ZA storage from its caller.
The multiplication with the outer product is performed in a doubly nested loop, over the M (line 7) and N (line 11) dimensions of the input matrices matLeft_mod and matRight. Both loops have an SVL step increment, which corresponds to the horizontal and vertical dimensions of the ZA storage, as one tile at a time will be processed. The M and N dimensions of the inputs may not be perfect multiples of SVL, so the predicates pMDim (line 9) and pNDim (line 13) are computed in order to know which rows (respectively columns) are valid.
The core of the multiplication is done in 2 parts (scalar models of both steps are sketched after this explanation).
Outer product and accumulation at lines 15-25. As matLeft has been laid out perfectly in memory by preprocess_l_intr, this part becomes straightforward. First, the tile is zeroed with the svzero_za intrinsic at line 16 so the outer products can be accumulated into the tile. The outer products are computed and accumulated over the common dimension K with the loop at line 19: the column of matLeft_mod and the row of matRight are loaded with the svld1 intrinsic at lines 20-23 into vector registers zL and zR, which are then used at line 24 with the svmopa_za32_m intrinsic to perform the outer product and accumulation (into tile 0). This is exactly what was shown in Figure 2 earlier in the Learning Path. Note again the use of the pMDim and pNDim predicates to deal correctly with the rows and (respectively) columns which are out of bounds.
Storing of the result matrix at lines 27-46. The previous part computed the matrix multiplication result for the current tile, which now needs to be written back to memory. This is done with the loop at line 29, which iterates over all rows of the tile: the svst1_hor_za32 intrinsic at lines 35-46 stores directly from the tile to memory. Note that the loop has been unrolled by a factor of 4 (thus the trow += 4 increment at line 29 and the four svst1_hor_za32 calls). Again, the p0 to p3 predicates, derived from pMDim and pNDim, deal gracefully with the parts of the tile which are out of bounds for the destination matrix matResult.
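As scalar models of these two steps (sketches for intuition with hypothetical names, not how the hardware implements them), one svmopa_za32_m call and one svst1_hor_za32 call behave like:

#include <stdint.h>

// Model of one svmopa_za32_m call: for every lane pair (i, j) active in
// the row and column predicates, accumulate the outer product into ZA.
static void mopa_ref(uint64_t SVL, float *za /* SVL x SVL */,
                     const uint8_t *pM, const uint8_t *pN,
                     const float *zL, const float *zR) {
    for (uint64_t i = 0; i < SVL; i++)
        for (uint64_t j = 0; j < SVL; j++)
            if (pM[i] && pN[j])
                za[i * SVL + j] += zL[i] * zR[j];
}

// Model of one svst1_hor_za32 call: horizontal slice `slice` of the tile
// is written to memory, with the predicate masking out-of-bounds lanes.
static void st1_hor_ref(uint64_t SVL, const float *za /* SVL x SVL */,
                        uint64_t slice, const uint8_t *pred, float *dst) {
    for (uint64_t j = 0; j < SVL; j++)
        if (pred[j])
            dst[j] = za[slice * SVL + j];
}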
Once again, intrinsics make it easy to fully leverage SME2, provided you have a solid understanding of its available instructions. Predicates handle corner cases elegantly, ensuring robust execution. Most importantly, the code adapts to different SVL values across various hardware implementations without requiring recompilation. This follows the key principle of compile once, run everywhere, allowing systems with a larger SVL to execute computations more efficiently while using the same binary.
The main function is exactly the same as the one used for the assembly version, with the IMPL macro defined to be intr in the Makefile.
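A plain scalar matrix multiplication, along the lines of the sketch below (matmul_ref is a hypothetical name), can serve as the reference against which the SME2 results reported as PASS below are validated:

#include <stdint.h>

// Textbook reference: c[m][n] = sum over k of a[m][k] * b[k][n].
void matmul_ref(uint64_t M, uint64_t K, uint64_t N, const float *a,
                const float *b, float *c) {
    for (uint64_t m = 0; m < M; m++)
        for (uint64_t n = 0; n < N; n++) {
            float acc = 0.0f;
            for (uint64_t k = 0; k < K; k++)
                acc += a[m * K + k] * b[k * N + n];
            c[m * N + n] = acc;
        }
}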
First, make sure that the sme2_matmul_intr
executable is up-to-date:
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 make sme2_matmul_intr
Then execute sme2_matmul_intr
on the FVP:
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v1 ./run-fvp.sh sme2_matmul_intr
This should output something similar to:
SME2 Matrix Multiply fp32 *intr* example with args 125 35 70
Matrix preprocessing: PASS !
Matrix multiplication: PASS !
Info: /OSCI/SystemC: Simulation stopped by user.