Matrix multiplication using SME2 intrinsics in C
In this section, you will write an SME2-optimized matrix multiplication routine in C using the intrinsics that the compiler provides.
Intrinsics, also known as compiler intrinsics or intrinsic functions, are functions available to application developers that the compiler has intimate knowledge of. This enables the compiler to translate a call to a specific instruction, to perform specific optimizations, or both.
You can learn more about intrinsics in the Wikipedia article on Intrinsic function.
Using intrinsics allows you to write performance-critical code in C while still using standard constructs like loops. This produces performance close to what can be reached with hand-written assembly whilst being significantly more maintainable and portable.
All Arm-specific intrinsics are specified in the ACLE, the Arm C Language Extensions. ACLE is supported by the main compilers, most notably GCC and Clang.
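As an illustration of the style of code this enables, here is a minimal, plain (non-streaming) SVE sketch of a predicated, vector-length-agnostic loop written with intrinsics. The function name and shapes are illustrative only and are not part of the example project:

#include <arm_sve.h>
#include <stdint.h>

// Minimal sketch: a predicated, vector-length-agnostic loop with SVE intrinsics.
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, uint64_t n) {
  for (uint64_t i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);       // mask off lanes past the end
    svfloat32_t va = svld1(pg, &a[i]);       // predicated loads
    svfloat32_t vb = svld1(pg, &b[i]);
    svst1(pg, &dst[i], svadd_x(pg, va, vb)); // predicated add and store
  }
}

The loop structure stays ordinary C, while the predicate generated by svwhilelt_b32 takes care of the tail iterations.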
In this example, a top-level function named matmul_intr, defined in matmul_intr.c, brings together the preprocessing and matrix multiplication steps:
__arm_new("za") __arm_locally_streaming void matmul_intr(
uint64_t M, uint64_t K, uint64_t N, const float *restrict matLeft,
const float *restrict matRight, float *restrict matLeft_mod,
float *restrict matResult) {
uint64_t SVL = svcntsw();
preprocess_l_intr(M, K, SVL, matLeft, matLeft_mod);
matmul_intr_impl(M, K, N, SVL, matLeft_mod, matRight, matResult);
}
Note the use of __arm_new("za") and __arm_locally_streaming at line 1. __arm_locally_streaming makes the body of the function execute in streaming mode, and __arm_new("za") gives it its own ZA context: the compiler saves the ZA storage as needed, so your function can use it safely without destroying content that might still be in use by one of its callers.
SVL, the dimension of the ZA storage, is requested from the underlying hardware with the svcntsw() call at line 5, and passed down to the preprocess_l_intr and matmul_intr_impl functions. svcntsw() is an intrinsic defined by ACLE; it returns the number of 32-bit elements in a streaming vector, which is also the number of rows and columns of a 32-bit ZA tile.
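The sketch below (hypothetical functions, not part of the example sources) shows how these attributes fit together, mirroring the way matmul_intr calls preprocess_l_intr and matmul_intr_impl:

#include <arm_sme.h>
#include <stdint.h>

// Runs in the caller's streaming mode and updates the caller's ZA state.
void za_consumer(void) __arm_streaming __arm_inout("za") {
  svzero_za(); // placeholder work on the shared ZA state
}

// Owns a fresh ZA context and executes its body in streaming mode.
__arm_new("za") __arm_locally_streaming void za_owner(void) {
  uint64_t SVL = svcntsw(); // 32-bit elements per streaming vector, which is
                            // also the dimension of a 32-bit ZA tile
  za_consumer();            // the callee shares this function's ZA state
  (void)SVL;                // unused in this sketch
}

za_owner plays the role of matmul_intr, while za_consumer is annotated like the two worker functions described next.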
void preprocess_l_intr(
    uint64_t M, uint64_t K, uint64_t SVL, const float *restrict a,
    float *restrict a_mod) __arm_streaming __arm_inout("za") {
  const uint64_t M_mod = SVL * (M / SVL + (M % SVL != 0 ? 1 : 0));
  // The outer loop, iterating over rows (M dimension)
  for (uint64_t row = 0; row < M; row += SVL) {
    svbool_t pMDim = svwhilelt_b32(row, M);
    // The inner loop, iterating on columns (K dimension).
    for (uint64_t col = 0; col < K; col += 2 * SVL) {
      svcount_t pKDim = svwhilelt_c32(col, K, 2);
      // Load-as-rows
      for (uint64_t trow = 0; trow < SVL; trow += 4) {
        svcount_t p0 = svpsel_lane_c32(pKDim, pMDim, trow + 0);
        svcount_t p1 = svpsel_lane_c32(pKDim, pMDim, trow + 1);
        svcount_t p2 = svpsel_lane_c32(pKDim, pMDim, trow + 2);
        svcount_t p3 = svpsel_lane_c32(pKDim, pMDim, trow + 3);
        const uint64_t tile_UL_corner = (row + trow) * K + col;
        svfloat32x2_t zp0 = svld1_x2(p0, &a[tile_UL_corner + 0 * K]);
        svfloat32x2_t zp1 = svld1_x2(p1, &a[tile_UL_corner + 1 * K]);
        svfloat32x2_t zp2 = svld1_x2(p2, &a[tile_UL_corner + 2 * K]);
        svfloat32x2_t zp3 = svld1_x2(p3, &a[tile_UL_corner + 3 * K]);
        svfloat32x4_t zq0 = svcreate4(svget2(zp0, 0), svget2(zp1, 0),
                                      svget2(zp2, 0), svget2(zp3, 0));
        svfloat32x4_t zq1 = svcreate4(svget2(zp0, 1), svget2(zp1, 1),
                                      svget2(zp2, 1), svget2(zp3, 1));
        svwrite_hor_za32_f32_vg4(
            /* tile: */ 0, /* slice: */ trow, zq0);
        svwrite_hor_za32_f32_vg4(
            /* tile: */ 1, /* slice: */ trow, zq1);
      }
      // Read-as-columns and store
      const uint64_t dest_0 = row * K + col * SVL;
      const uint64_t dest_1 = dest_0 + SVL * SVL;
      for (uint64_t tcol = 0; tcol < SVL; tcol += 4) {
        svcount_t p0 = svwhilelt_c32(dest_0 + tcol * SVL, K * M_mod, 4);
        svcount_t p1 = svwhilelt_c32(dest_1 + tcol * SVL, K * M_mod, 4);
        svfloat32x4_t zq0 =
            svread_ver_za32_f32_vg4(/* tile: */ 0, /* slice: */ tcol);
        svfloat32x4_t zq1 =
            svread_ver_za32_f32_vg4(/* tile: */ 1, /* slice: */ tcol);
        svst1(p0, &a_mod[dest_0 + tcol * SVL], zq0);
        svst1(p1, &a_mod[dest_1 + tcol * SVL], zq1);
      }
    }
  }
}
Note that preprocess_l_intr has been annotated at line 3 with:
__arm_streaming - because this function is using streaming instructions
__arm_inout("za") - because preprocess_l_intr reuses the ZA storage from its caller
The matrix preprocessing is performed in a double-nested loop over the M (line 7) and K (line 12) dimensions of the input matrix a. The outer loop advances by SVL rows, which matches the vertical dimension of the ZA tiles being used, while the inner loop advances by 2 * SVL columns because two tiles are processed per iteration. The dimensions of a might not be exact multiples of SVL, however, which is why the predicates pMDim (line 9) and pKDim (line 14) are computed in order to know which rows (respectively columns) are valid.
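The sketch below (a hypothetical helper, not part of the example sources) shows this tail handling in isolation: the same predicate construction as preprocess_l_intr, with the active lane count made explicit. The svcount_t flavor produced by svwhilelt_c32 behaves the same way, but covers two (or four) consecutive vectors at once.

#include <arm_sme.h>
#include <stdint.h>

// Illustrative only: count how many rows of the current row block fall inside
// the matrix. For example, with SVL = 16 (a 512-bit streaming vector) and
// M = 125, the last block starts at row 112 and only 13 lanes are active.
uint64_t valid_rows_in_block(uint64_t row, uint64_t M) __arm_streaming {
  svbool_t pMDim = svwhilelt_b32(row, M); // lane i is active while row + i < M
  return svcntp_b32(pMDim, pMDim);        // number of active lanes
}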
The core of preprocess_l_intr is made of two parts:

Lines 17 - 37: load the matrix tile as rows. Loop unrolling is used at two levels in this part. At the lowest level, 4 rows are loaded at a time (lines 24-27). It goes further than that though: because SME2 has multi-vector operations (hence the svld1_x2 intrinsic, which loads two consecutive vectors' worth of data into a pair of vector registers), each load also fetches the same row of the neighboring tile on the right, which means two tiles are processed at once. At lines 29-32, the pairs of vector registers are rearranged into quads of vector registers so they can be stored horizontally into the two tiles of ZA storage at lines 33-36 with the svwrite_hor_za32_f32_vg4 intrinsic. Of course, as the input matrix might not have dimensions that are exact multiples of SVL, the p0, p1, p2, and p3 predicates are computed with the svpsel_lane_c32 intrinsic (lines 18-21) so that elements outside of the input matrix are set to 0 when they are loaded at lines 24-27.
Lines 39 - 51: read the matrix tiles as columns and store them. Now that the two tiles have been loaded horizontally, they are read vertically with the svread_ver_za32_f32_vg4 intrinsic into quads of vector registers (zq0 and zq1) at lines 45-48, and then stored with the svst1 intrinsic to the relevant location in the destination matrix a_mod (lines 49-50). Note again the use of the predicates p0 and p1 (computed at lines 43-44) in svst1 to prevent writing out of the matrix bounds.
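To summarize what preprocess_l_intr achieves, the scalar sketch below (illustrative only, not part of the project sources, and ignoring the tile-by-tile vectorized traversal) models the resulting memory layout: within each block of SVL rows, matLeft is stored column by column, each column as SVL contiguous floats, so that matmul_intr_impl can later load a whole column of a block with a single contiguous vector load.

#include <stdint.h>

// Scalar model of the layout produced by preprocess_l_intr.
// Rows beyond M are zero-padded within the last block.
void preprocess_l_scalar(uint64_t M, uint64_t K, uint64_t SVL,
                         const float *restrict a, float *restrict a_mod) {
  for (uint64_t row = 0; row < M; row += SVL)   // block of SVL rows
    for (uint64_t k = 0; k < K; k++)            // each column of the block
      for (uint64_t i = 0; i < SVL; i++)        // elements of that column
        a_mod[row * K + k * SVL + i] =
            (row + i < M) ? a[(row + i) * K + k] : 0.0f;
}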
Using intrinsics simplifies function development, provided you have a good understanding of the SME2 instruction set. Predicates, which are fundamental to both SVE and SME, allow you to express algorithms cleanly while handling corner cases efficiently. Notably, the for loops include no explicit condition checks for rows or columns that extend beyond the matrix bounds.
void matmul_intr_impl(
    uint64_t M, uint64_t K, uint64_t N, uint64_t SVL,
    const float *restrict matLeft_mod, const float *restrict matRight,
    float *restrict matResult) __arm_streaming __arm_inout("za") {
  // Build the result matrix tile by tile.
  for (uint64_t row = 0; row < M; row += SVL) {
    svbool_t pMDim = svwhilelt_b32(row, M);
    for (uint64_t col = 0; col < N; col += SVL) {
      svbool_t pNDim = svwhilelt_b32(col, N);
      // Outer product + accumulation
      svzero_za();
      const uint64_t matLeft_pos = row * K;
      const uint64_t matRight_UL_corner = col;
      for (uint64_t k = 0; k < K; k++) {
        svfloat32_t zL =
            svld1(pMDim, &matLeft_mod[matLeft_pos + k * SVL]);
        svfloat32_t zR =
            svld1(pNDim, &matRight[matRight_UL_corner + k * N]);
        svmopa_za32_m(0, pMDim, pNDim, zL, zR);
      }
      // Store ZA to matResult.
      const uint64_t result_tile_UL_corner = row * N + col;
      for (uint64_t trow = 0; trow < SVL && row + trow < M; trow += 4) {
        svbool_t p0 = svpsel_lane_b32(pNDim, pMDim, row + trow + 0);
        svbool_t p1 = svpsel_lane_b32(pNDim, pMDim, row + trow + 1);
        svbool_t p2 = svpsel_lane_b32(pNDim, pMDim, row + trow + 2);
        svbool_t p3 = svpsel_lane_b32(pNDim, pMDim, row + trow + 3);
        svst1_hor_za32(
            /* tile: */ 0, /* slice: */ trow + 0, p0,
            &matResult[result_tile_UL_corner + (trow + 0) * N]);
        svst1_hor_za32(
            /* tile: */ 0, /* slice: */ trow + 1, p1,
            &matResult[result_tile_UL_corner + (trow + 1) * N]);
        svst1_hor_za32(
            /* tile: */ 0, /* slice: */ trow + 2, p2,
            &matResult[result_tile_UL_corner + (trow + 2) * N]);
        svst1_hor_za32(
            /* tile: */ 0, /* slice: */ trow + 3, p3,
            &matResult[result_tile_UL_corner + (trow + 3) * N]);
      }
    }
  }
}
Note again that the matmul_intr_impl function has been annotated at line 4 with:
__arm_streaming, because the function is using streaming instructions
__arm_inout("za"), because the function reuses the ZA storage from its caller
The multiplication with the outer product is performed in a double-nested loop over the M (line 7) and N (line 11) dimensions of the input matrices matLeft_mod and matRight. Both loops have an SVL step increment, which corresponds to the horizontal and vertical dimensions of the ZA storage, as one tile is processed at a time.
The M and N dimensions of the inputs might not be exact multiples of SVL, so the predicates pMDim (line 9) and pNDim (line 13) are computed in order to know which rows (respectively columns) are valid.
The core of the multiplication is done in two parts:
Outer-product and accumulation at lines 15-25. As matLeft
has been laid out perfectly in memory with preprocess_l_intr
, this part becomes straightforward. First, the tile is zeroed with the svzero_za
intrinsics at line 16 so the outer products can be accumulated in the tile. The outer
products are computed and accumulation over the K
common dimension with the loop at line 19: the column of matleft_mod
and the row of matRight
are loaded with the svld1
intrinsics at line 20-23 to vector registers zL
and zR
, which are then used at line 24 with the svmopa_za32_m
intrinsic to perform the outer product and accumulation (to tile 0). This
is exactly what was shown in Figure 2 earlier in the Learning Path. Note again the usage of the pMDim
and pNDim
predicates to deal correctly with the rows and columns respectively which are out of bounds.
Storing of the result matrix at lines 27-46. The previous part computed the matrix multiplication result for the current tile, which now needs to be written back to memory. This is done by the loop at line 29, which iterates over all rows of the tile: the svst1_hor_za32 intrinsic at lines 35-46 stores directly from the tile to memory. Note that the loop has been unrolled by a factor of 4 (hence the trow += 4 increment at line 29 and the four svst1_hor_za32 calls). Again, the pMDim and pNDim predicates deal gracefully with the parts of the tile that are out of bounds for the destination matrix matResult.
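As referenced in the outer-product description above, the following scalar sketch (illustrative only; the function name and the flat tile layout are assumptions) models what a single svmopa_za32_m call computes: the outer product of the two input vectors accumulated into the SVL x SVL tile, restricted to the rows and columns that the pMDim and pNDim predicates mark as valid.

#include <stdint.h>

// Scalar model of one predicated outer-product-accumulate (FMOPA) step.
void fmopa_scalar_model(uint64_t SVL, uint64_t valid_rows, uint64_t valid_cols,
                        const float *restrict zL, const float *restrict zR,
                        float *restrict tile) {
  for (uint64_t i = 0; i < SVL; i++)     // tile rows, predicated by pMDim
    for (uint64_t j = 0; j < SVL; j++)   // tile columns, predicated by pNDim
      if (i < valid_rows && j < valid_cols)
        tile[i * SVL + j] += zL[i] * zR[j];
}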
Once again, intrinsics make it easy to fully leverage SME2, provided you have a solid understanding of its available instructions. The compiler automatically handles many low-level aspects, such as saving and restoring the different contexts, and avoids registers that are reserved on specific platforms (like x18). Predicates handle corner cases elegantly, ensuring robust execution. Most importantly, the code adapts to different SVL values across various hardware implementations without requiring recompilation. This follows the key principle of compile-once, run-everywhere, allowing systems with a larger SVL to execute computations more efficiently while using the same binary.
The main function is exactly the same as the one used for the assembly version, with the IMPL macro defined to be intr in the Makefile.
First, make sure that the sme2_matmul_intr executable is up to date:
make sme2_matmul_intr
__output__ make: `sme2_matmul_intr' is up to date.
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 make sme2_matmul_intr
__output__ make: 'sme2_matmul_intr' is up to date.
Then execute sme2_matmul_intr either natively or on the FVP:
./sme2_matmul_intr
__output__ SME2 Matrix Multiply fp32 *intr* [verification mode] with M=125, K=70, N=35
__output__ Matrix preprocessing: PASS !
__output__ Matrix multiplication: PASS !
docker run --rm -v "$PWD:/work" -w /work armswdev/sme2-learning-path:sme2-environment-v2 ./run-fvp.sh sme2_matmul_intr
__output__ SME2 Matrix Multiply fp32 *intr* [verification mode] with M=125, K=70, N=35
__output__ Matrix preprocessing: PASS !
__output__ Matrix multiplication: PASS !
__output__
__output__ Info: /OSCI/SystemC: Simulation stopped by user.