To illustrate the structure and design principles of SIMD Loops, consider loop 202 as an example.
Use a text editor to open `loops/loop_202.c`.
The function `inner_loop_202()` is defined around lines 60–70 in `loops/loop_202.c` and calls the `matmul_fp32` routine defined in `loops/matmul_fp32.c`.
Open `loops/matmul_fp32.c` in your editor.
This loop implements single-precision floating-point matrix multiplication of the form:
C[M × N] = A[M × K] × B[K × N]
You can view matrix multiplication in two equivalent ways:

- As the dot product of each row of `A` and each column of `B`
- As the sum of the outer products of the columns of `A` and the rows of `B`
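The two views can be compared with a minimal scalar sketch. The function names `matmul_dot` and `matmul_outer` are hypothetical, and both matrices are assumed to be row-major here purely to keep the indexing simple; this is not the layout used by loop 202 itself.

```c
#include <assert.h>
#include <stddef.h>

/* Dot-product view: C[i][j] is the dot product of row i of A and
   column j of B. All matrices are row-major in this sketch. */
void matmul_dot(size_t m, size_t k, size_t n,
                const float *a, const float *b, float *c) {
  assert(a && b && c);
  for (size_t i = 0; i < m; i++) {
    for (size_t j = 0; j < n; j++) {
      float sum = 0.0f;
      for (size_t p = 0; p < k; p++) {
        sum += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = sum;
    }
  }
}

/* Outer-product view: C is the sum over p of
   (column p of A) times (row p of B), one rank-1 update per p. */
void matmul_outer(size_t m, size_t k, size_t n,
                  const float *a, const float *b, float *c) {
  assert(a && b && c);
  for (size_t i = 0; i < m * n; i++) {
    c[i] = 0.0f;
  }
  for (size_t p = 0; p < k; p++) {
    for (size_t i = 0; i < m; i++) {
      for (size_t j = 0; j < n; j++) {
        c[i * n + j] += a[i * k + p] * b[p * n + j];
      }
    }
  }
}
```

Both functions perform the same multiplications and differ only in loop order, which is what lets an implementation pick the formulation that best suits the hardware.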
The loop begins by defining a data structure that captures the matrix dimensions (`M`, `K`, `N`) along with input and output buffers:

    struct loop_202_data {
      uint64_t m;
      uint64_t n;
      uint64_t k;
      float *restrict a;
      float *restrict b;
      float *restrict c;
    };

For this loop:

- `a` is stored in column-major order
- `b` is stored in row-major order
- `a`, `b`, and `c` do not alias, as indicated by the `restrict` keyword

This layout helps optimize memory access patterns across the targeted SIMD architectures.
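To make the two layouts concrete, here is a small sketch of how elements would be addressed. The helpers `pack_a_column_major`, `a_at`, and `b_at` are hypothetical and are not part of the loop sources; they only mirror the layout described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: copy a row-major M x K matrix into the
   column-major layout described for `a`. Not an in-place transform. */
void pack_a_column_major(uint64_t m, uint64_t k,
                         const float *src_row_major, float *dst_col_major) {
  assert(src_row_major != dst_col_major);
  for (uint64_t row = 0; row < m; row++) {
    for (uint64_t col = 0; col < k; col++) {
      dst_col_major[col * m + row] = src_row_major[row * k + col];
    }
  }
}

/* Element (row, col) of the column-major M x K matrix `a`. */
float a_at(const float *a, uint64_t m, uint64_t row, uint64_t col) {
  return a[col * m + row];
}

/* Element (row, col) of the row-major K x N matrix `b`. */
float b_at(const float *b, uint64_t n, uint64_t row, uint64_t col) {
  return b[row * n + col];
}
```

With these layouts, a fixed index along `K` selects a contiguous column of `A` and a contiguous row of `B`, so both operands of one outer-product step can be read with unit-stride loads.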
Loop attributes are specified per target architecture. For SME targets, `inner_loop_202` is invoked with the `__arm_streaming` attribute and uses a shared `ZA` register context (`__arm_inout("za")`). These attributes are wrapped in the `LOOP_ATTR` macro.
This design enables portability across SIMD extensions.
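One way such a wrapper macro could look is sketched below. This is an illustration only: the guard `EXAMPLE_TARGET_SME`, the struct `example_data`, and the function `example_inner_loop` are hypothetical names, and the real definition of `LOOP_ATTR` in the project may differ.

```c
#include <assert.h>
#include <stdint.h>

/* EXAMPLE_TARGET_SME is a hypothetical build flag used only in this
   sketch to select the SME attributes. */
#if defined(EXAMPLE_TARGET_SME)
#define LOOP_ATTR __arm_streaming __arm_inout("za")
#else
#define LOOP_ATTR /* assumed: no extra attributes on other targets */
#endif

/* Hypothetical stand-in for a per-loop data structure. */
struct example_data {
  uint64_t n;
  float *restrict x;
};

/* The attribute macro follows the parameter list, so the same
   signature compiles whether or not the attributes are present. */
float example_inner_loop(struct example_data *restrict data) LOOP_ATTR {
  assert(data);
  float sum = 0.0f;
  for (uint64_t i = 0; i < data->n; i++) {
    sum += data->x[i];
  }
  return sum;
}
```

Keeping the attributes behind one macro means the function body and its callers stay textually identical across targets.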
`loops/matmul_fp32.c` provides several optimized implementations of matrix multiplication, including ACLE intrinsics and hand-optimized assembly.
A scalar C implementation appears around lines 40–52. It follows the dot-product formulation and serves as both a functional reference and an auto-vectorization baseline:

    for (uint64_t x = 0; x < m; x++) {
      for (uint64_t y = 0; y < n; y++) {
        c[x * n + y] = 0.0f;
      }
    }
    // Loops ordered for contiguous memory access in inner loop
    for (uint64_t z = 0; z < k; z++)
      for (uint64_t x = 0; x < m; x++) {
        for (uint64_t y = 0; y < n; y++) {
          c[x * n + y] += a[z * m + x] * b[z * n + y];
        }
      }

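To try the scalar reference in isolation, the snippet can be wrapped in a function, as in the sketch below. The function name `matmul_fp32_scalar_sketch` is hypothetical; the struct and the loop body are taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

struct loop_202_data {
  uint64_t m;
  uint64_t n;
  uint64_t k;
  float *restrict a; /* M x K, column-major */
  float *restrict b; /* K x N, row-major */
  float *restrict c; /* M x N, row-major */
};

/* Hypothetical wrapper around the scalar reference loops. */
void matmul_fp32_scalar_sketch(struct loop_202_data *restrict data) {
  assert(data);
  uint64_t m = data->m;
  uint64_t n = data->n;
  uint64_t k = data->k;
  float *restrict a = data->a;
  float *restrict b = data->b;
  float *restrict c = data->c;

  /* Clear the output before accumulating. */
  for (uint64_t x = 0; x < m; x++) {
    for (uint64_t y = 0; y < n; y++) {
      c[x * n + y] = 0.0f;
    }
  }
  /* Innermost loop walks contiguous elements of b and c. */
  for (uint64_t z = 0; z < k; z++) {
    for (uint64_t x = 0; x < m; x++) {
      for (uint64_t y = 0; y < n; y++) {
        c[x * n + y] += a[z * m + x] * b[z * n + y];
      }
    }
  }
}
```

For example, with `A = [[1, 2, 3], [4, 5, 6]]` stored column-major as `{1, 4, 2, 5, 3, 6}` and `B = [[7, 8], [9, 10], [11, 12]]` stored row-major, the output buffer holds `{58, 64, 139, 154}`.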
The SVE version uses indexed floating-point multiply–accumulate (`fmla`) to optimize the matrix multiplication operation. The outer product is decomposed into indexed multiply steps, and results accumulate directly in `Z` registers.
In the intrinsics version (lines 167–210), the loop nest is structured as follows:

    for (m_idx = 0; m_idx < m; m_idx += 8) {
      for (n_idx = 0; n_idx < n; n_idx += svcntw() * 2) {
        ZERO_PAIR(0);
        ZERO_PAIR(1);
        ZERO_PAIR(2);
        ZERO_PAIR(3);
        ZERO_PAIR(4);
        ZERO_PAIR(5);
        ZERO_PAIR(6);
        ZERO_PAIR(7);
        ptr_a = &a[m_idx];
        ptr_b = &b[n_idx];
        while (ptr_a < cnd_k) {
          lda_0 = LOADA_PAIR(0);
          lda_1 = LOADA_PAIR(1);
          ldb_0 = LOADB_PAIR(0);
          ldb_1 = LOADB_PAIR(1);
          MLA_GROUP(0);
          MLA_GROUP(1);
          MLA_GROUP(2);
          MLA_GROUP(3);
          MLA_GROUP(4);
          MLA_GROUP(5);
          MLA_GROUP(6);
          MLA_GROUP(7);
          ptr_a += m * 2;
          ptr_b += n * 2;
        }
        ptr_c = &c[n_idx];
        STORE_PAIR(0);
        STORE_PAIR(1);
        STORE_PAIR(2);
        STORE_PAIR(3);
        STORE_PAIR(4);
        STORE_PAIR(5);
        STORE_PAIR(6);
        STORE_PAIR(7);
      }
      c += n * 8;
    }

At the beginning of the loop, the accumulators (`Z` registers) are zeroed using `svdup` (or `dup` in assembly), encapsulated in the `ZERO_PAIR` macro.
Within each iteration over the `K` dimension:

- Elements of `A` are loaded using replicate loads `svld1rq` (or `ld1rqw`), through `LOADA_PAIR`
- Vectors of `B` are loaded using SVE vector loads, through `LOADB_PAIR`
- Indexed `fmla` operations compute element–vector products and accumulate into 16 `Z` register accumulators

After all `K` iterations, results in the `Z` registers are stored to `C` using the `STORE_PAIR` macro.
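The register-blocking scheme described above can be modelled in plain C. The sketch below is a scalar emulation, not the SVE code: `block_kernel_sketch` is a hypothetical name, the vector length is fixed at four 32-bit lanes as an assumption, and one `K` step is handled per iteration for simplicity. The 8 x 2 accumulator grid corresponds to the 16 accumulators mentioned in the text.

```c
#include <assert.h>
#include <stdint.h>

#define VL_WORDS 4          /* assumed lanes per vector (128-bit) */
#define ROWS 8              /* rows of C per block (m_idx += 8) */
#define COLS (2 * VL_WORDS) /* two vectors of C columns per block */

/* Computes one ROWS x COLS block of C at (m_idx, n_idx).
   a is column-major (M x K); b and c are row-major. */
void block_kernel_sketch(uint64_t m, uint64_t n, uint64_t k,
                         uint64_t m_idx, uint64_t n_idx,
                         const float *a, const float *b, float *c) {
  assert(m_idx + ROWS <= m && n_idx + COLS <= n);
  float acc[ROWS][COLS];

  /* Zero the accumulators (the role of ZERO_PAIR). */
  for (int r = 0; r < ROWS; r++) {
    for (int j = 0; j < COLS; j++) {
      acc[r][j] = 0.0f;
    }
  }

  for (uint64_t z = 0; z < k; z++) {
    const float *col_a = &a[z * m + m_idx]; /* 8 elements of A */
    const float *row_b = &b[z * n + n_idx]; /* 2 vectors of B */
    /* Each row r models indexed multiply-accumulate:
       one scalar from A times a whole vector pair from B. */
    for (int r = 0; r < ROWS; r++) {
      for (int j = 0; j < COLS; j++) {
        acc[r][j] += col_a[r] * row_b[j];
      }
    }
  }

  /* Write the block back (the role of STORE_PAIR). */
  for (int r = 0; r < ROWS; r++) {
    for (int j = 0; j < COLS; j++) {
      c[(m_idx + r) * n + n_idx + j] = acc[r][j];
    }
  }
}
```

Keeping the whole block in accumulators for the full `K` loop means `C` is written once per block rather than once per `K` step.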
The equivalent SVE hand-optimized assembly appears around lines 478–598.
This loop shows how SVE registers and indexed `fmla` enable efficient decomposition of the outer-product formulation into parallel, vectorized accumulation.
For SVE/SVE2 semantics and optimization guidance, see the Scalable Vector Extensions resources.
The SME2 implementation leverages the outer-product formulation of matrix multiplication, using the `fmopa` SME instruction to perform the outer product and accumulate partial results in `ZA` tiles.
A snippet of the loop is shown below:

    #if defined(__ARM_FEATURE_SME2p1)
    svzero_za();
    #endif
    for (m_idx = 0; m_idx < m; m_idx += svl_s * 2) {
      for (n_idx = 0; n_idx < n; n_idx += svl_s * 2) {
    #if !defined(__ARM_FEATURE_SME2p1)
        svzero_za();
    #endif
        ptr_a = &a[m_idx];
        ptr_b = &b[n_idx];
        while (ptr_a < cnd_k) {
          vec_a0 = svld1_x2(c_all, &ptr_a[0]);
          vec_b0 = svld1_x2(c_all, &ptr_b[0]);
          vec_a1 = svld1_x2(c_all, &ptr_a[m]);
          vec_b1 = svld1_x2(c_all, &ptr_b[n]);
          MOPA_TILE(0, 0, 0, 0);
          MOPA_TILE(1, 0, 0, 1);
          MOPA_TILE(2, 0, 1, 0);
          MOPA_TILE(3, 0, 1, 1);
          MOPA_TILE(0, 1, 0, 0);
          MOPA_TILE(1, 1, 0, 1);
          MOPA_TILE(2, 1, 1, 0);
          MOPA_TILE(3, 1, 1, 1);
          ptr_a += m * 2;
          ptr_b += n * 2;
        }
        ptr_c = &c[n_idx];
        for (l_idx = 0; l_idx < l_cnd; l_idx += 8) {
    #if defined(__ARM_FEATURE_SME2p1)
          vec_c0 = svreadz_hor_za8_u8_vg4(0, l_idx + 0);
          vec_c1 = svreadz_hor_za8_u8_vg4(0, l_idx + 4);
    #else
          vec_c0 = svread_hor_za8_u8_vg4(0, l_idx + 0);
          vec_c1 = svread_hor_za8_u8_vg4(0, l_idx + 4);
    #endif
          STORE_PAIR(0, 0, 1, 0);
          STORE_PAIR(1, 0, 1, n);
          STORE_PAIR(0, 2, 3, c_blk);
          STORE_PAIR(1, 2, 3, c_off);
          ptr_c += n * 2;
        }
      }
      c += c_blk * 2;
    }

Within the SME2 intrinsics code (lines 91–106), the innermost loop iterates across the `K` dimension, that is, over the columns of `A` and the rows of `B`.
In each iteration:

- Two vectors are loaded from `A` and two from `B` (`vec_a*`, `vec_b*`) using multi-vector load intrinsics
- `fmopa`, wrapped by `MOPA_TILE`, computes the outer product
- Partial results accumulate in `ZA` tiles

After all `K` iterations, results are written back in a store loop (lines 111–124).
During this phase, rows of `ZA` tiles are read into `Z` vectors using `svread_hor_za8_u8_vg4` (or `svreadz_hor_za8_u8_vg4` on SME2.1). The vectors are then stored to the output buffer with SME multi-vector `st1w` stores, wrapped by `STORE_PAIR`.
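The accumulation phase can be modelled with a scalar sketch in which each tile is a small 2D array. This is an emulation for illustration, not SME code: `mopa_sketch` and `sme_block_sketch` are hypothetical names, the streaming vector length is fixed at four 32-bit lanes as an assumption, and the mapping of tiles 0 to 3 onto the four quadrants of the block is a guess based on the `MOPA_TILE` arguments.

```c
#include <assert.h>
#include <stdint.h>

#define SVL_S 4 /* assumed 32-bit lanes per streaming vector */

/* Scalar model of one FP32 outer-product accumulate:
   tile += vec_a (as a column) times vec_b (as a row). */
void mopa_sketch(float tile[SVL_S][SVL_S],
                 const float *vec_a, const float *vec_b) {
  for (int row = 0; row < SVL_S; row++) {
    for (int col = 0; col < SVL_S; col++) {
      tile[row][col] += vec_a[row] * vec_b[col];
    }
  }
}

/* Accumulates the top-left (2*SVL_S) x (2*SVL_S) block of C in four
   tiles. a is column-major (M x K); b is row-major (K x N). */
void sme_block_sketch(uint64_t m, uint64_t n, uint64_t k,
                      const float *a, const float *b,
                      float tiles[4][SVL_S][SVL_S]) {
  assert(m >= 2 * SVL_S && n >= 2 * SVL_S);

  /* Clear all four accumulators before the K loop. */
  for (int t = 0; t < 4; t++) {
    for (int row = 0; row < SVL_S; row++) {
      for (int col = 0; col < SVL_S; col++) {
        tiles[t][row][col] = 0.0f;
      }
    }
  }

  for (uint64_t z = 0; z < k; z++) {
    const float *ptr_a = &a[z * m]; /* two vectors from column z of A */
    const float *ptr_b = &b[z * n]; /* two vectors from row z of B */
    mopa_sketch(tiles[0], ptr_a, ptr_b);
    mopa_sketch(tiles[1], ptr_a, ptr_b + SVL_S);
    mopa_sketch(tiles[2], ptr_a + SVL_S, ptr_b);
    mopa_sketch(tiles[3], ptr_a + SVL_S, ptr_b + SVL_S);
  }
}
```

Each `K` step reuses every loaded vector in two outer products, so four tiles are updated from only four vector loads.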
The equivalent SME2 hand-optimized assembly appears around lines 229–340.
For instruction semantics and SME/SME2 optimization guidance, see the SME Programmer’s Guide.
Beyond the SME2 and SVE implementations, this loop also includes additional optimized versions that leverage architecture-specific features:

- NEON: the NEON version (lines 612–710) uses structure load/store combined with indexed `fmla` to vectorize the computation.
- SVE2.1: the SVE2.1 version (lines 355–462) extends the base SVE approach using multi-vector loads and stores.
- SME2.1: the SME2.1 version uses `movaz`/`svreadz_hor_za8_u8_vg4` to reinitialize `ZA` tile accumulators while moving data out to registers.
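The difference between a plain read and a read-and-zero of a tile row can be shown with a scalar model. The helpers `read_row_sketch` and `readz_row_sketch` are hypothetical and only emulate the behavior described above; the tile size of four lanes is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SVL_S 4 /* assumed 32-bit lanes per streaming vector */

/* Plain read: copy one tile row out and leave the tile unchanged. */
void read_row_sketch(float tile[SVL_S][SVL_S], size_t row, float *out) {
  assert(row < SVL_S);
  for (size_t col = 0; col < SVL_S; col++) {
    out[col] = tile[row][col];
  }
}

/* Read-and-zero: copy one tile row out and clear it in the same pass,
   so the accumulator is ready for the next block. */
void readz_row_sketch(float tile[SVL_S][SVL_S], size_t row, float *out) {
  assert(row < SVL_S);
  for (size_t col = 0; col < SVL_S; col++) {
    out[col] = tile[row][col];
    tile[row][col] = 0.0f;
  }
}
```

With the read-and-zero form, the accumulators only need to be cleared once before the outer loops, which matches the placement of `svzero_za()` under `__ARM_FEATURE_SME2p1` in the SME2 snippet.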