Introduction
Overview
Set up your SME2 development environment
Test your SME2 development environment
Streaming mode and ZA state in SME
Vanilla matrix multiplication
Outer product
SME2 assembly matrix multiplication
Matrix multiplication using SME2 intrinsics in C
Benchmarking
Debugging
Going further
Next Steps
In this section, you’ll implement a basic matrix multiplication algorithm in C using row-major memory layout. This version acts as a reference implementation that you’ll use to validate the correctness of optimized versions later in the Learning Path.
The vanilla matrix multiplication operation takes two input matrices:
Ar
rows x Ac
columns]Br
rows x Bc
columns]It produces an output matrix C [Cr
rows x Cc
columns].
The algorithm works by iterating over each row of A and each column of B. It multiplies the corresponding elements and sums the products to generate each element of matrix C, as shown in the figure below.
The diagram below shows how matrix C is computed by iterating over rows of A and columns of B:
Figure 2: Standard matrix multiplication.
This implies that the A, B, and C matrices have some constraints on their dimensions:
Ac == Br
.For more information about matrix multiplication, including its history, properties and use, see this Wikipedia article on Matrix Multiplication .
The following variable names are used throughout the Learning Path to represent matrix dimensions and operands:
matLeft
corresponds to the left-hand side argument of the matrix multiplication.matRight
corresponds to the right-hand side of the matrix multiplication.M
is matLeft
number of rows.K
is matLeft
number of columns (and matRight
number of rows).N
is matRight
number of columns.matResult
corresponds to the result of the matrix multiplication, with M
rows and N
columns.Here is the full reference implementation from matmul_vanilla.c
:
void matmul(uint64_t M, uint64_t K, uint64_t N,
const float *restrict matLeft, const float *restrict matRight,
float *restrict matResult) {
for (uint64_t m = 0; m < M; m++) {
for (uint64_t n = 0; n < N; n++) {
float acc = 0.0;
for (uint64_t k = 0; k < K; k++)
acc += matLeft[m * K + k] * matRight[k * N + n];
matResult[m * N + n] = acc;
}
}
}
In this Learning Path, the matrices are laid out in memory as contiguous sequences of elements, in
row-major order
. The matmul
function performs the algorithm described above.
The pointers to matLeft
, matRight
and matResult
have been annotated as restrict
, which informs the compiler that the memory areas designated by those pointers do not alias. This means that they do not overlap in any way, so that the compiler does not need to insert extra instructions to deal with these cases. The pointers to matLeft
and matRight
are marked as const
as neither of these two matrices are modified by matmul
.
This function gives you a working baseline for matrix multiplication. You’ll use it later in the Learning Path to verify the correctness of optimized implementations using SME2 intrinsics and assembly.