Streaming mode and ZA state in SME
Programs can switch between streaming and non-streaming mode during execution. When one streaming-mode function calls another, parts of the processor state - such as ZA storage - might need to be saved and restored. This behavior is governed by the Arm C Language Extensions (ACLE) and is managed by the compiler.
To use streaming mode, you simply annotate the relevant functions with the appropriate keywords. The compiler handles the low-level mechanics of streaming mode management, removing the need for error-prone, manual work.
For more information, see the Introduction to streaming and non-streaming mode. The rest of this section references content from the ACLE specification.
Streaming mode changes how the processor and compiler manage execution context. Here’s how it works:
The AArch64 architecture defines a concept called streaming mode, controlled by a processor state bit, PSTATE.SM.
At any given point in time, the processor is either in streaming mode (PSTATE.SM == 1) or in non-streaming mode (PSTATE.SM == 0).
The SMSTART instruction enters streaming mode, and the SMSTOP instruction returns to non-streaming mode.
Streaming mode affects C and C++ code in the following ways:
The ACLE specification extends the C and C++ abstract machine model to include streaming mode. At any given time, the abstract machine is either in streaming or non-streaming mode.
This distinction between abstract machine mode and processor mode is mostly a specification detail. At runtime, the processor’s mode may differ from the abstract machine’s mode - as long as the observable program behavior remains consistent (as per the “as-if” rule).
One practical consequence of this is that C and C++ code does not specify the exact placement of SMSTART and SMSTOP instructions; the source code simply places limits on where such instructions go. For example, when stepping through a program in a debugger, the processor mode might sometimes be different from the one implied by the source code.
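ACLE provides attributes that specify whether the abstract machine executes statements in non-streaming mode (non-streaming statements), in streaming mode (streaming statements), or in either mode (streaming-compatible statements).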
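As a minimal sketch of how these annotations look in source code, the block below marks one function as streaming and one as streaming-compatible, and calls both from an ordinary non-streaming function. The keywords `__arm_streaming` and `__arm_streaming_compatible` come from ACLE; the function names and the `STREAMING` / `STREAMING_COMPATIBLE` macros are hypothetical, and the `__ARM_FEATURE_SME` guard is an assumption made here so that the sketch still compiles as plain C on toolchains without SME enabled.

```c
#if defined(__ARM_FEATURE_SME)
/* SME is enabled at compile time: use the real ACLE keywords. */
#define STREAMING            __arm_streaming
#define STREAMING_COMPATIBLE __arm_streaming_compatible
#else
/* Assumption for illustration only: without SME the annotations
 * expand to nothing and the code is ordinary C. */
#define STREAMING
#define STREAMING_COMPATIBLE
#endif

/* Streaming-compatible: the body may execute in either mode. */
int clamp_to_zero(int x) STREAMING_COMPATIBLE {
    return x < 0 ? 0 : x;
}

/* Streaming: the body consists of streaming statements. */
int sum_non_negative(const int *v, int n) STREAMING {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += clamp_to_zero(v[i]);
    }
    return s;
}

/* Non-streaming (the default). The compiler, not the programmer,
 * arranges any mode switch needed around this call. */
int sum_from_non_streaming(const int *v, int n) {
    return sum_non_negative(v, n);
}
```

Note that the source never mentions SMSTART or SMSTOP; where those instructions end up is left to the compiler, within the limits set by the annotations.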
SME also introduces a matrix storage area called ZA, sized SVL.B × SVL.B bytes. It also provides a processor state bit called PSTATE.ZA to control whether ZA is enabled.
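To make the SVL.B × SVL.B sizing concrete, here is a small arithmetic sketch. It assumes the streaming vector length (SVL) is given in bits, so SVL.B is that value divided by 8; the helper names `svl_bytes` and `za_size_bytes` are hypothetical and are not part of ACLE.

```c
#include <stddef.h>

/* SVL.B: the streaming vector length expressed in bytes. */
size_t svl_bytes(size_t svl_bits) {
    return svl_bits / 8;
}

/* ZA holds SVL.B x SVL.B bytes. */
size_t za_size_bytes(size_t svl_bits) {
    size_t b = svl_bytes(svl_bits);
    return b * b;
}
```

For example, an implementation with a 512-bit SVL has SVL.B = 64, so ZA holds 64 × 64 = 4096 bytes. A 128-bit SVL gives a 256-byte ZA.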
In C and C++, ZA usage is specified at the function level: a function either has ZA state or it does not.
Functions that use ZA can either share ZA state with their caller or create new ZA state from scratch.
When new state is needed, the compiler is responsible for preserving the caller’s state using a lazy saving scheme. For more information, see the AAPCS64 section of the ACLE spec.
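The function-level ZA annotations can be sketched as follows. `__arm_inout("za")` and `__arm_new("za")` are ACLE keywords; the function names and the `SHARES_ZA` / `NEW_ZA` macros are hypothetical. The bodies deliberately do not touch ZA, since the point here is only where the annotations go, and the `__ARM_FEATURE_SME` guard is an assumption that lets the sketch compile as plain C when SME is not enabled.

```c
#if defined(__ARM_FEATURE_SME)
/* Reads and writes ZA state shared with the caller. */
#define SHARES_ZA __arm_inout("za")
/* Creates fresh ZA state for the duration of the function. */
#define NEW_ZA    __arm_new("za")
#else
/* Assumption for illustration only: no SME, annotations vanish. */
#define SHARES_ZA
#define NEW_ZA
#endif

/* Shares ZA with its caller, so it can only be called from a
 * function that itself has ZA state. */
void accumulate_step(int *acc, int x) SHARES_ZA {
    *acc += x;
}

/* Owns new ZA state. If a caller's ZA contents are live, preserving
 * them is the compiler's job (the lazy saving scheme). */
NEW_ZA int run_kernel(int n) {
    int acc = 0;
    for (int i = 1; i <= n; i++) {
        accumulate_step(&acc, i);
    }
    return acc;
}
```

A caller without ZA state can call `run_kernel` directly, because the function sets up its own ZA state rather than relying on the caller's.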