Benchmarking
In this section, you’ll benchmark matrix multiplication performance using SME2, if your machine supports native execution of SME2 instructions.
Emulation is generally not the best way to assess the performance of a piece of code: an emulator focuses on correctly simulating instructions, not on accurately modeling execution timing. For example, as explained in the outer product section, improving performance involves increasing the macc-to-load ratio. Emulators, including the FVP, do not model memory bandwidth, cache behavior, or latency in detail. At best, an emulator provides an instruction count for the vanilla reference implementation versus the assembly- and intrinsics-based versions of the matrix multiplication, which is useful for functional validation but not for precise benchmarking.
Benchmarking and profiling are complex tasks. This Learning Path provides a simplified framework for observing SME2-related performance improvements.
If your machine natively supports SME2, or if the target Android phone supports SME2, then benchmarking is possible. When sme2_matmul_asm and sme2_matmul_intr are compiled with BAREMETAL=0, benchmarking mode is available. It is enabled by prepending an iteration count (I) to the optional M, K, N parameters on the command line.
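As a rough illustration of this command-line convention, the argument handling might look like the following sketch. The structure and function names here are hypothetical, not taken from the Learning Path sources; the defaults match the sizes printed in the benchmark output.

```c
#include <stdlib.h>

/* Hypothetical sketch: parse "[iterations [M K N]]" from the command line.
 * An iteration count as the first argument enables benchmarking mode;
 * M, K, N default to the sizes used throughout this Learning Path. */
typedef struct {
    int iters; /* 0 means no benchmarking, run once */
    int M, K, N;
} bench_args;

static bench_args parse_args(int argc, char *argv[]) {
    bench_args a = { .iters = 0, .M = 125, .K = 70, .N = 35 };
    if (argc > 1)
        a.iters = atoi(argv[1]);    /* iteration count (I) */
    if (argc > 4) {                 /* optional M K N override */
        a.M = atoi(argv[2]);
        a.K = atoi(argv[3]);
        a.N = atoi(argv[4]);
    }
    return a;
}
```

With this layout, `prog 1000` benchmarks the default sizes for 1000 iterations, and `prog 1000 7 8 9` benchmarks the M=7, K=8, N=9 case.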
Now measure the execution time of sme2_matmul_intr for 1000 multiplications of
matrices with the default sizes:
./build-native/sme2_matmul_intr 1000
SME2 Matrix Multiply fp32 *intr* [benchmarking mode, 1000 iterations] with M=125, K=70, N=35
Reference implementation: min time = 101 us, max time = 438 us, avg time = 139.42 us
SME2 implementation *intr*: min time = 1 us, max time = 8 us, avg time = 1.82 us

On the Android target, push the binary to the device, make it executable, and run it:

adb push build-android/sme2_matmul_intr /data/local/tmp
build-android/sme2_matmul_intr: 1 file pushed, 0 skipped. 29.7 MB/s (19456 bytes in 0.001s)
adb shell chmod 755 /data/local/tmp/sme2_matmul_intr
adb shell /data/local/tmp/sme2_matmul_intr 1000
SME2 Matrix Multiply fp32 *intr* [benchmarking mode, 1000 iterations] with M=125, K=70, N=35
Reference implementation: min time = 115 us, max time = 808 us, avg time = 117.98 us
SME2 implementation *intr*: min time = 5 us, max time = 21 us, avg time = 6.56 us
The execution times are reported in microseconds. A wide spread between the minimum and maximum figures is expected, because the benchmarking framework is deliberately simplified. You will, however, note that the intrinsics version of the matrix multiplication brings a significant reduction in execution time, from 18x to 76x depending on the device.
You can override the default values for M (125), K (70), and N (35) by providing your own values on the command line. For example, you can benchmark the M=7, K=8, and N=9 case with:
./build-native/sme2_matmul_intr 1000 7 8 9
SME2 Matrix Multiply fp32 *intr* [benchmarking mode, 1000 iterations] with M=7, K=8, N=9
Reference implementation: min time = 0 us, max time = 14 us, avg time = 0.93 us
SME2 implementation *intr*: min time = 0 us, max time = 1 us, avg time = 0.61 us

And on the Android target:

adb shell /data/local/tmp/sme2_matmul_intr 1000 7 8 9
SME2 Matrix Multiply fp32 *intr* [benchmarking mode, 1000 iterations] with M=7, K=8, N=9
Reference implementation: min time = 0 us, max time = 1 us, avg time = 0.37 us
SME2 implementation *intr*: min time = 0 us, max time = 8 us, avg time = 0.32 us

Note that at such small matrix sizes, the individual run times approach the microsecond resolution of the timer (hence the 0 us minimums), so these figures should be read with caution.
Now measure the execution time of sme2_matmul_asm for 1000 multiplications of
matrices with the default sizes:
./build-native/sme2_matmul_asm 1000
SME2 Matrix Multiply fp32 *asm* [benchmarking mode, 1000 iterations] with M=125, K=70, N=35
Reference implementation: min time = 101 us, max time = 373 us, avg time = 136.49 us
SME2 implementation *asm*: min time = 1 us, max time = 8 us, avg time = 1.44 us

On the Android target:

adb push build-android/sme2_matmul_asm /data/local/tmp
build-android/sme2_matmul_asm: 1 file pushed, 0 skipped. 29.7 MB/s (19456 bytes in 0.001s)
adb shell chmod 755 /data/local/tmp/sme2_matmul_asm
adb shell /data/local/tmp/sme2_matmul_asm 1000
SME2 Matrix Multiply fp32 *asm* [benchmarking mode, 1000 iterations] with M=125, K=70, N=35
Reference implementation: min time = 114 us, max time = 754 us, avg time = 117.91 us
SME2 implementation *asm*: min time = 3 us, max time = 18 us, avg time = 4.14 us
You’ll notice that although the vanilla reference matrix multiplication is the same, there is some variability in the execution time.
The assembly version of the SME2 matrix multiplication runs slightly faster. However, this should not convince you that assembly is inherently better, as this is not an apples-to-apples comparison: the assembly version makes an assumption about the K parameter that the intrinsics version does not.