Accelerate Matrix Multiplication Performance with SME2

About this Learning Path

Skill level:	Advanced
Reading time:	1 hr
Last updated:	11 Jul 2025

Author:	Arnaud de Grandmaison, Arm
Arm IP:	Neoverse, Cortex-A
Tags:	Performance and Architecture, Linux, macOS, Windows, C, Clang, Runbook, LLVM

Who is this for?

This is an advanced Learning Path for developers who want to accelerate matrix multiplication using Arm's Scalable Matrix Extension Version 2 (SME2).

What will you learn?

Upon completion of this Learning Path, you will be able to:

Implement a baseline matrix multiplication kernel in C without SME2
Use SME2 assembly instructions to accelerate matrix multiplication performance
Use SME2 intrinsics to vectorize and optimize matrix multiplication
Compile code with SME2 intrinsics and assembly
Benchmark and validate SME2-accelerated matrix multiplication on Arm hardware or in a Linux-based emulation environment
Compare performance metrics between baseline and SME2-optimized implementations
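
To make the first objective concrete, here is a minimal sketch of what a baseline kernel without SME2 might look like. The function name `matmul_baseline`, the row-major layout, and the `float` element type are assumptions made for illustration; the kernel developed later in this Learning Path may use a different signature or data layout.

```c
#include <stddef.h>

/* Hypothetical baseline kernel (no SME2): computes C = A * B.
 * Assumed layout: A is M x K, B is K x N, C is M x N, all row-major. */
void matmul_baseline(size_t M, size_t K, size_t N,
                     const float *A, const float *B, float *C)
{
    for (size_t m = 0; m < M; m++) {
        for (size_t n = 0; n < N; n++) {
            /* Dot product of row m of A with column n of B. */
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
    }
}
```

A triple loop like this is useful mainly as a reference: its results can be compared against an optimized kernel to validate correctness, and its run time gives a baseline for performance comparisons.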

Prerequisites

Before starting, you will need the following:

Working knowledge of Arm’s SVE and SME2 instruction sets
Intermediate proficiency with the C programming language and the Armv9-A assembly language
A computer running Linux, macOS, or Windows
Installations of Git and Docker for project setup and emulation
A platform that supports SME2 (see the list of devices with SME2 support), or an emulator that can run code with SME2 instructions
Compiler support for SME2 instructions (for example, LLVM 17+ with SME2 backend support)
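
One way to check the compiler prerequisite is to test the ACLE feature macro `__ARM_FEATURE_SME2`, which a compiler defines when it is generating code for a target with SME2 enabled (for example, when the `-march` value includes `+sme2`). The sketch below is illustrative only: the helper name `compiled_with_sme2` is hypothetical, and the check reflects the compilation target, not whether the machine running the program actually has SME2 hardware.

```c
/* Hypothetical helper: reports whether this translation unit was
 * compiled for a target with SME2 enabled.
 * Returns 1 if the ACLE macro __ARM_FEATURE_SME2 is defined, else 0.
 * This is a compile-time property, not a runtime hardware check. */
int compiled_with_sme2(void)
{
#if defined(__ARM_FEATURE_SME2)
    return 1;
#else
    return 0;
#endif
}
```

If the helper returns 0 on a toolchain you expect to support SME2, the target options passed to the compiler are the first thing to review.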

Introduction

Overview

Set up your SME2 development environment

Test your SME2 development environment

Streaming mode and ZA state in SME

Vanilla matrix multiplication

Outer product

SME2 assembly matrix multiplication

Matrix multiplication using SME2 intrinsics in C

Benchmarking

Debugging

Going further

Next Steps
