Vectorization on x86 and Arm

Migrating SIMD (Single Instruction, Multiple Data) code from x86 extensions to Arm extensions is a key task for software developers aiming to optimize performance on Arm platforms.

Understanding the mapping from x86 instruction sets such as SSE, AVX, and AMX to Arm’s NEON, SVE, and SME extensions is essential for achieving portability and high performance. This Learning Path provides an overview to help you design a migration plan that leverages Arm features such as scalable vector lengths and advanced matrix operations to adapt your code effectively.

Vectorization is a key optimization strategy where one instruction processes multiple data elements simultaneously. It drives performance in High-Performance Computing (HPC), AI and ML, signal processing, and data analytics.
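
For example, a loop with independent iterations is a prime candidate for vectorization. The sketch below (function name ours) is plain C that GCC and Clang will typically auto-vectorize at `-O3`:

```c
// Each iteration is independent, so the compiler can auto-vectorize this
// loop (for example, GCC or Clang at -O3); restrict tells it the arrays
// do not overlap, which is often what unlocks vectorization.
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float s, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] * s + b[i];
}
```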

Both x86 and Arm processors offer rich SIMD capabilities, but they differ in philosophy and design. The x86 architecture provides fixed-width vector units of 128, 256, and 512 bits. The Arm architecture offers fixed-width vectors for NEON and scalable vectors for SVE and SME, ranging from 128 to 2048 bits.

If you are migrating SIMD software to Arm, understanding these differences will help you write portable, high-performance code.

Arm vector and matrix extensions

This section introduces the Arm vector and matrix extensions: when to use each, how they map from SSE/AVX/AMX, and what changes in your programming model (predication, gather/scatter, tiles, and streaming mode).

NEON

NEON is a 128-bit SIMD extension available across Armv8-A cores, including Neoverse and mobile. It is well suited to multimedia, DSP, and packet processing. Conceptually, NEON is closest to x86 SSE and AVX used in 128-bit mode, making it the primary target when migrating many SSE workloads. Compiler auto-vectorization to NEON is mature, reducing the need for manual intrinsics.
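
As an illustration, here is a minimal NEON kernel using the standard `arm_neon.h` intrinsics. The function name and the assumption that `n` is a multiple of 4 are ours for brevity:

```c
#include <arm_neon.h>

// Add two float arrays four lanes at a time (n assumed to be a multiple
// of 4 for brevity; real code needs a scalar tail loop).
void add_f32(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);       // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));   // lane-wise add, store
    }
}
```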

Scalable Vector Extension (SVE)

SVE takes a vector-length-agnostic (VLA) approach to SIMD. Registers can range from 128 to 2048 bits, with the exact width determined by the hardware implementation in multiples of 128 bits. This flexibility allows the same binary to run efficiently across different hardware generations. SVE also features per-element predication, which replaces branching with masked execution, and native gather/scatter support for efficient handling of irregular memory accesses. SVE is ideal for high-performance computing (HPC) and future-proof portability, but you must adapt to a programming model that differs significantly from fixed-width SIMD. Among x86 extensions, SVE is most similar to AVX-512, while offering greater portability and scalability.
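
The sketch below shows the VLA style in practice: a predicated SAXPY loop using `arm_sve.h` intrinsics. There is no separate tail loop because `svwhilelt` deactivates lanes past `n` (function name and structure are illustrative):

```c
#include <arm_sve.h>
#include <stdint.h>

// Vector-length-agnostic SAXPY: y[i] += a * x[i]. The same binary adapts
// to any SVE implementation width.
void saxpy(float a, const float *x, float *y, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {   // svcntw: floats per vector
        svbool_t pg = svwhilelt_b32(i, n);        // active lanes: i..n-1
        svfloat32_t vx = svld1(pg, x + i);        // predicated loads
        svfloat32_t vy = svld1(pg, y + i);
        vy = svmla_m(pg, vy, vx, svdup_f32(a));   // vy += vx * a, per lane
        svst1(pg, y + i, vy);                     // predicated store
    }
}
```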

Scalable Matrix Extension (SME)

SME accelerates matrix multiplication and is similar in intent to AMX. Unlike AMX, which often uses dot-product oriented operations, SME employs outer-product oriented operations. SME integrates with SVE, using scalable tiles and a streaming mode to optimize performance. It is well suited to AI training and inference, as well as dense linear algebra in HPC applications.
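
As a rough illustration of the programming model, the hypothetical sketch below marks a function as streaming, gives it private ZA tile state, and performs one predicated rank-1 (outer product) accumulation. It assumes a toolchain with ACLE SME support (for example, recent Clang with `-march=armv9-a+sme`); treat the intrinsic and attribute names as indicative and check your compiler's documentation:

```c
#include <arm_sme.h>
#include <stdint.h>

// Hypothetical sketch: one rank-1 update into the ZA accumulator.
// __arm_streaming runs the body in streaming SVE mode; __arm_new("za")
// gives the function fresh ZA state that it must not leak to callers.
void rank1_update(const float *a_col, const float *b_row, uint64_t n)
    __arm_streaming __arm_new("za") {
    svzero_za();                                  // clear the ZA tiles
    svbool_t pa = svwhilelt_b32((uint64_t)0, n);  // predicate over a_col
    svbool_t pb = svwhilelt_b32((uint64_t)0, n);  // predicate over b_row
    svfloat32_t va = svld1(pa, a_col);            // one column of A
    svfloat32_t vb = svld1(pb, b_row);            // one row of B
    // ZA tile 0 += va * vb^T for lanes where both predicates are active.
    svmopa_za32_m(0, pa, pb, va, vb);
    // A real kernel loops over k and then reads the result back out of ZA.
}
```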

x86 vector and matrix extensions

Here is a brief overview of the x86 families you’ll likely port from: SSE (128-bit), AVX/AVX-512 (256/512-bit with masking), and AMX (tile-based matrix compute). Use this to identify feature equivalents before mapping kernels to NEON, SVE/SVE2, or SME on Arm.

Streaming SIMD Extensions (SSE)

The SSE instruction set provides 128-bit XMM registers and supports both integer and floating-point SIMD operations. Despite being an older technology, SSE remains a baseline for many libraries due to its widespread adoption.

However, its fixed-width design can constrain throughput compared with newer extensions like AVX. When migrating code from SSE to Arm, developers will find that SSE maps well to Arm NEON, enabling a relatively straightforward transition.
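
A common migration pattern is to hide each 128-bit operation behind a small inline wrapper and select the intrinsic per architecture at compile time. A minimal sketch (the type and function names are ours):

```c
// One inline helper per operation keeps the per-ISA code in a single place.
#if defined(__ARM_NEON)
  #include <arm_neon.h>
  typedef float32x4_t vf32x4;                    // 4-lane float vector
  static inline vf32x4 vf32x4_add(vf32x4 a, vf32x4 b) { return vaddq_f32(a, b); }
#elif defined(__SSE__)
  #include <xmmintrin.h>
  typedef __m128 vf32x4;
  static inline vf32x4 vf32x4_add(vf32x4 a, vf32x4 b) { return _mm_add_ps(a, b); }
#endif
```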

Advanced Vector Extensions (AVX)

AVX provides 256-bit YMM registers, and AVX-512 adds 512-bit ZMM registers. Features include FMA, per-lane masking in AVX-512, and VEX or EVEX encodings. When moving AVX workloads to Arm, 128-bit paths often translate to NEON, while algorithms that scale with vector width are good candidates for SVE. Because SVE is vector-length agnostic, refactor for predication and scalable loops to maintain portability and performance.
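
To make the refactor concrete, this sketch (function name ours) handles an AVX-512 loop tail with an explicit mask register; the SVE equivalent replaces the hard-coded width of 16 floats and the mask arithmetic with `svwhilelt`, as in the SAXPY sketch above:

```c
#include <immintrin.h>
#include <stdint.h>

// Scale an array in place, masking off lanes past the end of the array.
void scale(float *x, float a, int64_t n) {
    for (int64_t i = 0; i < n; i += 16) {
        __mmask16 m = (n - i >= 16) ? 0xFFFF
                                    : (__mmask16)((1u << (n - i)) - 1);
        __m512 v = _mm512_maskz_loadu_ps(m, x + i);       // masked load
        _mm512_mask_storeu_ps(x + i, m,                    // masked store
                              _mm512_mul_ps(v, _mm512_set1_ps(a)));
    }
}
```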

Advanced Matrix Extensions (AMX)

AMX accelerates matrix operations with tile registers configured using a tile palette. It suits AI workloads such as GEMM and convolutions. When migrating AMX kernels to Arm, target SME. While both target matrix compute, AMX commonly expresses dot products, while SME focuses on outer products, so porting often entails algorithmic adjustments.
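
The scalar sketch below (hypothetical functions `gemm_dot` and `gemm_outer`) contrasts the two formulations for a square GEMM: a dot-product kernel accumulates along `k` innermost (the AMX-friendly shape), while an outer-product kernel applies one rank-1 update per `k` (the SME-friendly shape). The arithmetic is identical, so the port is a loop and data-layout reordering rather than a change in results:

```c
// AMX-style: C[i][j] += dot(A row i, B column j).
void gemm_dot(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;                  // dot product over k
            for (int k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];
            C[i*n + j] += acc;
        }
}

// SME-style: C += (column k of A) * (row k of B)^T, one rank-1 update per k.
void gemm_outer(const float *A, const float *B, float *C, int n) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```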

Comparison tables

Use these side-by-side tables to pick the right Arm target and plan refactors. They compare register width, predication/masking, gather/scatter, key operations, typical workloads, and limitations for SSE ↔ NEON, AVX/AVX-512 ↔ SVE/SVE2, and AMX ↔ SME.

A comparison of SSE and NEON

| Feature | SSE | NEON |
|---|---|---|
| Register width | 128-bit (XMM) | 128-bit (Q) |
| Vector length model | Fixed 128 bits | Fixed 128 bits |
| Predication or masking | Minimal; no dedicated mask registers | No dedicated mask registers; use bitwise selects and conditionals |
| Gather/scatter | No native gather/scatter (gather arrives in AVX2, scatter in AVX-512) | No native gather/scatter; emulate in software |
| Instruction set scope | Arithmetic, logical, shuffle, convert, basic SIMD | Arithmetic, logical, shuffle, saturating ops; cryptography via Armv8 Cryptography Extensions (AES and SHA) |
| Floating-point support | Single and double precision | Single and double precision |
| Typical applications | Legacy SIMD, general vector arithmetic | Multimedia, DSP, cryptography, embedded compute |
| Extensibility | Extended by AVX, AVX2, and AVX-512 | Fixed at 128-bit; scalable vectors provided by SVE as a separate extension |
| Programming model | Intrinsics in C/C++; assembly for hotspots | Intrinsics widely used; inline assembly less common |
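
As the table notes, NEON has no native gather; a common software emulation assembles a vector lane by lane from scalar loads. A sketch (function name ours, indices assumed in bounds):

```c
#include <arm_neon.h>

// Emulated gather: load base[idx[0..3]] into one 4-lane vector.
float32x4_t gather_f32(const float *base, const int idx[4]) {
    float32x4_t v = vdupq_n_f32(0.0f);
    v = vsetq_lane_f32(base[idx[0]], v, 0);
    v = vsetq_lane_f32(base[idx[1]], v, 1);
    v = vsetq_lane_f32(base[idx[2]], v, 2);
    v = vsetq_lane_f32(base[idx[3]], v, 3);
    return v;
}
```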

A comparison of AVX and SVE (SVE2)

| Feature | x86: AVX or AVX-512 | Arm: SVE or SVE2 |
|---|---|---|
| Register width | Fixed: 256-bit YMM, 512-bit ZMM | Scalable: 128 to 2048 bits in 128-bit steps |
| Vector length model | Fixed; often multiple code paths for different widths | Vector-length agnostic; same binary adapts to hardware width |
| Predication or masking | Mask registers in AVX-512 | Rich predication via predicate registers |
| Gather/scatter | Gather in AVX2, scatter in AVX-512 | Native gather and scatter across widths |
| Key operations | Wide SIMD, FMA, conflict detection, advanced masking | Wide SIMD, FMA, predication, gather/scatter, reductions, bit manipulation |
| Best suited for | HPC, AI and ML, scientific computing, analytics | HPC, AI and ML, scientific computing, cloud and scalable workloads |
| Limitations | Power and thermal headroom under heavy 512-bit use; ecosystem complexity | Requires VLA programming style; SVE/SVE2 hardware availability varies by platform |
Note: SVE2 extends SVE with richer integer and DSP capabilities for general-purpose and media workloads.

A comparison of AMX and SME

| Feature | x86: AMX | Arm: SME |
|---|---|---|
| Register model | Tile registers configured via a palette; fixed per-type limits | Scalable matrix tiles integrated with SVE; implementation-dependent dimensions |
| Vector length model | Fixed tile geometry per configuration | Scales with SVE vector length and streaming mode |
| Predication or masking | Predication not inherent to tiles | Predication via SVE predicate registers |
| Gather/scatter | Not provided in AMX tiles; handled elsewhere | Via SVE integration with gather/scatter |
| Key operations | Dot-product-oriented GEMM and convolution | Outer-product matrix multiply; streaming mode for dense linear algebra |
| Best suited for | AI and ML training and inference, GEMM and convolution kernels | AI and ML training and inference, scientific and HPC dense linear algebra |
| Limitations | Hardware and software availability limited to specific CPUs | Emerging hardware support; compiler and library support evolving |

The key differences for developers

The most significant changes when porting include moving from fixed-width SIMD to vector-length-agnostic loop structures, replacing mask-register control with predicate-driven control, and adjusting memory access patterns and compiler flags. Review this section first to minimize rework and preserve portable performance.

Vector length model

x86 SIMD (SSE, AVX, and AVX-512) uses fixed widths of 128, 256, or 512 bits. This often requires multiple code paths or dispatch strategies. Arm NEON is also fixed at 128-bit and is a familiar baseline. SVE and SME introduce vector-length agnostic execution from 128 to 2048 bits so the same binary scales across implementations.
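
You can observe the VLA model directly by querying the vector length at run time. This small program (ours for illustration) prints a different width on different SVE implementations without recompilation:

```c
#include <arm_sve.h>
#include <stdio.h>

// Prints the SVE vector length chosen by the hardware; the same binary
// reports 128 bits on one core and, say, 256 or 512 on another.
int main(void) {
    printf("SVE vector length: %lu bits\n", svcntb() * 8);
    return 0;
}
```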

Programming and intrinsics

x86 intrinsics are extensive, and AVX-512 adds masks and lane controls that increase complexity. NEON intrinsics look familiar to SSE developers. SVE and SME use predication and scalable loops. Prefer auto-vectorization and VLA-friendly patterns over heavy hand-written intrinsics when portability matters.

Matrix acceleration

AMX provides fixed-geometry tile compute optimized for dot products. SME extends Arm’s scalable model with outer-product math, scalable tiles, and streaming mode. Both AMX and SME are currently available on a limited set of platforms.

Summary

Migrating from x86 SIMD to Arm entails adopting Arm’s scalable and predicated programming model with SVE and SME for forward-portable performance, while continuing to use NEON for fixed-width SIMD similar to SSE.

Migration tools

Several libraries help translate or abstract SIMD intrinsics to speed up migration. Coverage varies, and some features have no direct analogue.

Here are some of the tools available and their key features:

  • sse2neon: an open-source header that maps many SSE/SSE2 intrinsics to NEON equivalents. Good for getting code building quickly, as sketched after this list; review the result for performance. See the sse2neon GitHub repository.
  • SIMD Everywhere (SIMDe): a header-only portability layer that implements many x86 and Arm intrinsics across ISAs, with scalar fallbacks when SIMD is unavailable. See the SIMDe GitHub repository.
  • Google Highway (hwy): a portable SIMD library whose APIs target multiple ISAs, including NEON, SVE where supported, and AVX, without per-ISA code paths. See the Google Highway GitHub repository.
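
For instance, a typical sse2neon port (sketched below with a hypothetical architecture guard) swaps the x86 header for `sse2neon.h` on AArch64 and leaves the existing `_mm_*` calls untouched:

```c
// On AArch64, sse2neon.h defines __m128i, _mm_add_epi32, and many other
// SSE/SSE2 types and intrinsics in terms of NEON; on x86 the original
// header is used unchanged.
#if defined(__aarch64__)
  #include "sse2neon.h"
#else
  #include <emmintrin.h>
#endif

__m128i add_epi32(__m128i a, __m128i b) { return _mm_add_epi32(a, b); }
```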

For more on cross-platform intrinsics, see the Porting architecture-specific intrinsics Learning Path.
