Migrating C/C++ applications from x64 to Arm requires recompiling the source code for the Arm architecture. A simple recompile works much of the time, but not always.
SIMD extensions are one of the common barriers encountered when porting C/C++ applications from x64 to Arm. This article is a short background on intrinsics and how to identify them in code. This Learning Path presents options for how to get the code compiled and running on an Arm-based platform.
Intrinsics are functions which are built into the compiler and not part of a library. They look like function calls, but don’t require an actual function call. When the compiler encounters intrinsics it directly substitutes a sequence of instructions. Intrinsics are often used to access special instructions that don’t have a direct mapping from C/C++ or when performance optimization is needed.
One use of intrinsics is to access SIMD (single-instruction, multiple-data) instructions directly from C/C++ for improved application performance. Intrinsics are easier to work with compared to assembly language, but they often pose a challenge when porting source code to a new architecture.
Intel Streaming SIMD Extensions (SSE) and Arm NEON SIMD instructions increase processor throughput by performing multiple computations with a single instruction. Over the years, Intel and Arm have introduced a variety of SIMD extensions. NEON is used in many of the Arm Cortex-A, Cortex-R, and Neoverse processors.
There are generally 3 ways to program SIMD hardware:
Below is a small example application containing intrinsics.
The source code comes from a
short course titled Efficient Vectorisation with C++
and is copyright (C) Christopher Woods, 2006-2015.
It is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
.
This example can be used to show how to port such code to Arm.
Copy the source code into a file named sse.cpp
.
#include <iostream>
#ifdef __SSE2__
#include <emmintrin.h>
#else
#warning SSE2 support is not available. Code will not compile
#endif
int main(int argc, char **argv)
{
__m128 a = _mm_set_ps(4.0, 3.0, 2.0, 1.0);
__m128 b = _mm_set_ps(8.0, 7.0, 6.0, 5.0);
__m128 c = _mm_add_ps(a, b);
float d[4];
_mm_storeu_ps(d, c);
std::cout << "result equals " << d[0] << "," << d[1]
<< "," << d[2] << "," << d[3] << std::endl;
return 0;
}
There is a reference to the __SSE2__
preprocessor define and the emmintrin.h
header file. These are architecture specific extensions.
The function calls starting with _mm
are
Intel SSE intrinsics
.
You can build this example on an x86_64
machine.
If not already installed, install g++
with the following command (on Ubuntu):
sudo apt install -y g++
Build the code with:
g++ -O2 -msse2 --std=c++14 sse.cpp -o sse
Run the code:
./sse
and observe the output:
result equals 6,8,10,12
Unsurprisingly, the code will fail to build as is on an Arm based machine.
The next sections will show how to migrate this application to Arm Neon.