Now that you have seen the performance of the multithreaded implementation, you can deploy Arm Performance Libraries and compare the results.
Arm Performance Libraries provides optimized standard core math libraries for numerical applications on 64-bit Arm-based processors. Many of the Basic Linear Algebra Subprograms (BLAS), LAPACK, FFT, and sparse routines are built with OpenMP to maximize performance in multi-processor environments.
Use the Arm Performance Libraries install guide to install Arm Performance Libraries on Windows 11.
You can also refer to the Arm Performance Libraries documentation for Windows.
After successful installation, you will find five directories in the installation folder.
The include and lib directories contain the header files and library files, respectively.
Take note of the location of these two directories, as you will need them for configuring Visual Studio.
Figure 9: Arm Performance Libraries Directory.
To use Arm Performance Libraries in the application, you need to manually add the paths into Visual Studio.
Configure two places in your Visual Studio project:
For the include path, open Configuration Properties and select VC++ Directories. In the Additional Include Directories setting, select <Edit...>, then click the New Line icon to add the Arm Performance Libraries include path.
Figure 10: External Include Directories.
For the library path, go to the linker's library directories setting, select <Edit...>, then click the New Line icon to add the Arm Performance Libraries library path.
Figure 11: Linker Library.
Visual Studio allows users to set the above two paths for each individual configuration. To apply the settings to all configurations in your project, select All Configurations in the Configuration drop-down menu.
You are now ready to use Arm Performance Libraries in your project.
Open the source code file SpinTheCubeInGDI.cpp and search for the _USE_ARMPL_DEFINES definition.
You will see a commented-out definition on line 13 of the program. Removing the comment enables the Arm Performance Libraries feature when you rebuild the application.
Figure 12: Arm Performance Libraries Definition.
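The effect of the definition can be sketched as follows. This is a minimal illustration rather than the application's actual source: the helper armplCompiledIn() is hypothetical, and the guard simply mirrors the condition that surrounds the cblas_dgemm call in applyRotationBLAS().

```cpp
// Uncomment to opt in to the Arm Performance Libraries code path
// (this mirrors the commented-out definition in SpinTheCubeInGDI.cpp).
// #define _USE_ARMPL_DEFINES

// Hypothetical helper, for illustration only: reports whether the
// guarded code would be compiled in for this build.
constexpr bool armplCompiledIn()
{
#if defined(_M_ARM64) && defined(_USE_ARMPL_DEFINES)
    return true;   // MSVC targeting Arm64, with the definition enabled
#else
    return false;  // any other target, or the definition left commented out
#endif
}
```

While the definition stays commented out, armplCompiledIn() evaluates to false on every target, so the BLAS call is not compiled into the build.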
When the variable useAPL is true, the application calls applyRotationBLAS() instead of the multithreaded code to apply the rotation matrix to the 3D vertices.
The code is shown below:
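void RotateCube(int numCores)
{
    //
    //
    //
    if (useAPL)
    {
        // Single BLAS call rotates every vertex of the selected shape.
        applyRotationBLAS(UseCube ? cubeVertices : sphereVertices, rotationInX);
    }
    else
    {
        // Signal each worker thread to start.
        for (int x = 0; x < numCores; x++)
        {
            ReleaseSemaphore(semaphoreList[x], 1, NULL);
        }
        // Block until all workers have signaled completion.
        WaitForMultipleObjects(numCores, doneList.data(), TRUE, INFINITE);
    }
    Calculations++;
}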
The applyRotationBLAS() function uses a BLAS matrix multiplication routine, instead of the multithreaded multiplication, to calculate the rotation.
Basic Linear Algebra Subprograms (BLAS) are a set of well-defined basic linear algebra operations, which Arm Performance Libraries implements.
See cblas_dgemm to learn more about the function.
Here is the code used to compute rotation with BLAS:
void applyRotationBLAS(std::vector<double>& shape, const std::vector<double>& rotMatrix)
{
    EnterCriticalSection(&cubeDraw[0]);
#if defined(_M_ARM64) && defined(_USE_ARMPL_DEFINES)
    // Call the BLAS matrix multiply for doubles.
    // Multiplies each of the 3D points in the shape list
    // by the rotation matrix, and applies the scale.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)shape.size() / 3, 3, 3,    // M, N, K
                scale,                          // alpha
                shape.data(), 3,                // A (M x 3), lda
                rotMatrix.data(), 3,            // B (3 x 3), ldb
                0.0,                            // beta
                drawSphereVertecies.data(), 3); // C (M x 3), ldc
#endif
    LeaveCriticalSection(&cubeDraw[0]);
}
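To make the arguments of the cblas_dgemm call concrete, here is a minimal reference sketch of the same arithmetic written as plain loops. It is not the Arm Performance Libraries implementation and will be far slower. The function name rotateReference and its parameters are hypothetical, introduced only for illustration. With CblasRowMajor and no transposition, the call computes C = alpha * A * B + beta * C, where A is the vertex list with shape.size() / 3 rows and 3 columns, B is the 3 x 3 rotation matrix, alpha is scale, and beta is 0.0, so the previous contents of the output buffer are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference version of the row-major product
//   out = scale * (shape * rotMatrix)
// shape:     N x 3 vertices stored as x0,y0,z0,x1,y1,z1,...
// rotMatrix: 3 x 3 rotation matrix, stored row by row
std::vector<double> rotateReference(const std::vector<double>& shape,
                                    const std::vector<double>& rotMatrix,
                                    double scale)
{
    assert(shape.size() % 3 == 0);
    assert(rotMatrix.size() == 9);

    const std::size_t numVertices = shape.size() / 3;  // M in the dgemm call
    std::vector<double> out(shape.size(), 0.0);        // beta = 0.0: old output is discarded

    for (std::size_t i = 0; i < numVertices; ++i)      // each vertex (row of A)
    {
        for (std::size_t j = 0; j < 3; ++j)            // each output coordinate (N = 3)
        {
            double sum = 0.0;
            for (std::size_t k = 0; k < 3; ++k)        // inner dimension (K = 3)
            {
                sum += shape[i * 3 + k] * rotMatrix[k * 3 + j];
            }
            out[i * 3 + j] = scale * sum;              // alpha = scale
        }
    }
    return out;
}
```

For example, with the identity matrix as rotMatrix and a scale of 2.0, the vertex (1, 2, 3) maps to (2, 4, 6). Each vertex is treated as a row vector multiplied on the left of the rotation matrix, which is why the rows of rotMatrix are indexed by the inner loop variable.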
Rebuild the code and run SpinTheCubeInGDI.exe again.
Click on the Options menu in the top-left corner of the program, then select Use APL to utilize Option 2.
Figure 13: Selecting Arm Performance Libraries.
On the Lenovo X13s, the performance is between 11k and 12k FPS.
Figure 14: Spinning Geometry Demonstration: Arm64.
Re-run the profiling tools.
You will see that the CPU usage has decreased significantly. There is no difference in memory usage.
Figure 15: Improved CPU Performance.
You have learned how to optimize application performance using Arm Performance Libraries.