This section explores what KleidiAI does under the hood and how it accelerates AI workloads.

Navigate to the KleidiAI GitLab repository and view the example file that you will be running. This example highlights the *i8mm* matrix multiplication micro-kernel, alongside the enabling packing and quantization micro-kernels.

The example code is structured to take in two matrices and compare KleidiAI’s micro-kernel results to a reference implementation.

This reference code is functionally identical to KleidiAI's micro-kernels, and is present for two reasons:

- To validate the KleidiAI micro-kernel output as a sanity check.
- To explain how KleidiAI micro-kernels work. KleidiAI's micro-kernels are implemented in hand-optimized C and Arm assembly code, which can be difficult to interpret.

Follow these steps to build and run the KleidiAI library and example script:

- Create an Ubuntu 24.04 Arm Linux machine on an AWS EC2 instance. For more details, view the Learning Path on setting up AWS EC2 Graviton instances. Use an m7g.medium instance type, which uses the Graviton3 SoC supporting the *i8mm* Arm architecture feature. The 1 vCPU and 4 GB of RAM in the m7g.medium are sufficient for this basic example run.

- Initialize your system by installing essential packages:

```bash
sudo apt update
sudo apt install cmake g++ make
```

- Clone the KleidiAI repository to your server:

```bash
git clone https://git.gitlab.arm.com/kleidi/kleidiai.git
```

- Build the KleidiAI library with CMake. After running these commands you should see a library file named `libkleidiai.a`:

```bash
cd kleidiai
mkdir build
cd build
cmake ..
make
```

- Build the example file with CMake. From the `kleidiai/build` directory:

```bash
cd ../examples/matmul_clamp_f32_qai8dxp_qsi4cxp/
mkdir build
cd build
cmake ..
make
```

If you get an error that looks like `#error "Dotprod and I8mm extensions required to compile this example"`, then you need to signal to KleidiAI that your machine supports the Dotprod and i8mm extensions. Do this by adding the following `add_compile_options` line to the `CMakeLists.txt` file in the `/examples/matmul_clamp_f32_qai8dxp_qsi4cxp/` directory:

```cmake
cmake_minimum_required(VERSION 3.16)

set(CMAKE_CXX_STANDARD 17)
set(KLEIDIAI_PATH ../../)
set(MATMUL_PACK_PATH ${KLEIDIAI_PATH}/kai/ukernels/matmul/pack/)
set(MATMUL_PATH ${KLEIDIAI_PATH}/kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4cxp/)

######## Add this command
add_compile_options(-march=armv8.6-a+dotprod+i8mm)
```

Then clear your build directory and build again.

- Run the example. You will see each micro-kernel being tested, followed by confirmation that the reference implementation and KleidiAI’s implementation are numerically equal - signaling success!

```bash
./matmul_clamp_f32_qai8dxp_qsi4cxp
```

```
Testing matmul_clamp_f32_qai8dxp1x8_qsi4cxp4x8_1x4x32_neon_dotprod
TEST[0] = PASSED
Testing matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod
TEST[1] = PASSED
Testing matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_4x4x32_neon_i8mm
TEST[2] = PASSED
Testing matmul_clamp_f32_qai8dxp4x8_qsi4cxp4x8_8x4x32_neon_i8mm
TEST[3] = PASSED
Testing matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_4x8x32_neon_i8mm
TEST[4] = PASSED
Testing matmul_clamp
```

Again, this manual interaction with KleidiAI is for demonstration purposes. In real-world applications, your ML framework handles interfacing with the KleidiAI library.

For the rest of this walk-through, you will look over the reference kernel functions to understand what KleidiAI is doing before viewing under-the-hood snippets of hand-coded Arm instructions.

Next, you will look at KleidiAI's reference implementation in the example file to understand what the micro-kernels do to accelerate AI inference. The reference implementation is easier to follow in code and is functionally equivalent to KleidiAI's actual micro-kernels. Links to KleidiAI's specific micro-kernel routines are also included.

You will now look at the three main routines:

- Quantizing & Packing the RHS matrix.
- Quantizing & Packing the LHS matrix.
- Matrix multiplication.

Recall that the RHS matrix is populated by the AI model weights, which are dynamically quantized to INT4 format if they are not in that format already. The RHS quantizing and packing function is defined in the example file as follows:

```cpp
static void quant_qs4cx_f32(size_t n, size_t k, const float* rhs_f32, uint8_t* rhs_qs4cx, float* rhs_scales_f32)
```

The function takes in the following parameters:

- Input matrix dimensions (`n`, `k`).
- Input matrix location (`rhs_f32`).
- Output matrix location (`rhs_qs4cx`). This is a 1-dimensional matrix that packs the INT4 numbers into 8-bit chunks of memory.
- Scale factors (`rhs_scales_f32`). This provides the exact number used to quantize the incoming model weights (up to FP32 format) to INT4. Each channel is quantized independently and thus has a unique scale factor.

You can tell the target is of the INT4 type based on the naming, `qs4cx`, which translates to quantized (q), symmetric (s), 4-bit integers (4), per channel (cx). Each 'channel' in this matrix refers to all the weights feeding into a given channel (that is, a given neuron). Because the weights feeding into different neurons may vary significantly, quantizing by channel offers improved precision and flexibility.

After calculating variables like the destination matrix stride length and scale factors, the quantization and packing occurs:

```cpp
uint8_t* dst_ptr = (uint8_t*)rhs_qs4cx + row_idx * dst_stride;

// Quantize the channels
for (size_t k_idx = 0; k_idx < k; k_idx += 2) {
    const float src0_0 = src_ptr[k_idx + 0];
    const float src0_1 = src_ptr[k_idx + 1];

    // Scale the values
    int32_t v0_s32 = (int32_t)(round(src0_0 * scale0));
    int32_t v1_s32 = (int32_t)(round(src0_1 * scale0));

    // Maximum/minimum int4 values
    v0_s32 = std::max(v0_s32, INT4_MIN);
    v0_s32 = std::min(v0_s32, INT4_MAX);
    v1_s32 = std::max(v1_s32, INT4_MIN);
    v1_s32 = std::min(v1_s32, INT4_MAX);

    // Offset into the unsigned range and pack two values per byte
    const uint8_t v0_u8 = (uint8_t)(v0_s32 + 8);
    const uint8_t v1_u8 = (uint8_t)(v1_s32 + 8);
    const uint8_t rhs_v0 = (v1_u8 << 4) | v0_u8;

    dst_ptr[0] = rhs_v0;
    dst_ptr += sizeof(uint8_t);
}
```

Next, the LHS matrix is quantized and packed into memory.

The LHS matrix will be populated by the inputs flowing through the model layers. They are also dynamically quantized, this time to the INT8 format. The LHS quantizing and packing function looks like:

```cpp
static void ref_quant_qa8dx_f32(size_t m, size_t k, const float* lhs_f32, int8_t* lhs_qa8dx) {
```

The function takes in the following parameters:

- Input matrix dimensions (`m`, `k`). Note that `k` is a shared dimension between RHS and LHS, ensuring matrix multiplication is possible.
- Input matrix location (`lhs_f32`).
- Output matrix location (`lhs_qa8dx`). This is a 1-dimensional matrix.

Scale factors are also required in the LHS quantization, but for computational efficiency are stored in the LHS output matrix directly.

You can tell the target is of the INT8 type based on the naming, `qa8dx`, which translates to quantized (q), asymmetric (a), 8-bit integers (8), per dimension (dx). Each 'dimension' in this matrix refers to a different feature in the input data (for example, in image processing, the red, green, and blue color channels). Because each input dimension can have a different distribution of values, quantizing by dimension offers improved precision and consistency.

The next few steps in this LHS micro-kernel are functionally identical to the RHS micro-kernel: calculating the stride length and scale factor. However, before quantization takes place, the LHS micro-kernel calculates where the appropriate zero point should be, so that a wider range of input values not centered around zero can be handled flexibly. This is referred to as 'asymmetric' quantization, allowing a different scale and zero point for each matrix row.

The zero point is calculated here:

```cpp
const float zero_point_from_min_error0 = qmin + descaled_min0;
const float zero_point_from_max_error0 = qmax + descaled_max0;

float zero_point0 =
    zero_point_from_min_error0 + zero_point_from_max_error0 > 0 ? qmin - descaled_min0 : qmax - descaled_max0;

zero_point0 = std::max(zero_point0, qmin);
zero_point0 = std::min(zero_point0, qmax);

// Round to nearest integer
const int32_t nudged_zero_point0 = lrintf(zero_point0);
```

After quantizing and packing the LHS matrix (similar to the RHS matrix process), the numbers are prepared for efficient matrix multiplication.


The reference matrix multiplication takes place in the function `ref_matmul_f32_qa8dx_qs4cx`:

```cpp
ref_matmul_f32_qa8dx_qs4cx(
    m, n, k, (const int8_t*)lhs_ref_mtx_qa8dx, (const uint8_t*)rhs_native_mtx_qs4cx, (const float*)rhs_scales_f32,
    (float*)dst_ref_mtx_f32, -FLT_MAX, FLT_MAX);
```

The function takes in the following parameters:

- Input matrix dimensions (`m`, `n`, `k`).
- Input matrices, quantized and packed (`lhs_ref_mtx_qa8dx`, `rhs_native_mtx_qs4cx`).
- The scale factors for the RHS matrix (`rhs_scales_f32`). Recall that the LHS offsets are stored in the matrix itself.
- The pointer to the output matrix (`dst_ref_mtx_f32`).
- Upper and lower bounds on floating-point numbers for easier processing (`-FLT_MAX`, `FLT_MAX`).

The function’s name describes what it does and its input/output formats. It is the reference (ref) matrix multiplication (matmul), outputting in floating-point 32-bit format (f32), taking in two matrices of previously described formats.

The implementation of the matrix multiplication is relatively straightforward. First, the function calculates how far to stride over each input matrix's rows while iterating:

```cpp
const size_t lhs_stride = k * sizeof(int8_t) + sizeof(float) + sizeof(int32_t);
const size_t rhs_stride = (k / 2) * sizeof(uint8_t);
```

Next, the main loop begins, incrementing row and column indexes to obtain the correct pointers (`lhs_ptr` and `rhs_ptr`) to access individual rows in each input matrix. This is the loop where matrix multiplication actually takes place. The loop iterates over the input matrices' shared dimension `k`. It steps in increments of 2 (`b += 2`) because the LHS matrix supplies two INT8 numbers to multiply against the two INT4 numbers unpacked from one byte of the RHS matrix. The products from each pair are accumulated into the 32-bit integer `iacc` variable.

```cpp
for (size_t b = 0; b < k; b += 2) {
    // Get the LHS values
    const int32_t lhs_v0 = (int32_t)lhs_ptr[0];
    const int32_t lhs_v1 = (int32_t)lhs_ptr[1];

    // Get the RHS values
    const uint8_t rhs_byte = rhs_ptr[0];

    // Unpack the RHS values
    const int32_t rhs_v0 = (((int32_t)(rhs_byte & 0x0F)) - 8);
    const int32_t rhs_v1 = (((int32_t)(rhs_byte >> 4)) - 8);

    iacc += lhs_v0 * rhs_v0;
    iacc += lhs_v1 * rhs_v1;
```

These dot product calculations, `iacc += lhs_v0 * rhs_v0; iacc += lhs_v1 * rhs_v1;`, are the operation that the *i8mm* architecture feature substantially accelerates. These calculations are performed using *SMMLA* instructions. You can view the hand-coded assembly implementation of this function in this KleidiAI micro-kernel.

The last step is to convert the result back into the native number format, using the LHS and RHS scale factors. The final result is `dst_ref_mtx_f32`, the output matrix in the native number format. This matrix is represented in 1D, and is the ultimate output of the KleidiAI micro-kernels.

If you navigate to the `/kleidiai/kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4cxp` directory, where the KleidiAI INT8/INT4 micro-kernel implementation is located, you can see multiple variants of the same micro-kernel. The difference between them lies in the in/out matrix processing block size. Instead of performing matrix multiplication on the entire `m` x `k` and `n` x `k` matrices at once, KleidiAI provides matmul variants that perform smaller operations using hand-optimized assembly code. This leads to more efficient matrix multiplication.

Different ML workloads and models vary in performance across the KleidiAI matmul variants. Your ML framework selects the appropriate micro-kernel to maximize workload performance.

You now have an understanding of how KleidiAI accelerates matrix multiplication, and ultimately how GenAI models can run efficiently on Arm CPUs from servers to smartphones.

Integrating KleidiAI into software frameworks is leading to significant performance boosts in real-world generative AI workloads. For instance, Meta's Llama 3 and Microsoft's Phi-3 LLMs experience a 190 percent faster time-to-first-token on the new Arm Cortex-X925 CPU. Additionally, KleidiAI improved the time-to-first-token for Gemma 2B on the Google Pixel 8 Pro by 25 percent.