This section provides a high-level overview of KleidiAI’s components before diving into the specifics. The KleidiAI source files are publicly available in the KleidiAI GitLab repository. Open the repository in your web browser to follow along and explore KleidiAI’s structure.
KleidiAI’s micro-kernels are located in the /kai/ukernels/matmul directory; navigate there now.
There are essentially two types of KleidiAI micro-kernels today:

- Packing micro-kernels, located in the pack directory.
- Matrix multiplication micro-kernels, located in the directories whose names start with matmul_clamp.

Each directory contains routines specialized for a specific input data type. KleidiAI has multiple matrix multiplication micro-kernels, and dynamic quantization routines, to optimally support all model quantization levels. To learn more about model quantization and how selecting the right quantization level affects your AI-based application, refer to this Learning Path.
KleidiAI currently has three matrix multiplication micro-kernel directories, each handling a different combination of input and output types; this set will evolve to broaden support:
| uKernel | Output type | Input types |
|---|---|---|
| matmul_clamp_f16_f16_f16 | 16-bit floating-point | 16-bit floating-point |
| matmul_clamp_f32_f32_f32 | 32-bit floating-point | 32-bit floating-point |
| matmul_clamp_f32_qa8dxP_qs4cxP | 32-bit floating-point | 8-bit integer and 4-bit integer |
Only one matrix multiplication micro-kernel is used for a given AI application. Each AI model and workload (for example, a Gemma model running text generation) has unique inference characteristics, and KleidiAI provides different matrix multiplication micro-kernel routines to optimize the inference speed of different workloads. The ML framework provider selects the optimal micro-kernel on your behalf when integrating KleidiAI into their framework, so no extra effort is required when using a framework with KleidiAI.
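For illustration, here is a minimal C++ sketch of the kind of decision a framework makes when it maps a layer’s data types onto one of the three micro-kernel families in the table above. The enum names and the select_matmul_variant function are hypothetical; KleidiAI’s real interface exposes per-variant functions rather than a dispatcher like this.

```cpp
#include <stdexcept>

// Hypothetical data-type tags used only for this sketch.
enum class DType { F16, F32, QA8DX, QS4CX };

// The three micro-kernel families from the table above.
enum class MatMulVariant {
    ClampF16F16F16,       // matmul_clamp_f16_f16_f16
    ClampF32F32F32,       // matmul_clamp_f32_f32_f32
    ClampF32Qa8dxPQs4cxP  // matmul_clamp_f32_qa8dxP_qs4cxP
};

// Sketch of the choice an ML framework makes on your behalf: pick the
// micro-kernel family matching the activation (LHS) and weight (RHS) formats.
MatMulVariant select_matmul_variant(DType lhs, DType rhs) {
    if (lhs == DType::F16 && rhs == DType::F16) return MatMulVariant::ClampF16F16F16;
    if (lhs == DType::F32 && rhs == DType::F32) return MatMulVariant::ClampF32F32F32;
    if (lhs == DType::QA8DX && rhs == DType::QS4CX)
        return MatMulVariant::ClampF32Qa8dxPQs4cxP;
    throw std::invalid_argument("no matching KleidiAI micro-kernel variant");
}
```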
Before diving deep into KleidiAI’s code, it is helpful to see how KleidiAI micro-kernels interact with a GenAI model and an ML framework at a high level. The steps below describe how KleidiAI speeds up matrix multiplication, using the example of a user asking a chatbot app a question on their smartphone. This is the example application stack to analyze:
This is a brief overview, but it is enough to show how KleidiAI interacts with ML frameworks and generative AI models. There are further details and nuances to these processes that will help you understand KleidiAI more deeply.
KleidiAI optimizes for a balance of size, accuracy, and execution speed.
Model weights can be quantized down to INT4 without introducing large errors because, after training, weights typically stay within a range suitable for INT4 quantization. Furthermore, INT4 halves model size compared with INT8 and quarters it compared with FP16, a significant benefit to both memory storage and throughput - especially critical when deploying to constrained devices like smartphones.
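As a rough worked example (assuming a hypothetical 7-billion-parameter model, chosen only for illustration), the arithmetic below shows how the weight storage footprint scales with bits per weight:

```cpp
#include <cstdio>

int main() {
    // Hypothetical model size used only to illustrate the scaling.
    const double params = 7e9;

    const double fp16_gb = params * 16 / 8 / 1e9; // 16 bits per weight
    const double int8_gb = params * 8  / 8 / 1e9; //  8 bits per weight
    const double int4_gb = params * 4  / 8 / 1e9; //  4 bits per weight

    std::printf("FP16: %.1f GB\nINT8: %.1f GB\nINT4: %.1f GB\n",
                fp16_gb, int8_gb, int4_gb);
    // Prints: FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
}
```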
In contrast, neural network activations (described as ‘inputs’ so far) are highly variable and not as evenly distributed across the full data range. For example, a common activation function - ReLU - outputs zero for any negative input and leaves any positive input unchanged. This results in a number distribution with many small values but also occasional large values. Having only 16 distinct values (from INT4 quantization) results in large quantization errors.
As a result, KleidiAI strategically selects INT4 quantization for model parameters and INT8 for the activations (the inputs) propagating through the neural network.
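The snippet below is a minimal sketch of these two schemes, not KleidiAI’s actual quantization code: symmetric INT4 quantization for a block of weights (performed once, offline) and asymmetric INT8 dynamic quantization for activations (performed at run time), which copes better with the skewed value ranges that activation functions such as ReLU produce.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric INT4 quantization of a block of weights (done once, offline).
// Values map onto the 16 levels [-8, 7] with a single per-block scale.
std::vector<int8_t> quantize_weights_int4(const std::vector<float>& w, float& scale) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? max_abs / 7.0f : 1.0f;
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i)
        q[i] = static_cast<int8_t>(std::clamp(std::lround(w[i] / scale), -8L, 7L));
    return q;
}

// Asymmetric INT8 dynamic quantization of activations (done per inference,
// at run time). Assumes a non-empty input block.
std::vector<int8_t> quantize_activations_int8(const std::vector<float>& x,
                                              float& scale, int32_t& zero_point) {
    auto [mn, mx] = std::minmax_element(x.begin(), x.end());
    scale = (*mx > *mn) ? (*mx - *mn) / 255.0f : 1.0f;
    zero_point = static_cast<int32_t>(std::lround(-*mn / scale)) - 128;
    std::vector<int8_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        q[i] = static_cast<int8_t>(std::clamp(
            std::lround(x[i] / scale) + zero_point, -128L, 127L));
    return q;
}
```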
Most models are trained in FP32 or FP16 formats and are by default available to download at that size. To realize the benefits of lower memory footprint, select a pre-quantized version of your desired AI model, ideally in INT4 format. You can quantize models yourself through various tools, but the quickest and easiest way is to locate a pre-quantized version of the same model.
KleidiAI supports matrix multiplication of models across FP32, FP16, INT8, and INT4 formats. Your ML framework provider selects the optimal KleidiAI micro-kernel to balance inference speed and accuracy for your use case.
In short, it is recommended that you select an INT4 pre-quantized model when inference speed is critical, and especially on constrained devices like smartphones.
The goal of KleidiAI’s packing micro-kernels is to prepare the incoming matrices for efficient matrix multiplication. For each matrix multiplication routine there is a corresponding packing routine, which may or may not involve dynamic quantization depending on the incoming number formats.
For example, the SMMLA instruction operates on 8-bit numbers. The role of packing after quantization is to organize two 4-bit integers into a single 8-bit memory space, and the hand-written micro-kernels use SMMLA to operate efficiently on both packed integers at once.
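To make the idea concrete, here is a simplified sketch of packing two signed 4-bit values into one byte. The exact interleaving and blocking used by KleidiAI’s packing micro-kernels is more involved; this shows only the low-nibble/high-nibble idea.

```cpp
#include <cstdint>
#include <vector>

// Pack pairs of INT4 values (stored one per int8_t, in the range [-8, 7])
// so that two weights share a single byte. 8-bit integer instructions such
// as SMMLA can then consume both packed values at once.
std::vector<uint8_t> pack_int4_pairs(const std::vector<int8_t>& q4) {
    std::vector<uint8_t> packed((q4.size() + 1) / 2, 0);
    for (size_t i = 0; i < q4.size(); ++i) {
        uint8_t nibble = static_cast<uint8_t>(q4[i]) & 0x0F; // keep low 4 bits
        if (i % 2 == 0)
            packed[i / 2] |= nibble;        // first value -> low nibble
        else
            packed[i / 2] |= nibble << 4;   // second value -> high nibble
    }
    return packed;
}
```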
The power of KleidiAI stems from its deep understanding of AI workload requirements, quantization techniques, data packing strategies, and advanced Arm instructions, which combine to accelerate and optimize the performance of AI models on Arm CPUs.
The next section dives into the technical details of how KleidiAI delivers these performance uplifts by stepping through a C++ example.