Build Llama.cpp on Google Cloud Axion Arm64
In this step, you’ll build Llama.cpp from source. Llama.cpp is a high-performance C/C++ inference engine for large language models, optimized for inference on multiple hardware platforms, including Arm64 processors such as Google Cloud Axion.
Although AFM-4.5B uses a custom architecture, you can use the standard Llama.cpp repository. Arcee AI has contributed the required modeling code upstream.
git clone https://github.com/ggerganov/llama.cpp
This command clones the Llama.cpp repository from GitHub. The repository includes source code, build scripts, and documentation.
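If you want to confirm that AFM support is present in your checkout, you can search the source tree for the Arcee architecture name. This is only a quick sanity check, and the file layout may differ between Llama.cpp releases:

# Look for the Arcee (AFM) architecture in the cloned repository.
# File names and locations may vary across llama.cpp versions.
grep -ril "arcee" src/ convert_hf_to_gguf.py | head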
cd llama.cpp
Move into the `llama.cpp` directory to run the build process. This directory contains the `CMakeLists.txt` file and all source code.
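Before configuring the build, it’s worth checking that CMake and a C/C++ compiler are installed. The commands below assume a GCC-based toolchain on a standard Linux image; the exact versions reported will vary:

# Confirm the build toolchain is available (versions shown will differ).
cmake --version
gcc --version
g++ --version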
cmake -B .
This configures the build system using CMake:
- `-B .` generates the build files in the current directory

On Google Cloud Axion, the output should show hardware-specific optimizations for the Neoverse V2 architecture:
-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+dotprod+i8mm+sve
These flags confirm that the build enables advanced Arm64 CPU features such as dot-product instructions (DOTPROD), the Scalable Vector Extension (SVE), and INT8 matrix multiplication (MATMUL_INT8), which accelerate quantized LLM inference.
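If CMake does not report these features, you can check what the instance itself exposes. This is a minimal check using standard Linux tools; the flag names come from the kernel, not from Llama.cpp:

# Inspect the CPU features advertised by the kernel on the Axion instance.
# Look for flags such as sve, sve2, i8mm, and asimddp in the output.
grep -m1 "Features" /proc/cpuinfo
lscpu | grep -i "flags"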
cmake --build . --config Release -j16
This compiles Llama.cpp with the following options:
- `--build .` builds in the current directory
- `--config Release` enables compiler optimizations
- `-j16` runs 16 parallel jobs for faster compilation on multi-core Axion systems

The build produces Arm64-optimized binaries in under a minute.
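If your instance has a different vCPU count, you can let the shell choose the job count instead of hard-coding 16. This is an optional variant of the same build command, assuming a standard Linux environment:

# Use one build job per available vCPU instead of a fixed -j16.
cmake --build . --config Release -j"$(nproc)"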
After compilation, you’ll find key tools in the `bin` directory:
- `llama-cli`: main inference executable
- `llama-server`: HTTP server for model inference
- `llama-quantize`: tool for quantizing models to reduce memory usage

See the Llama.cpp GitHub repository for details.
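To confirm the build succeeded, you can list the binaries and print the version string. The `--version` flag is assumed to be available in your build; if not, `--help` gives the same confirmation:

# List the compiled tools and confirm llama-cli runs.
ls bin/llama-cli bin/llama-server bin/llama-quantize
./bin/llama-cli --version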
These binaries are optimized for Arm64 and provide excellent performance on Google Cloud Axion.