Download and optimize the AFM-4.5B model for Llama.cpp
In this step, you’ll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with Llama.cpp, and generate quantized versions to optimize memory usage and inference speed.
Note: If you want to skip model optimization, pre-converted GGUF versions are available.
Make sure your Python virtual environment is activated before running the commands in this section. These instructions show you how to prepare AFM-4.5B for efficient inference on Google Cloud Axion Arm64 with Llama.cpp.
To download AFM-4.5B, first install the Hugging Face tooling:
pip install huggingface_hub hf_xet --upgrade
This installs:

huggingface_hub: Python client for downloading models and datasets from Hugging Face
hf_xet: a plugin that speeds up transfers of large model files stored on Hugging Face's Xet storage backend

These packages include the hf CLI.
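You can optionally verify that both packages installed and that the hf CLI is on your PATH. This is a quick sanity check, not part of the original steps:

# Confirm the packages are installed and the CLI is available
pip show huggingface_hub hf_xet
hf --help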
hf auth login
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Enter the token you created above, and answer 'n' to the "Add token as git credential? (Y/n)" prompt.
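If you are scripting the setup, you can skip the interactive prompt entirely: the hf CLI and huggingface_hub honor the HF_TOKEN environment variable. The token value below is a placeholder for your own token:

# Non-interactive alternative to hf auth login
export HF_TOKEN=hf_your_token_here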
hf download arcee-ai/afm-4.5B --local-dir models/afm-4-5b/
This command downloads the model into the models/afm-4-5b directory:

arcee-ai/afm-4.5B: the Hugging Face model identifier
--local-dir: the local directory where the model files are saved
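Optionally, confirm the download completed and check how much disk space the model uses:

# List the downloaded files and report the total size
ls -lh models/afm-4-5b/
du -sh models/afm-4-5b/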
python3 convert_hf_to_gguf.py models/afm-4-5b
deactivate
This command converts the downloaded Hugging Face model to GGUF (GGML Universal Format):

convert_hf_to_gguf.py: a conversion script that ships with Llama.cpp
models/afm-4-5b: the input directory containing the Hugging Face model files
afm-4-5B-F16.gguf: the resulting 16-bit model, a ~15GB file written to the same models/afm-4-5b/ directory
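If you prefer to set the output path and precision explicitly, the converter also accepts --outfile and --outtype flags. This sketch should produce the same F16 file as the default invocation above:

python3 convert_hf_to_gguf.py models/afm-4-5b \
    --outfile models/afm-4-5b/afm-4-5B-F16.gguf \
    --outtype f16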
Next, deactivate the Python virtual environment, as the remaining commands won't require it.
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
This command creates a 4-bit quantized version of the model:

llama-quantize: the quantization tool from Llama.cpp
afm-4-5B-F16.gguf: the input GGUF model file in 16-bit precision
Q4_0: applies zero-point 4-bit quantization
afm-4-5B-Q4_0.gguf: the quantized output file

Arm has contributed optimized kernels for Q4_0 that use Neoverse V2 instruction sets. These low-level routines accelerate the core matrix operations, delivering strong performance on Axion.
These instruction sets allow Llama.cpp to run quantized operations significantly faster than generic implementations, making Arm processors a competitive choice for inference workloads.
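You can confirm that your Axion instance exposes the CPU features these kernels rely on by inspecting the flags reported by lscpu. This is a quick check; exact flag names can vary with the kernel version:

# Look for the Arm vector and matrix extensions (SVE/SVE2, INT8 matmul, dot product)
lscpu | grep -iE 'sve|i8mm|asimddp'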
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
This command creates an 8-bit quantized version of the model:

Q8_0: specifies 8-bit quantization with zero-point compression
afm-4-5B-Q8_0.gguf: the quantized output file

As with Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy than 4-bit quantization.
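If you need to regenerate both quantized variants, a small shell loop is equivalent to running the two llama-quantize commands above individually. This is a sketch, assuming the same working directory as the earlier commands:

# Produce the Q4_0 and Q8_0 variants from the F16 model in one pass
for QTYPE in Q4_0 Q8_0; do
    bin/llama-quantize \
        models/afm-4-5b/afm-4-5B-F16.gguf \
        "models/afm-4-5b/afm-4-5B-${QTYPE}.gguf" \
        "$QTYPE"
done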
After completing these steps, you'll have three versions of the AFM-4.5B model in models/afm-4-5b:

afm-4-5B-F16.gguf - the unquantized 16-bit model (~15GB)
afm-4-5B-Q4_0.gguf - the 4-bit quantized version (~4.4GB) for memory-constrained environments
afm-4-5B-Q8_0.gguf - the 8-bit quantized version (~8GB) for balanced performance and memory usage

These models are now ready to use with the Llama.cpp inference engine on Google Cloud Axion Arm64.
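As a final check, list the three GGUF files and confirm their sizes roughly match the figures above:

# All three model variants should appear with their sizes
ls -lh models/afm-4-5b/*.gguf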