Download and optimize the AFM-4.5B model for Llama.cpp
In this step, you’ll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with Llama.cpp, and generate quantized versions to optimize memory usage and inference speed.
Note: If you want to skip model optimization, pre-converted GGUF versions are available.
Make sure your Python virtual environment is activated before running the commands in this section. These instructions show you how to prepare AFM-4.5B for efficient inference on Google Cloud Axion Arm64 with Llama.cpp.
To download AFM-4.5B, first install the Hugging Face tooling:
pip install huggingface_hub hf_xet --upgrade
This installs:

huggingface_hub: Python client for downloading models and datasets from Hugging Face
hf_xet: a plugin that speeds up transfers of large model files stored on Hugging Face's Xet storage backend

These packages include the hf CLI.
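You can optionally verify that both packages installed and that the hf CLI is on your PATH. This is a quick sanity check, not part of the original steps:

# Confirm the packages are installed and the CLI is available
pip show huggingface_hub hf_xet
hf --help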
hf auth login
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Enter the token you created above, and answer 'n' to the "Add token as git credential? (Y/n)" prompt.
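If you are scripting the setup, you can skip the interactive prompt entirely: the hf CLI and huggingface_hub honor the HF_TOKEN environment variable. The token value below is a placeholder for your own token:

# Non-interactive alternative to hf auth login
export HF_TOKEN=hf_your_token_here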
hf download arcee-ai/afm-4.5B --local-dir models/afm-4-5b/
This command downloads the model into the models/afm-4-5b directory:

arcee-ai/afm-4.5B: the Hugging Face model identifier
--local-dir: the local directory where the model files are saved
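Optionally, confirm the download completed and check how much disk space the model uses:

# List the downloaded files and report the total size
ls -lh models/afm-4-5b/
du -sh models/afm-4-5b/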
python3 convert_hf_to_gguf.py models/afm-4-5b
deactivate
This command converts the downloaded Hugging Face model to GGUF (GGML Universal Format):

convert_hf_to_gguf.py: a conversion script that ships with Llama.cpp
models/afm-4-5b: the input directory containing the Hugging Face model files
afm-4-5B-F16.gguf: the resulting 16-bit model, a ~15GB file written to the same models/afm-4-5b/ directory
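If you prefer to set the output path and precision explicitly, the converter also accepts --outfile and --outtype flags. This sketch should produce the same F16 file as the default invocation above:

python3 convert_hf_to_gguf.py models/afm-4-5b \
    --outfile models/afm-4-5b/afm-4-5B-F16.gguf \
    --outtype f16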
Next, deactivate the Python virtual environment, as the remaining commands won't require it.
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
This command creates a 4-bit quantized version of the model:

llama-quantize: the quantization tool from Llama.cpp
afm-4-5B-F16.gguf: the input GGUF model file in 16-bit precision
Q4_0: applies zero-point 4-bit quantization
afm-4-5B-Q4_0.gguf: the quantized output file

Arm has contributed optimized kernels for Q4_0 that use Neoverse V2 instruction sets. These low-level routines accelerate the core matrix operations, delivering strong performance on Axion.
These instruction sets allow Llama.cpp to run quantized operations significantly faster than generic implementations, making Arm processors a competitive choice for inference workloads.
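You can confirm that your Axion instance exposes the CPU features these kernels rely on by inspecting the flags reported by lscpu. This is a quick check; exact flag names can vary with the kernel version:

# Look for the Arm vector and matrix extensions (SVE/SVE2, INT8 matmul, dot product)
lscpu | grep -iE 'sve|i8mm|asimddp'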
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
This command creates an 8-bit quantized version of the model:

Q8_0: specifies 8-bit quantization with zero-point compression
afm-4-5B-Q8_0.gguf: the quantized output file

As with Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy than 4-bit quantization.
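If you need to regenerate both quantized variants, a small shell loop is equivalent to running the two llama-quantize commands above individually. This is a sketch, assuming the same working directory as the earlier commands:

# Produce the Q4_0 and Q8_0 variants from the F16 model in one pass
for QTYPE in Q4_0 Q8_0; do
    bin/llama-quantize \
        models/afm-4-5b/afm-4-5B-F16.gguf \
        "models/afm-4-5b/afm-4-5B-${QTYPE}.gguf" \
        "$QTYPE"
done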
After completing these steps, you'll have three versions of the AFM-4.5B model in models/afm-4-5b:

afm-4-5B-F16.gguf - the unquantized 16-bit model (~15GB)
afm-4-5B-Q4_0.gguf - the 4-bit quantized version (~4.4GB) for memory-constrained environments
afm-4-5B-Q8_0.gguf - the 8-bit quantized version (~8GB) for balanced performance and memory usage

These models are now ready to use with the Llama.cpp inference engine on Google Cloud Axion Arm64.
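As a final check, list the three GGUF files and confirm their sizes roughly match the figures above:

# All three model variants should appear with their sizes
ls -lh models/afm-4-5b/*.gguf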