In this step, you’ll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with Llama.cpp, and generate quantized versions to optimize memory usage and inference speed.

Note: If you want to skip the conversion and quantization steps below, pre-converted GGUF versions are available.

Make sure your Python virtual environment is activated before running commands. These instructions show you how to prepare AFM-4.5B for efficient inference on Google Cloud Axion Arm64 with Llama.cpp.

Sign up for Hugging Face

To download AFM-4.5B, you need to:

  • Create a free Hugging Face account if you don’t already have one.
  • Generate an access token at https://huggingface.co/settings/tokens; you’ll use it to log in from the command line below.

Install Hugging Face libraries

pip install huggingface_hub hf_xet --upgrade

This installs:

  • huggingface_hub: Python client for downloading models and datasets
  • hf_xet: companion package that speeds up downloads of large model files through Hugging Face’s Xet storage backend

Installing huggingface_hub also provides the hf command-line interface (CLI) used in the steps below.
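
To confirm the CLI is available on your PATH, you can print its help text. This is only a sanity check; any recent huggingface_hub release installs the hf entry point:

hf --help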

Log in to Hugging Face Hub

hf auth login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):

Please enter the token you created above, and answer ’n’ to “Add token as git credential? (Y/n)”.
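
If you prefer a non-interactive login, for example in a provisioning script, huggingface_hub also reads the token from the HF_TOKEN environment variable. The value below is a placeholder; substitute the token you generated earlier:

# Alternative to hf auth login: export the token for the current shell session
export HF_TOKEN=<your-access-token>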

Download AFM-4.5B from Hugging Face

hf download arcee-ai/afm-4.5B --local-dir models/afm-4-5b/

This command downloads the model to the models/afm-4-5b directory:

  • arcee-ai/afm-4.5B is the Hugging Face model identifier.
  • The download includes the model weights, configuration files, and tokenizer data.
  • This is a 4.5 billion parameter model, so the download can take several minutes depending on your internet connection; once it finishes, you can verify the downloaded files as shown below.
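
To confirm the download completed, list the target directory. You should see the configuration files, tokenizer data, and weight files described above:

ls -lh models/afm-4-5b/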

Convert AFM-4.5B to GGUF format

python3 convert_hf_to_gguf.py models/afm-4-5b
deactivate

This command converts the downloaded Hugging Face model to GGUF (GGML Universal Format):

  • convert_hf_to_gguf.py is a conversion script that comes with Llama.cpp.
  • models/afm-4-5b is the input directory containing the Hugging Face model files.
  • The script reads the model architecture, weights, and configuration from the Hugging Face format.
  • It outputs a single ~15GB file, afm-4-5B-F16.gguf, in the same models/afm-4-5b/ directory.
  • GGUF is the native format for Llama.cpp, optimized for efficient loading and inference.

The deactivate command exits the Python virtual environment, since the remaining steps don’t require it.
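
Before quantizing, you can confirm that the conversion produced the expected output; the file size should be roughly the ~15GB noted above:

ls -lh models/afm-4-5b/afm-4-5B-F16.gguf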

Create a Q4_0 quantized version

bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0

This command creates a 4-bit quantized version of the model:

  • llama-quantize is the quantization tool from Llama.cpp.
  • afm-4-5B-F16.gguf is the input GGUF model file in 16-bit precision.
  • Q4_0 applies 4-bit block quantization, in which small blocks of weights share a single scale factor.
  • This reduces the model size by approximately 70% (from ~15GB to ~4.4GB).
  • The quantized model will use less memory and run faster, though with a small reduction in accuracy.
  • The output file will be afm-4-5B-Q4_0.gguf; you can give it a quick test as shown below.
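
If you want to sanity-check the quantized model before moving on, run a short prompt through it. This assumes llama-cli was built into the same bin/ directory as llama-quantize; adjust the path if your build layout differs:

# Generate a few tokens with the Q4_0 model as a quick smoke test
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -p "Hello, my name is" -n 32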

Arm optimizations for quantized models

Arm has contributed optimized kernels for Q4_0 that take advantage of Neoverse V2 instruction-set features such as the NEON dot-product and 8-bit integer matrix multiply (i8mm) extensions. These low-level routines accelerate the matrix multiplications at the core of inference, delivering strong performance on Axion.

These instruction sets allow Llama.cpp to run quantized operations significantly faster than generic implementations, making Arm processors a competitive choice for inference workloads.
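
To check that these extensions are exposed inside your Axion VM, inspect the CPU flags reported by the kernel; on a Neoverse V2-based instance you should see features such as asimddp, i8mm, and sve in the output:

# Show the CPU feature flags for the Arm cores
lscpu | grep -i flags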

Create a Q8_0 quantized version

bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0

This command creates an 8-bit quantized version of the model:

  • Q8_0 specifies 8-bit block quantization, again using a shared scale factor per block of weights.
  • This reduces the model size by approximately 45% (from ~15GB to ~8GB).
  • The 8-bit version provides a better balance between memory usage and accuracy than 4-bit quantization.
  • The output file is named afm-4-5B-Q8_0.gguf.
  • Q8_0 is commonly used in production scenarios where enough memory is available; you can compare the sizes of all three files as shown below.
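
After both quantization runs, you can list the full-precision and quantized files side by side; the sizes should roughly match the reductions quoted above:

ls -lh models/afm-4-5b/*.gguf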

Arm optimization

Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.

AFM-4.5B models ready for inference

After completing these steps, you’ll have three versions of the AFM-4.5B model in models/afm-4-5b:

  • afm-4-5B-F16.gguf - The original full-precision model (~15GB)
  • afm-4-5B-Q4_0.gguf - 4-bit quantized version (~4.4GB) for memory-constrained environments
  • afm-4-5B-Q8_0.gguf - 8-bit quantized version (~8GB) for balanced performance and memory usage

These models are now ready to use with the Llama.cpp inference engine on Google Cloud Axion Arm64.
