Review your AFM-4.5B deployment on Google Cloud Axion

Congratulations! You have successfully deployed the AFM-4.5B foundation model on Google Cloud Axion Arm64.

Here’s a summary of what you built and how to extend it.

Using this Learning Path, you have:

  • Launched an Axion-powered Google Cloud instance – you set up a C4A instance running Ubuntu 24.04 LTS, leveraging Arm-based compute for optimal price–performance.

  • Configured the development environment – you installed tools and dependencies, including Git, build tools, and Python packages for machine learning workloads.

  • Built Llama.cpp from source – you compiled the inference engine specifically for the Arm64 architecture to maximize performance on Axion.

  • Downloaded and optimized AFM-4.5B – you retrieved the 4.5-billion-parameter Arcee Foundation Model, converted it to the GGUF format, and created quantized versions (8-bit and 4-bit) to reduce memory usage and improve speed.

  • Ran inference and evaluation – you tested the model using interactive sessions and API endpoints, and benchmarked speed, memory usage, and model quality (a minimal API example follows this list).
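To tie the last step together, here is a minimal sketch of querying the deployed model from Python over the llama-server REST API. It assumes a server is already running locally on port 8080 with one of your quantized GGUF files loaded; the host, port, prompt, and generation parameters are placeholder assumptions, and the OpenAI-compatible /v1/chat/completions route reflects recent llama-server builds.

```python
# Minimal sketch: query a running llama-server instance over its
# OpenAI-compatible REST API (stdlib only, no extra dependencies).
# Host, port, and generation parameters below are assumptions.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # assumed default port

payload = {
    "messages": [
        {"role": "user", "content": "Summarize Google Cloud Axion in one sentence."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# The response follows the OpenAI chat-completions schema.
print(body["choices"][0]["message"]["content"])
```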

Key performance insights

The benchmarking results demonstrate the power of quantization and Arm-based computing (a quick arithmetic check of the memory figures follows this list):

  • Memory efficiency – the 4-bit model uses only ~3 GB of RAM compared to ~9 GB for the full-precision version
  • Speed improvements – Q4_0 inference runs roughly 2.5x faster than the full-precision model (60+ tokens/sec vs. ~25 tokens/sec)
  • Cost optimization – lower memory needs enable smaller, more affordable instances
  • Quality preservation – the quantized models maintain strong perplexity scores, showing minimal quality loss
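The memory figures above can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times bits per weight, divided by eight, plus runtime overhead for the KV cache and activations. The bits-per-weight values below include approximate GGUF block-scale overhead and are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope check of the memory numbers above.
# Bits-per-weight values are approximations that include GGUF
# block-scale overhead; actual footprints vary by build and context size.
PARAMS = 4.5e9  # AFM-4.5B parameter count

def approx_weight_gb(bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [
    ("F16 (full precision)", 16.0),
    ("Q8_0 (8-bit)", 8.5),
    ("Q4_0 (4-bit)", 4.5),
]:
    print(f"{label:22s} ~{approx_weight_gb(bits):.1f} GB of weights")
```

Running this gives roughly 9.0, 4.8, and 2.5 GB respectively, which lines up with the ~9 GB and ~3 GB observations once KV cache and activation overhead are added.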

Benefits of Google Cloud Axion

Google Cloud Axion processors, based on Arm Neoverse V2, provide:

  • Better performance per watt than x86 alternatives
  • 20–40% cost savings for compute-intensive workloads
  • Optimized memory bandwidth and cache hierarchy for ML tasks
  • Native Arm64 support for modern machine learning frameworks

Next steps with AFM-4.5B on Axion

Now that you have a working deployment, you can extend it further.

Production deployment:

  • Add auto-scaling for high availability
  • Implement load balancing for multiple instances
  • Enable monitoring and logging with Cloud Monitoring and Cloud Logging
  • Secure API endpoints with authentication (a health-probe sketch follows this list)
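As a starting point for the monitoring and authentication items above, here is a sketch of a liveness probe you could wire into a Cloud Monitoring uptime check or a load balancer health check. It assumes the /health endpoint and the --api-key option available in recent llama-server builds; the URL and key are placeholders, so verify both against your server version.

```python
# Sketch: liveness probe for a llama-server instance, suitable for
# a Cloud Monitoring uptime check or load-balancer health check.
# The /health route, host, port, and API key are assumptions.
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # placeholder host/port
API_KEY = "change-me"                        # placeholder credential

req = urllib.request.Request(
    HEALTH_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
)

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("healthy" if resp.status == 200 else f"unexpected status {resp.status}")
except urllib.error.HTTPError as exc:
    print(f"unhealthy: HTTP {exc.code}")   # e.g. 503 while the model is loading
except urllib.error.URLError as exc:
    print(f"unreachable: {exc.reason}")
```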

Application development:

  • Build a web app with the llama-server API
  • Create a chatbot or assistant (see the sketch after this list)
  • Develop content generation tools
  • Integrate AFM-4.5B into existing apps via REST APIs
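For the chatbot idea above, a terminal prototype only needs a loop that appends each turn to the message history so the model sees prior context. This sketch reuses the assumed local llama-server endpoint from the earlier example.

```python
# Sketch: minimal terminal chatbot on top of the llama-server REST API.
# Endpoint and generation parameters are assumptions; the history list
# grows unbounded here, so a real app would trim or summarize it.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint
history = [{"role": "system", "content": "You are a concise, helpful assistant."}]

while True:
    user = input("you> ").strip()
    if user.lower() in ("quit", "exit"):
        break
    history.append({"role": "user", "content": user})
    req = urllib.request.Request(
        URL,
        data=json.dumps({"messages": history, "max_tokens": 256}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print(f"afm> {reply}")
```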

Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and Google Cloud Axion provide a scalable, cost-efficient platform for AI.

From chatbots and content generation to research tools, this stack delivers a balance of performance, cost, and developer control.

For more information about Arcee AI and how to build high-quality, secure, and cost-efficient AI solutions, visit www.arcee.ai.
