Run distributed inference with llama.cpp on Arm-based AWS Graviton4 instances

About this Learning Path

Who is this for?

This introductory topic is for developers with some experience using llama.cpp who want to learn how to run distributed inference on Arm-based servers.

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Set up a main host and worker nodes with llama.cpp
  • Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference across Arm machines (see the sketch after this list)
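Later sections walk through each step in detail, but as a rough preview, llama.cpp distributes inference over its RPC backend: each worker node runs the rpc-server binary, and the main host lists the workers in llama-cli's --rpc option. The sketch below is a minimal outline, assuming llama.cpp is built with -DGGML_RPC=ON; the worker hostnames, port, and model file name are placeholders.

```bash
# Worker nodes: build llama.cpp with the RPC backend enabled, then
# start an RPC server exposing this machine's CPU backend.
# (Address and port are placeholders -- adjust for your network.)
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# Main host: run inference, offloading model layers to the workers
# listed in --rpc (worker hostnames and the GGUF file are placeholders).
./build/bin/llama-cli \
  -m llama-3.1-405b-q4_0.gguf \
  --rpc worker1:50052,worker2:50052 \
  -ngl 99 \
  -p "Hello from distributed llama.cpp"
```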

Prerequisites

Before starting, you will need the following:

  • Multiple Arm-based AWS Graviton4 instances with SSH access and network connectivity between them
  • Some familiarity with llama.cpp and the Linux command line
