About this Learning Path

Skill level:	Introductory
Reading time:	1 hr
Last updated:	17 Jun 2026

Authors:	Anna Mayne, Nikhil Gupta, Marek Michałowski
Arm IP:	Neoverse
Tags:	ML, Linux, vLLM, Python, PyTorch, Hugging Face

Who is this for?

This is an introductory topic for developers interested in running inference on quantized models. In this Learning Path, you'll learn how to run inference on Llama 3.1 8B and Whisper with and without quantization. You'll then benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

What will you learn?

Upon completion of this Learning Path, you will be able to:

Install a recent release of vLLM
Run both quantized and non-quantized variants of Llama 3.1 8B and Whisper using vLLM
Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness

Prerequisites

Before starting, you will need the following:

An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 96 GB RAM, and 64 GB free disk space
Python 3.12 and basic familiarity with Hugging Face Transformers and quantization schemes
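
Before you begin, you may want to confirm that your machine meets the requirements above. The following is a minimal sketch using only the Python standard library; it is not part of the Learning Path itself. The function names (`total_ram_gb`, `gather`, `evaluate`) are hypothetical, and the script assumes decimal gigabytes and treats "Python 3.12" as an exact minor-version match, which may be stricter than intended.

```python
import os
import platform
import shutil
import sys

# Thresholds taken from the prerequisites list above.
MIN_VCPUS = 32
MIN_RAM_GB = 96
MIN_FREE_DISK_GB = 64
REQUIRED_PYTHON = (3, 12)


def total_ram_gb():
    """Return total physical memory in decimal GB, or None if unavailable."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # os.sysconf is not available on every platform
    return page_size * page_count / 1e9


def gather(path="."):
    """Collect the measured values for the current machine."""
    return {
        "arch": platform.machine(),
        "vcpus": os.cpu_count(),
        "ram_gb": total_ram_gb(),
        "free_disk_gb": shutil.disk_usage(path).free / 1e9,
        "python_version": sys.version_info,
    }


def evaluate(arch, vcpus, ram_gb, free_disk_gb, python_version):
    """Compare measured values against the thresholds; return {check: bool}."""
    return {
        "arm64": arch in ("aarch64", "arm64"),
        "vcpus": vcpus is not None and vcpus >= MIN_VCPUS,
        "ram": ram_gb is not None and ram_gb >= MIN_RAM_GB,
        "disk": free_disk_gb >= MIN_FREE_DISK_GB,
        "python": tuple(python_version[:2]) == REQUIRED_PYTHON,
    }


if __name__ == "__main__":
    for name, passed in evaluate(**gather()).items():
        print(f"{name:8s} {'OK' if passed else 'NOT MET'}")
```

Separating `gather` from `evaluate` keeps the threshold logic independent of the machine it runs on, so you can adjust the limits or check a different mount point by passing another `path` to `gather`.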

Run vLLM inference with quantized models and benchmark on Arm servers

Introduction

Set up vLLM

Understand W8A8 quantization for vLLM models

Run inference with vLLM

Evaluate Llama 3.1 8B throughput and accuracy

Next Steps
