Who is this for?
This is an introductory topic for developers interested in building and optimizing vLLM for Arm-based servers. This Learning Path shows you how to quantize large language models (LLMs) to INT4, serve them efficiently using an OpenAI-compatible API, and benchmark model accuracy with the LM Evaluation Harness.
What will you learn?
Upon completion of this Learning Path, you will be able to:
- Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL)
- Set up all runtime dependencies, including PyTorch, llmcompressor, and Arm-optimized libraries
- Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision
- Run and serve both quantized and BF16 (non-quantized) variants using vLLM
- Use OpenAI‑compatible endpoints and understand sequence and batch limits (a short example of calling such an endpoint follows this list)
- Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM
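vLLM's serving mode exposes a standard OpenAI-compatible HTTP API, so any OpenAI client can talk to the model you serve. As a preview, the sketch below shows one way to query such an endpoint with the `openai` Python client; the port, base URL, and model name are illustrative assumptions, and the actual values depend on how you launch the server later in this Learning Path.

```python
# Minimal sketch: query an OpenAI-compatible vLLM endpoint.
# Assumptions: a vLLM server is already running locally on port 8000
# (vLLM's default) and serves the model name used below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server (assumed)
    api_key="EMPTY",                      # vLLM accepts a placeholder key by default
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model name
    messages=[{"role": "user", "content": "Explain INT4 quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```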
Prerequisites
Before starting, you will need the following:
- An Arm-based Linux server (Ubuntu 22.04 or later recommended) with at least 32 vCPUs, 64 GB of RAM, and 64 GB of free disk space; a quick way to verify these values is shown after this list
- Python 3.12 installed
- Basic familiarity with Hugging Face Transformers and model quantization
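If you want to confirm that your machine meets these requirements before starting, the optional snippet below prints the relevant values. It is a minimal sketch that assumes only a Linux system with a standard Python 3 installation.

```python
# Optional sanity check for the prerequisites listed above (illustrative only).
import os
import platform
import shutil
import sys

print("Architecture:", platform.machine())        # expect 'aarch64' on an Arm server
print("Python version:", sys.version.split()[0])  # expect 3.12.x
print("vCPUs:", os.cpu_count())                   # expect >= 32

total, used, free = shutil.disk_usage("/")
print("Free disk (GB):", free // (1024 ** 3))     # expect >= 64

with open("/proc/meminfo") as f:
    mem_total_kb = int(f.readline().split()[1])   # first line is 'MemTotal: ... kB'
print("RAM (GB):", mem_total_kb // (1024 ** 2))   # expect >= 64
```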