About this Learning Path

Who is this for?

This is an introductory topic for developers interested in building and optimizing vLLM for Arm-based servers. This Learning Path shows you how to quantize large language models (LLMs) to INT4, serve them efficiently using an OpenAI-compatible API, and benchmark model accuracy with the LM Evaluation Harness.

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL)
  • Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries
  • Quantize an LLM (DeepSeek-V2-Lite) to 4-bit integer (INT4) precision
  • Run and serve both quantized and BF16 (non-quantized) variants using vLLM
  • Use OpenAI-compatible endpoints and understand sequence and batch limits (a minimal client sketch follows this list)
  • Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM
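For example, once a model is being served later in this Learning Path, you can exercise the OpenAI-compatible endpoint with the standard openai Python client. The sketch below is illustrative, not prescriptive: it assumes vLLM is serving on its default port 8000 and that the model is registered as deepseek-ai/DeepSeek-V2-Lite; adjust the base URL and model name to match your deployment.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible HTTP server. The base_url and model
# name below are assumptions -- match them to the host, port, and
# --served-model-name you use when launching vLLM.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite",
    messages=[{"role": "user", "content": "In one sentence, what is INT4 quantization?"}],
    max_tokens=64,  # stay within the sequence limits configured on the server
)
print(response.choices[0].message.content)
```

Because the endpoint follows the OpenAI API shape, any OpenAI-compatible client or tool can be pointed at the server in the same way.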

Prerequisites

Before starting, you will need the following:

  • An Arm-based Linux server (Ubuntu 22.04 or later recommended) with a minimum of 32 vCPUs, 64 GB of RAM, and 64 GB of free disk space (a quick self-check sketch follows this list)
  • Python 3.12 installed on the server
  • Basic familiarity with Hugging Face Transformers and model quantization
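If you want a quick sanity check before starting, the short sketch below reads the host's architecture, CPU count, memory, and free disk space and compares them with the minimums above. It is a convenience script written for this list, not part of the Learning Path tooling, and it assumes a Linux host because it reads /proc/meminfo.

```python
import os
import platform
import shutil
import sys

# Convenience self-check against the minimums listed above (Linux only).
assert platform.machine() == "aarch64", "an Arm (aarch64) server is expected"
assert sys.version_info[:2] >= (3, 12), "Python 3.12 or newer is expected"

vcpus = os.cpu_count() or 0
with open("/proc/meminfo") as f:
    mem_gb = int(f.readline().split()[1]) / 2**20  # MemTotal is reported in kB
disk_gb = shutil.disk_usage("/").free / 2**30

print(f"vCPUs: {vcpus} (minimum 32)")
print(f"RAM: {mem_gb:.0f} GB (minimum 64)")
print(f"Free disk: {disk_gb:.0f} GB (minimum 64)")
```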