Who is this for?
This is an introductory topic for developers interested in building and optimizing vLLM for Arm-based servers. This Learning Path shows you how to quantize large language models (LLMs) to INT4, serve them efficiently using an OpenAI-compatible API, and benchmark model accuracy with the LM Evaluation Harness.
What will you learn?
Upon completion of this Learning Path, you will be able to:
- Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL)
- Set up all runtime dependencies, including PyTorch, llmcompressor, and Arm-optimized libraries
- Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision
- Run and serve both quantized and BF16 (non-quantized) variants using vLLM
- Use OpenAI‑compatible endpoints and understand sequence and batch limits (a short example of calling such an endpoint follows this list)
- Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM
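vLLM's serving mode exposes a standard OpenAI-compatible HTTP API, so any OpenAI client can talk to the model you serve. As a preview, the sketch below shows one way to query such an endpoint with the `openai` Python client; the port, base URL, and model name are illustrative assumptions, and the actual values depend on how you launch the server later in this Learning Path.

```python
# Minimal sketch: query an OpenAI-compatible vLLM endpoint.
# Assumptions: a vLLM server is already running locally on port 8000
# (vLLM's default) and serves the model name used below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server (assumed)
    api_key="EMPTY",                      # vLLM accepts a placeholder key by default
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model name
    messages=[{"role": "user", "content": "Explain INT4 quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```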
Prerequisites
Before starting, you will need the following:
- An Arm-based Linux server (Ubuntu 22.04 or later recommended) with at least 32 vCPUs, 64 GB of RAM, and 64 GB of free disk space; a quick way to verify these values is shown after this list
- Python 3.12 installed
- Basic familiarity with Hugging Face Transformers and model quantization
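If you want to confirm that your machine meets these requirements before starting, the optional snippet below prints the relevant values. It is a minimal sketch that assumes only a Linux system with a standard Python 3 installation.

```python
# Optional sanity check for the prerequisites listed above (illustrative only).
import os
import platform
import shutil
import sys

print("Architecture:", platform.machine())        # expect 'aarch64' on an Arm server
print("Python version:", sys.version.split()[0])  # expect 3.12.x
print("vCPUs:", os.cpu_count())                   # expect >= 32

total, used, free = shutil.disk_usage("/")
print("Free disk (GB):", free // (1024 ** 3))     # expect >= 64

with open("/proc/meminfo") as f:
    mem_total_kb = int(f.readline().split()[1])   # first line is 'MemTotal: ... kB'
print("RAM (GB):", mem_total_kb // (1024 ** 2))   # expect >= 64
```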