About this Learning Path

Who is this for?

This is an introductory topic for developers interested in running inference on quantized models. In this Learning Path, you'll learn how to run inference on Llama 3.1-8B and Whisper with and without quantization. You'll then benchmark Llama performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Install a recent release of vLLM
  • Run both quantized and non-quantized variants of Llama 3.1-8B and Whisper using vLLM
  • Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness

Prerequisites

Before starting, you will need the following:

  • An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 96 GB RAM, and 64 GB free disk space
  • Python 3.12 and basic familiarity with Hugging Face Transformers and quantization schemes
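
If you want to confirm that a machine meets these prerequisites before you begin, a minimal sketch such as the following can help. It is not part of the Learning Path itself: the function name `check_prerequisites` and the result keys are hypothetical, and it assumes a Linux host where `os.sysconf` reports physical memory. It also treats the GB figures above as GiB, and because the kernel reserves some memory, a 96 GB machine can report slightly less than 96, so treat a marginal RAM failure as a prompt to check manually.

```python
import os
import platform
import shutil
import sys

# Thresholds taken from the prerequisites list above.
MIN_VCPUS = 32
MIN_RAM_GB = 96
MIN_DISK_GB = 64


def check_prerequisites(path="."):
    """Return a dict mapping each check name (hypothetical keys) to True/False."""
    gib = 1024 ** 3
    # Total physical memory via POSIX sysconf (assumes Linux).
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # Free space on the filesystem that contains `path`.
    free_bytes = shutil.disk_usage(path).free
    return {
        "arm64": platform.machine() in ("aarch64", "arm64"),
        "vcpus": (os.cpu_count() or 0) >= MIN_VCPUS,
        "ram": ram_bytes / gib >= MIN_RAM_GB,
        "disk": free_bytes / gib >= MIN_DISK_GB,
        "python": sys.version_info[:2] == (3, 12),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'NOT met'}")
```

Run the script from the directory where you plan to store model weights, so that the disk check applies to the correct filesystem.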