Understand W8A8 quantization for vLLM models

Run vLLM inference with quantized models and benchmark on Arm servers


How W8A8 quantization works

Quantized models have their weights converted to a lower-precision data type, which reduces the model’s memory requirements and can improve performance significantly. Many quantized versions of popular models are publicly available, such as RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 and RedHatAI/whisper-large-v3-quantized.w8a8, which you’ll use in this Learning Path.
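As a rough illustration of the memory saving, the following sketch estimates weight storage for a model at 16 and 8 bits per weight. The parameter count is a nominal assumption, and the estimate ignores layers that stay in higher precision, the quantization scales, and runtime memory such as the KV cache, so treat the numbers as approximate.

```python
# Rough weight-memory estimate. NUM_PARAMS is a nominal assumption,
# not a measured value for any specific checkpoint.
NUM_PARAMS = 8_000_000_000


def weight_gib(num_params: int, bits_per_weight: int) -> float:
    """Return approximate weight storage in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


bf16_gib = weight_gib(NUM_PARAMS, 16)
int8_gib = weight_gib(NUM_PARAMS, 8)

print(f"16-bit weights: ~{bf16_gib:.1f} GiB")
print(f" 8-bit weights: ~{int8_gib:.1f} GiB")
```

Halving the bits per weight halves the weight storage, which is where most of the memory reduction comes from.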

The notation w8a8 means that the weights are quantized to 8-bit integers and the activations (the inputs to each quantized layer) are dynamically quantized to 8-bit integers at inference time. With both operands in INT8, the matrix multiplications can use Arm’s 8-bit integer matrix multiply feature, I8MM. For more information, see the KleidiAI and matrix multiplication Learning Path.
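To make the idea concrete, here is a minimal pure-Python sketch of 8-bit quantization followed by an integer dot product. It is an illustration only, not how vLLM or KleidiAI implement it: the helper names are hypothetical, and it assumes symmetric quantization for both operands for simplicity (an asymmetric scheme would also use a zero point).

```python
def quantize_int8(values):
    """Symmetric quantization: derive the scale from the data itself."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


activations = [0.02, -1.27, 0.5, 0.8]  # example values, chosen arbitrarily
weights = [0.1, -0.2, 0.3, 0.4]

qa, sa = quantize_int8(activations)  # scale computed at run time ("dynamic")
qw, sw = quantize_int8(weights)      # in practice computed once, ahead of time

# Integer multiply-accumulate, the kind of work that 8-bit integer
# matrix-multiply instructions accelerate. Rescale once at the end.
int_acc = sum(a * w for a, w in zip(qa, qw))
approx = int_acc * sa * sw

exact = sum(a * w for a, w in zip(activations, weights))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer result, multiplied by the two scales, approximates the floating-point dot product. The rounding error of each element is at most half of its scale.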

The w8a8 models that you’ll use in this Learning Path apply quantization only to the weights and activations in the linear layers of the transformer blocks. Activations are quantized per token and weights are quantized per channel. That is, each output channel of a weight matrix has its own scaling factor for converting between the INT8 and BF16 representations.
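The following sketch shows one way per-token and per-channel scaling can fit together in a linear layer. It is a simplified illustration under stated assumptions: the function names are hypothetical, quantization is symmetric, and Python floats stand in for BF16.

```python
def row_scales(matrix):
    """One symmetric INT8 scale per row: max(|row|) / 127."""
    scales = []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scales.append(max_abs / 127 if max_abs > 0 else 1.0)
    return scales


def quantize_rows(matrix, scales):
    """Quantize each row with its own scale, clamping to [-127, 127]."""
    return [[max(-127, min(127, round(v / s))) for v in row]
            for row, s in zip(matrix, scales)]


def int8_linear(xq, x_scales, wq, w_scales):
    """Integer matmul, then rescale each output by token scale * channel scale."""
    out = []
    for x_row, sx in zip(xq, x_scales):
        out_row = []
        for w_row, sw in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulate
            out_row.append(acc * sx * sw)
        out.append(out_row)
    return out


# Activations: one row per token, shape [tokens, in_features].
X = [[0.5, -1.0, 0.25],
     [8.0, 4.0, -2.0]]
# Weights: one row per output channel, shape [out_features, in_features].
W = [[0.1, 0.2, -0.3],
     [2.0, -1.0, 0.5]]

x_scales = row_scales(X)  # per-token scales
w_scales = row_scales(W)  # per-channel scales
Y_approx = int8_linear(quantize_rows(X, x_scales), x_scales,
                       quantize_rows(W, w_scales), w_scales)
Y_exact = [[sum(a * b for a, b in zip(x, w)) for w in W] for x in X]

for approx_row, exact_row in zip(Y_approx, Y_exact):
    print([round(v, 3) for v in approx_row], [round(v, 3) for v in exact_row])
```

Giving each token and each output channel its own scale means that a row with large values does not force a coarse scale onto rows with small values, which helps limit the rounding error.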

(Optional) Quantize your own models

Note

This step is optional. You’ll be using pre-quantized models from Hugging Face for the rest of this Learning Path, so you don’t need to run this recipe. Quantizing a model yourself can take several hours.

To learn more about quantizing your own models, see the Run vLLM inference with INT4 quantization on Arm servers Learning Path.

If you prefer to generate your own w8a8 quantized model rather than using the pre-quantized Red Hat models, you can use the following recipe. Install the required packages before running the quantization script:

Note

The following commands use specific package versions that were tested with this recipe. To find the latest versions, see llmcompressor, compressed-tensors, and datasets on GitHub.

    

        
        
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0

The script uses the GPTQ post-training quantization algorithm to calibrate the weight quantization. It loads 256 samples from a calibration dataset, runs them through the model, and computes per-channel weight scales for each linear layer. Because activation quantization is dynamic, the per-token activation scales are computed at inference time rather than stored with the model. The output is saved as a quantized model in the Meta-Llama-3.1-8B-quantized.w8a8 directory.

Create a file named w8a8_quant.py with the following content:

    

        
        
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy
 
model_id = "meta-llama/Meta-Llama-3.1-8B"  # Note: this uses the Meta-prefixed model ID required by llmcompressor
 
num_samples = 256
max_seq_len = 4096
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
def preprocess_fn(example):
  return {"text": example["text"]}
 
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)
 
scheme = {
        "targets": ["Linear"],
        "weights": {
            "num_bits": 8,
            "type": QuantizationType.INT,
            "strategy": QuantizationStrategy.CHANNEL,
            "symmetric": True,
            "dynamic": False,
            "group_size": None
        },
        "input_activations":
            {
            "num_bits": 8,
            "type": QuantizationType.INT,
            "strategy": QuantizationStrategy.TOKEN,
            "dynamic": True,
            "symmetric": False,
            "observer": None,
        },
        "output_activations": None,
}
 
recipe = GPTQModifier(
  targets="Linear",
  config_groups={"group_0": scheme},
  ignore=["lm_head"],
  dampening_frac=0.01,
  block_size=512,
)
 
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  device_map="auto",
  trust_remote_code=True,
)
 
oneshot(
  model=model,
  dataset=ds,
  recipe=recipe,
  max_seq_length=max_seq_len,
  num_calibration_samples=num_samples,
)
model.save_pretrained("Meta-Llama-3.1-8B-quantized.w8a8")

Run the script. This step can take several hours depending on your hardware:

    

        
        
python w8a8_quant.py

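When quantization completes, copy the tokenizer files from the original model into your quantized model directory before running inference. The command assumes the default Hugging Face cache location and skips any file that the original model doesn’t provide: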

    

        
        
for f in tokenizer.json tokenizer_config.json special_tokens_map.json tokenizer.model; do
  cp ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/snapshots/*/"$f" Meta-Llama-3.1-8B-quantized.w8a8/ 2>/dev/null || true
done
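Because the copy command suppresses errors, it can be useful to check which files actually arrived. The following helper is a hypothetical sketch, not part of the original recipe. It reports which of the file names from the command above are absent from the output directory. Not every model ships all four files, so a missing entry is not necessarily a problem.

```python
from pathlib import Path

# File names taken from the copy command above.
EXPECTED_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "tokenizer.model",
)


def missing_tokenizer_files(model_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_tokenizer_files("Meta-Llama-3.1-8B-quantized.w8a8")
    print("Missing tokenizer files:", ", ".join(missing) if missing else "none")
```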

What you’ve accomplished and what’s next

You’ve learned how W8A8 quantization works and, optionally, how to quantize a model yourself.

Next, you’ll use vLLM to run inference on both quantized and non-quantized models and compare their outputs.
