The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example, MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this Learning Path, you’ll run accuracy tests for both BF16 and INT4 deployments of your model served by vLLM on Arm-based servers.
You will:

- Install lm-evaluation-harness with vLLM support
- Run accuracy benchmarks for the BF16 baseline and the INT4 quantized model
- Compare the results to assess the quality impact of quantization

Results vary based on your CPU, dataset version, and model selection. For a fair comparison between BF16 and INT4, always use the same tasks and few-shot settings.

Before you start:
Install the harness with vLLM extras in your active Python environment:
pip install "lm_eval[vllm]"
pip install ray
If your benchmarks include gated models or datasets, run huggingface-cli login first so the harness can download what it needs.
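As an optional sanity check before launching long benchmarks, you can confirm that the CLI is installed and that the tasks you plan to run are available. The exact listing format varies by harness release:

```bash
# List the tasks the harness knows about and filter for the ones used below.
lm_eval --tasks list | grep -E "mmlu|hellaswag"

# Log in so gated models and datasets can be downloaded.
huggingface-cli login
```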
Export the same performance-oriented environment variables used for serving. These enable Arm-optimized kernels through oneDNN+ACL and consistent thread pinning:
```bash
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```
The LD_PRELOAD entry loads tcmalloc to reduce allocator contention. Install it with `sudo apt-get install -y libtcmalloc-minimal4` if you haven't already.
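To keep your serving and evaluation shells in sync, you can optionally collect these exports in a small script and source it in every shell where you run vLLM or the harness. The file name vllm-cpu-env.sh below is just an illustrative choice:

```bash
# Write the exports to a reusable file; values are evaluated when the
# file is sourced, so $(nproc) reflects the machine you run on.
cat > vllm-cpu-env.sh <<'EOF'
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
EOF

# Load the settings in the current shell before launching lm_eval.
source vllm-cpu-env.sh
```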
First, establish the BF16 baseline with the non-quantized model. Replace the model ID as needed.
```bash
lm_eval \
  --model vllm \
  --model_args \
  pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096,enforce_eager=True \
  --tasks mmlu,hellaswag \
  --batch_size auto \
  --output_path results
```
Next, run accuracy tests on your INT4 quantized model using the same tasks and settings as the BF16 baseline. Use the INT4 quantization recipe and script from the previous steps to quantize meta-llama/Meta-Llama-3.1-8B-Instruct, and replace the model path with your quantized output directory. The example below evaluates the channelwise INT4 (MSE) variant.
```bash
lm_eval \
  --model vllm \
  --model_args \
  pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32,max_model_len=4096,enforce_eager=True \
  --tasks mmlu,hellaswag \
  --batch_size auto \
  --output_path results
```
The run prints per-task accuracy metrics and writes them as JSON under the --output_path directory. Compare these results to your BF16 baseline to evaluate the impact of INT4 quantization on model quality.
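If you prefer to diff the two runs programmatically, you can pull the per-task metrics out of those JSON files. The sketch below assumes the default layout of recent harness releases (a top-level results object keyed by task name); exact filenames, nesting, and metric keys vary by version.

```bash
#!/usr/bin/env bash
# Print the per-task metrics recorded in every results file the harness wrote.
shopt -s globstar nullglob
for f in results/**/*.json; do
  echo "== $f"
  jq '.results' "$f"   # per-task metrics, e.g. accuracy values for mmlu and hellaswag
done
```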
The reported metrics include acc, acc_norm, and exact_match, depending on the task; higher is generally better.
Practical tips:

- Use `--limit 200` to run each task on a subset of examples for a quick smoke test; scores from limited runs are not comparable to full runs.

The results below are illustrative; actual scores vary across hardware, dataset versions, and harness releases. Higher values indicate better accuracy.
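For example, a quick smoke test of the BF16 baseline on a slice of each task might look like this (the results-smoke output directory is just an illustrative name):

```bash
lm_eval \
  --model vllm \
  --model_args \
  pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16,max_model_len=4096,enforce_eager=True \
  --tasks mmlu,hellaswag \
  --limit 200 \
  --batch_size auto \
  --output_path results-smoke
```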
| Variant | MMLU (acc±err) | HellaSwag (acc±err) |
|---|---|---|
| BF16 | 0.5897 ± 0.0049 | 0.7916 ± 0.0041 |
| INT4 Groupwise minmax (G=32) | 0.5831 ± 0.0049 | 0.7819 ± 0.0041 |
| INT4 Channelwise MSE | 0.5712 ± 0.0049 | 0.7633 ± 0.0042 |
Use these as ballpark expectations to check whether your runs are in a reasonable range, not as official targets.
Now that you’ve completed accuracy benchmarking for both BF16 and INT4 models on Arm-based servers, you’re ready to deepen your evaluation and optimize for your specific use case. Expanding your benchmarks to additional tasks, such as gsm8k, winogrande, arc_easy, and arc_challenge, helps you understand model performance across a wider range of scenarios. Experimenting with different quantization recipes lets you balance accuracy and throughput for your workload.

You’ve learned how to set up lm-evaluation-harness, run benchmarks for BF16 and INT4 models, and interpret key accuracy metrics on Arm platforms. Great job reaching this milestone. Your results will help you make informed decisions about model deployment and optimization!