In this Learning Path, you’ll run quantized Phi models with ONNX Runtime on Microsoft Azure Cobalt 100 servers.
Specifically, you’ll deploy the Phi-3.5 vision model on Arm-based servers running Ubuntu 24.04 LTS.
These instructions have been tested on an Azure Dpls_v6 instance with 32 cores, 64GB of RAM, and 32GB of disk space.
You will learn how to build and configure ONNX Runtime to enable efficient LLM inference on Arm CPUs.
This Learning Path walks you through installing the required system and Python dependencies, building the onnxruntime-genai package with the KleidiAI-optimized ONNX Runtime, and downloading the INT4-quantized Phi-3.5 vision model.
On your Arm-based server, install the following packages:
sudo apt update
sudo apt install python3-pip python3-venv cmake -y
Use a file editor of your choice and create a requirements.txt
file with the Python packages shown below:
requests
torch
transformers
accelerate
huggingface-hub
Create a virtual environment:
python3 -m venv onnx-env
Activate the virtual environment:
source onnx-env/bin/activate
Install the required libraries using pip:
pip install -r requirements.txt
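As an optional sanity check, you can confirm that the core dependencies from requirements.txt import correctly inside the activated virtual environment. The sketch below only checks the packages listed above:

```python
# Optional sanity check: confirm the core Python dependencies are importable.
# Run this inside the activated onnx-env virtual environment.
import torch
import transformers

print("torch version:", torch.__version__)
print("transformers version:", transformers.__version__)
```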
Clone and build the onnxruntime-genai repository, which includes the KleidiAI-optimized ONNX Runtime, using the following commands:
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai/
python3 build.py --config Release
cd build/Linux/Release/wheel/
pip install onnxruntime_genai-0.9.0.dev0-cp312-cp312-linux_aarch64.whl
Ensure you’re using Python 3.12 to match the cp312 wheel format.
This build includes optimizations from KleidiAI for efficient inference on Arm CPUs.
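To verify that the locally built wheel installed correctly, you can try importing the package from the same virtual environment. This is an optional check; the version attribute lookup is written defensively in case your build does not expose it:

```python
# Optional check: confirm the locally built onnxruntime-genai wheel imports cleanly.
import onnxruntime_genai as og

# Fall back gracefully if the installed build does not expose a version attribute.
print("onnxruntime-genai version:", getattr(og, "__version__", "unknown"))
```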
Navigate to your home directory and download the quantized model using huggingface-cli:
cd ~
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
The Phi-3.5 vision model is now downloaded in ONNX format with INT4 quantization and is ready to run with ONNX Runtime.
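As an illustration of what comes next, here is a minimal inference sketch modeled on the phi3v.py example script shipped in the onnxruntime-genai repository. The image file (example.jpg), the prompt text, and the max_length value are placeholders, and the generation API has changed between onnxruntime-genai releases, so check the examples/python directory of the repository you built if any call does not match your installed version:

```python
# Minimal sketch of running the INT4 Phi-3.5 vision model with onnxruntime-genai,
# modeled on the repository's phi3v.py example. Paths and prompt are placeholders.
import onnxruntime_genai as og

model_dir = "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"  # downloaded model folder
model = og.Model(model_dir)
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Load an image and build the Phi-3.5 vision chat prompt that references it.
image = og.Images.open("example.jpg")  # placeholder image file
prompt = "<|user|>\n<|image_1|>\nDescribe this image.<|end|>\n<|assistant|>\n"
inputs = processor(prompt, images=image)

params = og.GeneratorParams(model)
params.set_search_options(max_length=3072)  # placeholder generation limit

generator = og.Generator(model, params)
generator.set_inputs(inputs)  # on older releases this may be params.set_inputs(inputs)

# Stream generated tokens to the terminal as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
```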