Overview

In this Learning Path, you’ll run quantized Phi models with ONNX Runtime on Microsoft Azure Cobalt 100 servers.

Specifically, you’ll deploy the Phi-3.5 vision model on Arm-based servers running Ubuntu 24.04 LTS.

Note

These instructions have been tested on a 32-core Azure Dpls_v6 instance with 64GB of RAM and 32GB of disk space.

You will learn how to build and configure ONNX Runtime to enable efficient LLM inference on Arm CPUs.

This Learning Path walks you through the following tasks:

  • Build ONNX Runtime.
  • Quantize and convert the Phi-3.5 vision model to ONNX format.
  • Run the model using a Python script with ONNX Runtime for CPU-based LLM inference.
  • Analyze performance on Arm CPUs.

Install dependencies

On your Arm-based server, install the following packages:

    sudo apt update
    sudo apt install python3-pip python3-venv cmake -y

Create a requirements file

Use a file editor of your choice and create a requirements.txt file with the Python packages shown below:

    requests
    torch
    transformers
    accelerate
    huggingface-hub

Install Python dependencies

Create a virtual environment:

    python3 -m venv onnx-env

Activate the virtual environment:

    source onnx-env/bin/activate

Install the required libraries using pip:

    pip install -r requirements.txt

Clone and build ONNX Runtime

Clone and build the onnxruntime-genai repository, which includes the KleidiAI-optimized ONNX Runtime, using the following commands:

    git clone https://github.com/microsoft/onnxruntime-genai.git
    cd onnxruntime-genai/
    python3 build.py --config Release
    cd build/Linux/Release/wheel/
    pip install onnxruntime_genai-0.9.0.dev0-cp312-cp312-linux_aarch64.whl
Note

Ensure you’re using Python 3.12 to match the cp312 wheel format.
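You can confirm the interpreter version in your virtual environment before installing the wheel; if it differs, the wheel produced by the build carries a different cp tag and filename:

    python3 --version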

This build includes optimizations from KleidiAI for efficient inference on Arm CPUs.
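To confirm that your virtual environment resolves to the locally built wheel, you can import the package and print its location. This is an optional sanity check, not part of the original steps:

    python3 -c "import onnxruntime_genai as og; print(og.__file__)"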

Download the quantized model

Navigate to your home directory and download the quantized model using huggingface-cli:

    cd ~
    huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .

The Phi-3.5 vision model is now downloaded in ONNX format with INT4 quantization and is ready to run with ONNX Runtime.
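As a preview of the Python script used in the next step, the sketch below loads the downloaded INT4 model with onnxruntime-genai and streams a text-only response. It is modeled on the Phi-3 vision examples shipped in the onnxruntime-genai repository; the exact API surface (for example, create_multimodal_processor and Generator.set_inputs versus GeneratorParams.set_inputs) varies between releases, and the model_dir path and prompt template below are assumptions based on the download command above, so compare against the examples in the repository version you built.

    # Minimal sketch: CPU inference for Phi-3.5 vision with onnxruntime-genai.
    # Assumes the model files sit under the directory created by the download step.
    import onnxruntime_genai as og

    model_dir = "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"  # assumed local path
    model = og.Model(model_dir)
    processor = model.create_multimodal_processor()
    tokenizer_stream = processor.create_stream()

    # Text-only prompt in the Phi-3.5 chat format; pass og.Images.open("photo.png")
    # as the images argument to include an image in the request.
    prompt = "<|user|>\nWhat workloads suit Arm-based cloud servers?<|end|>\n<|assistant|>\n"
    inputs = processor(prompt, images=None)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256)
    generator = og.Generator(model, params)
    generator.set_inputs(inputs)

    # Stream tokens to stdout until the model stops or max_length is reached.
    while not generator.is_done():
        generator.generate_next_token()
        token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(token), end="", flush=True)
    print()

Run it from the home directory where the model was downloaded, with the onnx-env virtual environment still active.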
