Run a Large Language model (LLM) chatbot on Arm servers

Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

Log an issue

Fork and edit

Discuss on Discord

Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

Before you begin

The instructions in this Learning Path are for any Arm server running Ubuntu 24.04 LTS. You need an Arm server instance with at least 16 cores and 64GB of RAM to run this example. Configure disk storage up to at least 50 GB. The instructions have been tested on an AWS Graviton4 r8g.4xlarge instance.

Overview

Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you learn how to run generative AI inference-based use cases like a LLM chatbot using PyTorch on Arm-based CPUs. PyTorch is a popular deep learning framework for AI applications.

This learning path shows you how you can run the Meta Llama 3.1 model using PyTorch on Arm Neoverse V2-based AWS Graviton4 CPUs with KleidiAI optimizations, showcasing significant performance improvements.

Install necessary tools

First, ensure your system is up-to-date and install the required tools:

    

        
        
sudo apt-get update
sudo apt install gcc g++ build-essential python3-pip python3-venv google-perftools -y

Set up your Python virtual environment

Set up a Python virtual environment to isolate dependencies:

    

        
        
python3 -m venv torch_env
source torch_env/bin/activate

Install PyTorch and optimized libraries

Torchchat is a library developed by the PyTorch team that facilitates running large language models (LLMs) seamlessly on a variety of devices. TorchAO (Torch Architecture Optimization) is a PyTorch library designed for enhancing the performance of ML models through different quantization and sparsity methods.

Start by installing pytorch and cloning the torchao and torchchat repositories:

    

        
        
pip install torch
git clone --recursive https://github.com/pytorch/ao.git
cd ao
git checkout e1cb44ab84eee0a3573bb161d65c18661dc4a307
curl -L https://github.com/pytorch/ao/commit/738d7f2c5a48367822f2bf9d538160d19f02341e.patch | git apply
python3 setup.py install
cd ../

git clone --recursive https://github.com/pytorch/torchchat.git
cd torchchat
git checkout 925b7bd73f110dd1fb378ef80d17f0c6a47031a6
wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-modified-generate.py-for-cli-and-browser.patch
wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch
git apply 0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch
git apply --whitespace=nowarn 0001-modified-generate.py-for-cli-and-browser.patch
sed -i 's/"groupsize": 0/"groupsize": 32/' config/data/aarch64_cpu_channelwise.json
pip install -r requirements.txt

You can now download the LLM.

Install the Hugging Face CLI application.

    

        
        
pip install -U "huggingface_hub[cli]"

Generate an Access Token to authenticate your identity with Hugging Face Hub. A token with read-only access is sufficient.

    

        
        
huggingface-cli login

Before you can download the model, you must accept the license agreement at: Meta Llama 3.1 .

Downloading and Quantizing the LLM Model

In this step, you will download the Meta Llama3.1 8B Instruct model and quantize the model to int 4-bit using Kernels for PyTorch. By using channel-wise quantization, the weights are quantized independently across different channels, or groups of channels. This can improve accuracy over simpler quantization methods.

    

        
        
python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --quantize config/data/aarch64_cpu_channelwise.json --device cpu --max-seq-length 1024

The output from this command should look like:

    

        
        linear: layers.31.feed_forward.w1, in=4096, out=14336
linear: layers.31.feed_forward.w2, in=14336, out=4096
linear: layers.31.feed_forward.w3, in=4096, out=14336
linear: output, in=4096, out=128256
Time to quantize model: 44.14 seconds
-----------------------------------------------------------
Exporting model using AOT Inductor to /home/ubuntu/torchchat/exportedModels/llama3.1.so
The generated DSO model can be found at: /home/ubuntu/torchchat/exportedModels/llama3.1.so

Running LLM Inference on Arm CPU

You can now run the LLM on the Arm CPU on your server.

To run the model inference:

    

        
        
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --device cpu --max-new-tokens 32 --chat

The output from running the inference will look like:

    

        
        PyTorch version 2.5.0.dev20240828+cpu available.
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cpu
Loading model...
Time to load model: 0.04 seconds
-----------------------------------------------------------
Starting Interactive Chat
Entering Chat Mode. Will continue chatting back and forth with the language model until the models max context length of 8192 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no.
no
User: What's the weather in Boston like?
Model: Boston, Massachusetts!

Boston's weather is known for being temperamental and unpredictable, especially during the spring and fall months. Here's a breakdown of the typical
=====================================================================
Input tokens        :   18
Generated tokens    :   32
Time to first token :   0.66 s
Prefill Speed       :   27.36 t/s
Generation  Speed   :   24.6 t/s
=====================================================================

Bandwidth achieved: 254.17 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

You have successfully run the Llama3.1 8B Instruct Model on your Arm-based server. In the next section, you will walk through the steps to run the same chatbot in your browser.

Back

Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

Introduction

Run a Large Language model (LLM) chatbot on Arm servers

Chatbot with Streamlit Frontend

Next Steps

Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

Before you begin

Overview

Install necessary tools

Set up your Python virtual environment

Install PyTorch and optimized libraries

Downloading and Quantizing the LLM Model

Running LLM Inference on Arm CPU

Run a Large Language model (LLM) chatbot on Arm servers

Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

Introduction

Run a Large Language model (LLM) chatbot on Arm servers

Chatbot with Streamlit Frontend

Next Steps

Run a Large Language Model (LLM) chatbot with PyTorch using KleidiAI on Arm servers

Before you begin

Overview

Install necessary tools

Set up your Python virtual environment

Install PyTorch and optimized libraries

Login to Hugging Face

Downloading and Quantizing the LLM Model

Running LLM Inference on Arm CPU