This Learning Path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG)-enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually relevant responses by combining the LLM's capabilities with retrieved information.

The instructions in this Learning Path are designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
In this Learning Path, you learn how to build a RAG chatbot using llama-cpp-python, a Python binding for llama.cpp that enables efficient LLM inference on Arm CPUs.
The tutorial demonstrates how to integrate the FAISS vector database with the Llama-3.1-8B model for document retrieval, while leveraging llama-cpp-python’s optimized C++ backend for high-performance inference.
This architecture enables the chatbot to combine the model’s generative capabilities with contextual information retrieved from your documents, all optimized for Arm-based systems.
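Before diving into the setup, it can help to see the RAG pattern itself in a few lines of plain Python. The sketch below uses placeholder data and a naive keyword-overlap retriever purely for illustration; in the chatbot you build in this Learning Path, retrieval is handled by FAISS over real embeddings and generation by the Llama 3.1 model running through llama-cpp-python.

```python
# Illustrative sketch of the RAG flow: retrieve relevant chunks, fold them
# into the prompt, and send the augmented prompt to the LLM.
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Naive keyword-overlap retriever used only for illustration;
    # the real chatbot uses FAISS similarity search over embeddings.
    query_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    # Combine the retrieved context with the user question.
    context = "\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

documents = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Arm Neoverse-based servers provide energy-efficient LLM inference.",
]
question = "What is FAISS used for?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)  # In the full chatbot, this prompt is passed to the Llama 3.1 model.
```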
Install the following packages on your Arm-based server instance:
```bash
sudo apt update
sudo apt install python3-pip python3-venv cmake -y
```
Using a text editor of your choice, create a file named `requirements.txt`:

```bash
vim requirements.txt
```
Add the following dependencies to your `requirements.txt` file:
```
# Core LLM & RAG Components
langchain==0.1.16
langchain_community==0.0.38
langchainhub==0.1.20

# Vector Database & Embeddings
faiss-cpu
sentence-transformers

# Document Processing
pypdf
PyPDF2
lxml

# API and Web Interface
flask
requests
flask_cors
streamlit

# Environment and Utils
argparse
python-dotenv==1.0.1
```
Create a virtual environment:
```bash
python3 -m venv rag-env
```
Activate the virtual environment:
```bash
source rag-env/bin/activate
```
Install the required libraries using pip:
```bash
pip install -r requirements.txt
```
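Optionally, you can confirm that the core libraries from `requirements.txt` import cleanly inside the virtual environment. This is only a sanity check; the versions printed depend on your installation.

```python
# Sanity check: confirm the RAG dependencies import inside rag-env.
import faiss
import langchain
import sentence_transformers

print("langchain:", langchain.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
print("FAISS flat index available:", hasattr(faiss, "IndexFlatL2"))
```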
Install the `llama-cpp-python` package, which includes the KleidiAI-optimized `llama.cpp` backend, using the following command:

```bash
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```
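You can verify the binding installed correctly by importing it and printing its version; the value printed will vary with the release you install.

```python
# Confirm that the llama_cpp Python binding is available in rag-env.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
```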
Create a directory called `models` and navigate to it:

```bash
mkdir models
cd models
```
Download the Llama 3.1 8B GGUF model from Hugging Face:

```bash
wget https://huggingface.co/chatpdflocal/llama3.1-8b-gguf/resolve/main/ggml-model-Q4_K_M.gguf
```
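Before moving on, you can optionally check that the downloaded model loads and generates text with llama-cpp-python. The sketch below assumes you run it from the `models` directory; adjust `n_threads` and the prompt for your instance.

```python
# Optional check: load the downloaded GGUF model and run a short completion.
from llama_cpp import Llama

llm = Llama(
    model_path="ggml-model-Q4_K_M.gguf",  # path relative to the models directory
    n_ctx=2048,       # context window size in tokens
    n_threads=16,     # match the number of cores on your instance
    verbose=False,
)

output = llm("Briefly explain what a vector database is.", max_tokens=64)
print(output["choices"][0]["text"])
```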
Navigate to your home directory:
```bash
cd ~
```
Clone the source repository for llama.cpp:
```bash
git clone https://github.com/ggerganov/llama.cpp
```
By default, `llama.cpp` builds for CPU only on Linux and Windows. You do not need to provide any extra switches to build it for the Arm CPU that you run it on.
Run `cmake` to build it:
```bash
cd llama.cpp
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
cmake --build . -v --config Release -j `nproc`
```
`llama.cpp` is now built in the `bin` directory.
Run the following commands to requantize the model to the Q4_0 format:
```bash
cd bin
./llama-quantize --allow-requantize ../../../models/ggml-model-Q4_K_M.gguf ../../../models/llama3.1-8b-instruct.Q4_0_arm.gguf Q4_0
```
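With the quantized model in place, the sketch below shows how the pieces fit together: sentence-transformers embeddings, a FAISS index, and the quantized Llama 3.1 model served through llama-cpp-python's LangChain wrapper. It is a minimal illustration rather than the chatbot built in the rest of this Learning Path; the sample texts, prompt wording, embedding model, and parameter values are placeholders to adapt to your own documents.

```python
# Minimal RAG sketch: embed sample texts, index them with FAISS, retrieve the
# most relevant chunks for a question, and answer with the quantized model.
import os

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import FAISS

# Placeholder documents; the chatbot replaces these with your processed files.
texts = [
    "KleidiAI provides optimized micro-kernels for AI workloads on Arm CPUs.",
    "FAISS stores vector embeddings and performs fast similarity search.",
]

# Build a FAISS index over sentence-transformers embeddings.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = FAISS.from_texts(texts, embeddings)

# Load the quantized model produced in the previous step.
llm = LlamaCpp(
    model_path=os.path.expanduser("~/models/llama3.1-8b-instruct.Q4_0_arm.gguf"),
    n_ctx=2048,
    n_threads=16,  # match the cores on your instance
    temperature=0.1,
)

# Retrieve context and build an augmented prompt.
question = "What does FAISS do in a RAG pipeline?"
context = "\n".join(doc.page_content for doc in index.similarity_search(question, k=2))
prompt = (
    "Use the context to answer the question.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)

print(llm.invoke(prompt))
```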