This Learning Path demonstrates how to build an AI Agent Application using open-source LLMs optimized for the Arm architecture. The AI Agent can use Large Language Models (LLMs) to perform actions by accessing tools and knowledge.
The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You will need an Arm server instance with at least 4 cores and 16GB of memory, and at least 32 GB of disk storage. The instructions have been tested on an AWS EC2 Graviton3 m7g.xlarge instance.
In this Learning Path, you will learn how to build an AI agent application using llama-cpp-python and llama-cpp-agent. llama-cpp-python is a Python binding for llama.cpp that enables efficient LLM inference on Arm CPUs, and llama-cpp-agent provides an interface for processing text using agentic chains with tools.
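To see how the two libraries typically fit together, here is a minimal sketch. The class names and arguments are based on the llama-cpp-agent API and are shown only as an illustration; the model path and prompt are placeholders, and the full agent script is developed in the next section.

# Minimal sketch: llama-cpp-python loads a GGUF model, llama-cpp-agent wraps it for an agent.
from llama_cpp import Llama
from llama_cpp_agent import LlamaCppAgent, MessagesFormatterType
from llama_cpp_agent.providers import LlamaCppPythonProvider

# Load a quantized GGUF model on the CPU (placeholder path).
llm = Llama(model_path="model.gguf", n_ctx=4096, n_threads=4)

# Expose the model to llama-cpp-agent through a provider.
provider = LlamaCppPythonProvider(llm)

# Create an agent with a system prompt and a predefined chat message format.
agent = LlamaCppAgent(
    provider,
    system_prompt="You are a helpful assistant.",
    predefined_messages_formatter_type=MessagesFormatterType.CHATML,
)

print(agent.get_chat_response("Hello!"))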
Install the following packages on your Arm-based server instance:
sudo apt-get update
sudo apt-get upgrade
sudo apt install python3-pip python3-venv cmake -y
Create and activate a Python virtual environment:
python3 -m venv ai-agent
source ai-agent/bin/activate
Install the llama-cpp-python package using pip:
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
Install the llama-cpp-agent and pydantic packages using pip:
pip install llama-cpp-agent pydantic
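Optionally, confirm that the packages were installed into the virtual environment with a quick import check. This sketch uses only the standard library's importlib.metadata, so no extra packages are required:

# Optional sanity check: confirm the packages import and report their installed versions.
from importlib.metadata import version

import llama_cpp          # Python bindings for llama.cpp
import llama_cpp_agent    # agent framework built on top of llama-cpp-python
import pydantic           # used for structured tool and model definitions

print("llama-cpp-python:", version("llama-cpp-python"))
print("llama-cpp-agent:", version("llama-cpp-agent"))
print("pydantic:", version("pydantic"))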
You are now ready to download the LLM.
Create and navigate to the models directory:
mkdir models
cd models
Install the huggingface_hub Python library using pip:
pip install huggingface_hub
You can now download the pre-quantized Llama3.1 8B model using the huggingface-cli:
huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
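If you prefer to download the model from Python instead of the command line, the huggingface_hub library provides hf_hub_download, which fetches the same file into the current directory. The repository and filename below match the command above:

from huggingface_hub import hf_hub_download

# Download the Q4_0 GGUF file into the current directory (the models directory).
hf_hub_download(
    repo_id="cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf",
    filename="dolphin-2.9.4-llama3.1-8b-Q4_0.gguf",
    local_dir=".",
)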
Q4_0 in the model name refers to the quantization method used by the model. Quantization reduces the model's size, which lowers memory requirements and increases execution speed by easing the memory bandwidth bottleneck that occurs when large amounts of weight data are transferred from memory to the processor. The trade-off is that aggressive quantization can degrade the quality of the model's output, so the goal is to quantize to a level that meets your size and speed requirements without a noticeable loss in output quality.
The key aspect of a quantization format is the number of bits per parameter: the 'Q4' prefix indicates that weights are stored as 4-bit integers.
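As a rough check of what 4-bit weights mean in practice, you can estimate the memory footprint from the parameter count and bit width. The figures below are approximations that ignore GGUF metadata and the small per-block overhead that Q4_0 adds for scale factors:

# Rough size estimate for an 8-billion-parameter model at different precisions.
params = 8e9  # approximate parameter count of a Llama 3.1 8B model

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_0", 4)]:
    gib = params * bits / 8 / (1024 ** 3)
    print(f"{name}: ~{gib:.1f} GiB")

# Prints roughly: FP16 ~14.9 GiB, Q8_0 ~7.5 GiB, Q4_0 ~3.7 GiB.
# Actual GGUF files are somewhat larger because of block scales and metadata.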
As of llama.cpp commit 0f1a39f3, Arm has contributed code for performance optimization with KleidiAI kernels. You can leverage these kernels to enhance model performance when running your model with the llama.cpp framework.
Navigate to your home directory:
cd ~
Clone the llama.cpp repository:
git clone https://github.com/ggerganov/llama.cpp
Build llama.cpp:
cd llama.cpp
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-mcpu=native" -DCMAKE_C_FLAGS="-mcpu=native"
cmake --build . -v --config Release -j `nproc`
The build outputs its binaries to the bin directory, where you will find the compiled llama.cpp executables.
Check that llama.cpp has built correctly by running the help command:
cd bin
./llama-cli -h
If llama.cpp has built correctly, you will see the help options displayed. A snippet of the output is shown below:
usage: ./llama-cli [options]
general:
-h, --help, --usage print usage and exit
--version show version and build info
-v, --verbose print verbose information
--verbosity N set specific verbosity level (default: 0)
--verbose-prompt print a verbose prompt before generation (default: false)
--no-display-prompt don't print prompt at generation (default: false)
-co, --color colorise output to distinguish prompt and user input from generations (default: false)
-s, --seed SEED RNG seed (default: -1, use random seed for < 0)
-t, --threads N number of threads to use during generation (default: 4)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
-td, --threads-draft N number of threads to use during generation (default: same as --threads)
-tbd, --threads-batch-draft N number of threads to use during batch and prompt processing (default: same as --threads-draft)
--draft N number of tokens to draft for speculative decoding (default: 5)
-ps, --p-split N speculative decoding split probability (default: 0.1)
-lcs, --lookup-cache-static FNAME
path to static lookup cache to use for lookup decoding (not updated by generation)
-lcd, --lookup-cache-dynamic FNAME
path to dynamic lookup cache to use for lookup decoding (updated by generation)
-c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model)
-n, --predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
-b, --batch-size N logical maximum batch size (default: 2048)
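Before moving on, you can optionally confirm that the quantized model loads and generates text through llama-cpp-python. This sketch assumes the model was downloaded to ~/models as shown earlier; adjust the path and thread count for your instance:

from pathlib import Path

from llama_cpp import Llama

# Path to the Q4_0 model downloaded earlier (adjust if you used a different directory).
model_path = Path.home() / "models" / "dolphin-2.9.4-llama3.1-8b-Q4_0.gguf"

# Load the model on the CPU; set n_threads to the number of cores on your instance.
llm = Llama(model_path=str(model_path), n_ctx=2048, n_threads=4)

# Generate a short completion as a sanity check.
output = llm("The Arm architecture is", max_tokens=32)
print(output["choices"][0]["text"])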
In the next section, you will create a Python script to execute an AI agent powered by the downloaded model.