You can now move on to using the rtp-llm server program and submitting requests using an OpenAI-compatible API.
This enables applications to access the LLM multiple times without starting and stopping it.
You can also access the server over the network from a machine other than the one hosting the LLM.
One additional software package is required for this section. Install `jq` on your computer using the following command:

```bash
sudo apt install jq -y
```
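
To confirm that `jq` is installed and working, you can pipe a small piece of JSON through it:

```bash
echo '{"status": "ok"}' | jq .status
```

If the install succeeded, this prints `"ok"`.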
There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you will download the model from Hugging Face.
Hugging Face is an open source AI community where you can host your own AI models, train them, and collaborate with others in the community. You can browse through thousands of models that are available for a variety of use cases such as Natural Language Processing (NLP), audio, and computer vision.
The `huggingface_hub` library provides APIs and tools that let you easily download and fine-tune pre-trained models. You will use `huggingface-cli` to download the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).
Install the required Python packages:

```bash
sudo apt install python-is-python3 python3-pip python3-venv -y
```
Create and activate a Python virtual environment:

```bash
python -m venv venv
source venv/bin/activate
```
Your terminal prompt now has the `(venv)` prefix, indicating that the virtual environment is active. Use this virtual environment for the remaining commands.
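
If you want to double-check that the virtual environment is active, confirm which `python` is first on your path; it should resolve to the `venv` directory:

```bash
which python
```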
Install the `huggingface_hub` Python library using `pip`:

```bash
pip install huggingface_hub
```
You can now download the model using `huggingface-cli`:

```bash
huggingface-cli download Qwen/Qwen2-0.5B-Instruct
```
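
Once the download completes, you can check that the model files are in the Hugging Face cache. The snapshot directory listed here is the checkpoint path you will use when starting the server:

```bash
ls ~/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/
```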
The server executable was compiled during the previous stage, when you ran `bazelisk build`.
Install the pip wheel in your active virtual environment, along with `grpcio-tools`:

```bash
pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
pip install grpcio-tools
```
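
As an optional check, you can ask `pip` for the package details. The package name `maga_transformer` below is inferred from the wheel filename, so adjust it if your build produced a different wheel:

```bash
pip show maga_transformer
```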
Start the server from the command line. It listens on port 8088:

```bash
export CHECKPOINT_PATH=${HOME}/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/
export TOKENIZER_PATH=$CHECKPOINT_PATH
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
MODEL_TYPE=qwen_2 FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
```
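
The server takes some time to load the model. One way to confirm from another terminal that it is listening on port 8088 is with `ss`, assuming the `iproute2` tools are available on your system:

```bash
ss -ltn | grep 8088
```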
You can access the API using the `curl` command. In another terminal, use a text editor to create a file named `curl-test.sh` with the content below:
```bash
curl http://localhost:8088/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "any-model",
    "messages": [
        {
            "role": "system",
            "content": "You are a coding assistant, skilled in programming."
        },
        {
            "role": "user",
            "content": "Write a hello world program in C++."
        }
    ]
}' 2>/dev/null | jq -C
```
The `model` value in the API is not used, and you can enter any value. This is because there is only one model loaded in the server.
Run the script:

```bash
bash ./curl-test.sh
```
The `curl` command accesses the LLM, and you should see output similar to:
```json
{
    "id": "chat-",
    "object": "chat.completion",
    "created": 1730277073,
    "model": "AsyncModel",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Sure, here's a simple C++ program that prints \"Hello, World!\" to the console:\n\n```cpp\n#include <iostream>\n\nint main() {\n std::cout << \"Hello, World!\" << std::endl;\n return 0;\n}\n```\n\nThis program includes the `iostream` library, which is used for input/output operations. The `main` function is the entry point of the program, and it calls the `cout` object to print the message \"Hello, World!\" to the console."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 32,
        "total_tokens": 137,
        "completion_tokens": 105
    }
}
```
In the returned JSON data, you will see the LLM output, including the content created from the prompt.
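
Because the response is JSON, you can use `jq` to extract just the generated text instead of printing the whole response. For example, a variation of the same request selects only the message content:

```bash
curl http://localhost:8088/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "any-model",
    "messages": [
        {"role": "user", "content": "Write a hello world program in C++."}
    ]
}' 2>/dev/null | jq -r '.choices[0].message.content'
```

The `-r` flag prints the raw string, so the generated C++ code appears without JSON escaping.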
You can also use a Python program to access the OpenAI-compatible API.
Create and activate a new Python virtual environment:

```bash
python -m venv pytest
source pytest/bin/activate
```
Install the OpenAI Python package:

```bash
pip install openai==1.45.0
```
Use a text editor to create a file named `python-test.py` with the content below:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8088/v1',
    api_key='no-key'
)

completion = client.chat.completions.create(
    model="not-used",
    messages=[
        {"role": "system", "content": "You are a coding assistant, skilled in programming."},
        {"role": "user", "content": "Write a hello world program in C++."}
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```
Ensure that the server is still running, and then run the Python file:

```bash
python ./python-test.py
```
You should see the output generated by the LLM:
````output
Sure, here's a simple C++ program that prints "Hello, World!" to the console:

```cpp
#include <iostream>

int main() {
    std::cout << "Hello, World!" << std::endl;
    return 0;
}
```

This program includes the `iostream` library, which is used for input/output operations. The `main` function is the entry point of the program, and it calls the `cout` object to print the message "Hello, World!" to the console.
````
Now you can continue to experiment with different large language models, and have a go at writing scripts to access them.
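
For example, a small wrapper script makes it easy to send ad-hoc prompts from the command line. This is a minimal sketch that assumes the server is still running locally on port 8088; prompts containing double quotes would need additional escaping:

```bash
#!/bin/bash
# Usage: ./ask.sh "your prompt here"
PROMPT="$1"

# Send the prompt to the rtp-llm server and print only the generated text
curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"any-model\",
        \"messages\": [
            {\"role\": \"user\", \"content\": \"${PROMPT}\"}
        ]
    }" 2>/dev/null | jq -r '.choices[0].message.content'
```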