You can use the llama.cpp server program and submit requests using an OpenAI-compatible API. This lets applications make multiple requests to the LLM without starting and stopping it each time. You can also access the server over the network from another machine, with the LLM running on the host.
One additional software package is required for this section. Install jq on your computer using:
sudo apt install jq -y
The server executable was already compiled during the build described in the previous section, when you ran make.
Start the server from the command line; it listens on port 8080:
./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --port 8080
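The server takes a few seconds to load the model. If you want a script to wait until it is ready, the sketch below polls the server's health endpoint; it assumes the build exposes GET /health on the same port, which recent llama.cpp versions do:

# wait-ready.py: poll the server until the model has finished loading.
# Assumes the server exposes GET /health (available in recent llama.cpp builds).
import time
import urllib.error
import urllib.request

URL = "http://localhost:8080/health"

for _ in range(30):
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            if resp.status == 200:
                print("Server is ready")
                break
    except (urllib.error.URLError, OSError):
        pass  # not listening yet, or still loading the model
    time.sleep(1)
else:
    print("Server did not become ready in time")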
You can access the API using the curl command.
In another terminal, use a text editor to create a file named curl-test.sh with the commands below:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "any-model",
  "messages": [
    {
      "role": "system",
      "content": "You are a coding assistant, skilled in programming."
    },
    {
      "role": "user",
      "content": "Write a hello world program in C++."
    }
  ]
}' 2>/dev/null | jq -C
The model value in the API is not used; you can enter any value, because only one model is loaded in the server.
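If you want to confirm which model the server loaded, you can query the models endpoint from Python. This minimal sketch assumes the server implements the OpenAI-compatible GET /v1/models route, which llama.cpp provides:

# list-models.py: print the model identifiers reported by the server.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    data = json.load(resp)

# OpenAI-style list response: {"object": "list", "data": [{"id": ...}, ...]}
for model in data["data"]:
    print(model["id"])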
Run the script:
bash ./curl-test.sh
The curl command accesses the LLM and you see the output:
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "#include <iostream>\n\nint main() {\n std::cout << \"Hello, World!\";\n return 0;\n}",
        "role": "assistant"
      }
    }
  ],
  "created": 1733756813,
  "model": "any-model",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 25,
    "prompt_tokens": 33,
    "total_tokens": 58
  },
  "id": "chatcmpl-xMWf1T4FYHtYu830y8yAzVdiSfwW1x4V",
  "timings": {
    "prompt_n": 33,
    "prompt_ms": 59.956,
    "prompt_per_token_ms": 1.816848484848485,
    "prompt_per_second": 550.403629328174,
    "predicted_n": 25,
    "predicted_ms": 361.283,
    "predicted_per_token_ms": 14.45132,
    "predicted_per_second": 69.19783106318316
  }
}
In the returned JSON you see the LLM output, including the message content generated from the prompt, along with token usage and timing statistics.
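If you only need the generated text, you can send the same request from Python using just the standard library and pick out the message content. This is a minimal sketch of the curl request above, using the response fields shown in the output:

# extract-content.py: send the same chat request and print only the reply text.
import json
import urllib.request

payload = {
    "model": "any-model",  # not used by the server; any value works
    "messages": [
        {"role": "system", "content": "You are a coding assistant, skilled in programming."},
        {"role": "user", "content": "Write a hello world program in C++."},
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["choices"][0]["message"]["content"])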
Instead of the standard library, you can also use the OpenAI Python package to access the OpenAI-compatible API.
Create a Python venv:
python -m venv pytest
source pytest/bin/activate
Install the OpenAI Python package:
pip install openai==1.55.3
Use a text editor to create a file named python-test.py with the content below:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8080/v1',
    api_key='no-key'
)

completion = client.chat.completions.create(
    model="not-used",
    messages=[
        {"role": "system", "content": "You are a coding assistant, skilled in programming."},
        {"role": "user", "content": "Write a hello world program in C++."}
    ],
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
Run the Python file (make sure the server is still running):
python ./python-test.py
You see the output generated by the LLM:
Here's a simple Hello World program in C++:
```cpp
#include <iostream>

int main() {
    std::cout << "Hello, World!" << std::endl;
    return 0;
}
```
In this program, we include the iostream library, which allows us to use cout for output. We then print "Hello, World!" to the console using cout. Finally, we return 0 to indicate that the program has finished successfully.
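The stream=True argument prints tokens as they are generated. If you would rather receive the whole reply at once, a non-streaming variant is a small change; this sketch reuses the same client setup:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8080/v1',
    api_key='no-key'
)

# Without stream=True the server returns the complete reply in one response.
completion = client.chat.completions.create(
    model="not-used",
    messages=[
        {"role": "system", "content": "You are a coding assistant, skilled in programming."},
        {"role": "user", "content": "Write a hello world program in C++."}
    ],
)
print(completion.choices[0].message.content)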
You can continue to experiment with different large language models and write scripts to try them.
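As a starting point for such scripts, here is a small, hypothetical ask() helper (the name and defaults are illustrative); change HOST to another machine's address to use a server running elsewhere on your network:

# ask.py: minimal helper for your own experiments (ask() is an illustrative name).
from openai import OpenAI

HOST = "localhost"  # set to another machine's address to use a remote server

client = OpenAI(base_url=f"http://{HOST}:8080/v1", api_key="no-key")

def ask(prompt, system="You are a helpful assistant."):
    """Send one chat request and return the generated text."""
    completion = client.chat.completions.create(
        model="not-used",  # ignored; the server has a single model loaded
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

print(ask("Write a hello world program in C++."))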