You can use the llama.cpp server program and submit requests using an OpenAI-compatible API. This enables you to create applications that access the LLM multiple times without starting and stopping it. You can also access the server over the network from another machine, so the LLM can run on a separate host.

One additional software package is required for this section. Install jq on your computer using:

    sudo apt install jq -y

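To confirm jq is installed and working, you can print its version and run a trivial filter as a quick sanity check:

    jq --version
    echo '{"greeting": "hello"}' | jq -r .greeting
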
The server executable was already compiled during the build described in the previous section, when you ran make.
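
If the llama-server binary is not present, you can usually rebuild just that target. This assumes the Makefile-based build from the previous section:

    make llama-server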

Start the server from the command line; it listens on port 8080:

    ./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf --port 8080

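Before sending requests, you can check that the server is up. Recent llama.cpp builds expose a health endpoint; a quick check from another terminal might look like:

    curl http://localhost:8080/health
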
Use curl

You can access the API using the curl command.

In another terminal, use a text editor to create a file named curl-test.sh with the commands below:

    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "any-model",
      "messages": [
        {
          "role": "system",
          "content": "You are a coding assistant, skilled in programming."
        },
        {
          "role": "user",
          "content": "Write a hello world program in C++."
        }
      ]
    }' 2>/dev/null | jq -C

The model value in the API is not used; you can enter any value, because only one model is loaded in the server.
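
To see what the server reports as its loaded model, you can query the OpenAI-compatible models endpoint, which the llama.cpp server also provides:

    curl -s http://localhost:8080/v1/models | jq -C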

Run the script:

    bash ./curl-test.sh

The curl command accesses the LLM, and you see output similar to:

    {
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "#include <iostream>\n\nint main() {\n    std::cout << \"Hello, World!\" << std::endl;\n    return 0;\n}",
            "role": "assistant"
          }
        }
      ],
      "created": 1726252907,
      "model": "any-model",
      "object": "chat.completion",
      "usage": {
        "completion_tokens": 30,
        "prompt_tokens": 33,
        "total_tokens": 63
      },
      "id": "chatcmpl-wh33d82OqWKibRF0s7waublCpl9YytkI"
    }

In the returned JSON data you see the LLM output, including the content generated from the prompt and the token usage statistics.
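
If you only want the generated text rather than the full JSON response, you can change the jq filter. For example, replacing jq -C on the last line of curl-test.sh with the filter below prints just the message content:

    }' 2>/dev/null | jq -r '.choices[0].message.content'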

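The chat completions endpoint also supports streaming, which the Python example later in this section uses. As a sketch, adding "stream": true to the request body makes the server return the reply incrementally as server-sent events; the -N flag stops curl buffering the output:

    curl -N http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "any-model",
      "stream": true,
      "messages": [
        {"role": "user", "content": "Write a hello world program in C++."}
      ]
    }'
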
Use Python

You can also use a Python program to access the OpenAI-compatible API.

Create a Python venv:

    python -m venv pytest
    source pytest/bin/activate

Install the OpenAI Python package:

    pip install openai==1.45.0
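
You can confirm the package is available in the venv by printing its version:

    python -c "import openai; print(openai.__version__)"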

Use a text editor to create a file named python-test.py with the content below:

    from openai import OpenAI

    # Point the client at the local llama.cpp server; the API key is not checked
    client = OpenAI(
        base_url='http://localhost:8080/v1',
        api_key='no-key'
    )

    completion = client.chat.completions.create(
        model="not-used",
        messages=[
            {"role": "system", "content": "You are a coding assistant, skilled in programming."},
            {"role": "user", "content": "Write a hello world program in C++."}
        ],
        stream=True,
    )

    # Print the streamed reply chunk by chunk as it arrives
    for chunk in completion:
        print(chunk.choices[0].delta.content or "", end="")

Run the Python file (make sure the server is still running):

    python ./python-test.py

You see the output generated by the LLM:

    Here's a simple Hello World program in C++:

    ```cpp
    #include <iostream>

    int main() {
        std::cout << "Hello, World!" << std::endl;
        return 0;
    }
    ```

    This program includes the standard input/output library, `iostream`. It defines a `main` function, which is the entry point of the program. Inside `main`, `std::cout` is used to output the string "Hello, World!" to the console, and then `std::endl` is used to print a new line. The `return 0;` statement indicates that the program exited successfully.
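
As noted at the start of this section, you can also reach the server from another machine. A minimal sketch, assuming port 8080 is open on the host and with <server-ip> as a placeholder for the host's address:

    # On the machine hosting the model, listen on all interfaces
    ./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf --host 0.0.0.0 --port 8080

    # On a client machine, substitute the host's IP address
    curl http://<server-ip>:8080/v1/models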

You can continue to experiment with different large language models and write scripts to try them.
