You can use the llama.cpp server program and submit requests through an OpenAI-compatible API. This lets applications access the LLM multiple times without starting and stopping it, and it also lets you access the server over the network from another machine.

One additional software package is required for this section. Install jq on your computer using:

    sudo apt install jq -y

The server executable was already compiled when you ran make in the previous section.

Start the server from the command line; it listens on port 8080:

    ./llama-server -m llama-2-7b-chat.Q4_0_8_8.gguf --port 8080
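
The server binds to localhost by default. If you are hosting the LLM on another machine, you can pass the --host flag so the server listens on all network interfaces, and you can use the health endpoint to confirm the server is up. For example:

    # Optional: listen on all network interfaces instead of localhost only
    ./llama-server -m llama-2-7b-chat.Q4_0_8_8.gguf --host 0.0.0.0 --port 8080

    # Confirm the server is up and ready to accept requests
    curl http://localhost:8080/health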

Use curl

You can access the API using the curl command.

In another terminal, use a text editor to create a file named curl-test.sh with the command below:

    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "any-model",
      "messages": [
        {
          "role": "system",
          "content": "You are a coding assistant, skilled in programming."
        },
        {
          "role": "user",
          "content": "Write a hello world program in C++."
        }
      ]
    }' 2>/dev/null | jq -C

The model value in the API is not used; you can enter any value, because only one model is loaded in the server.
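
If you want to check which model the server has loaded, the server also implements the OpenAI-style models endpoint:

    curl http://localhost:8080/v1/models 2>/dev/null | jq -C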

Run the script:

    bash ./curl-test.sh

The curl command accesses the LLM, and you see output similar to:

    {
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "message": {
            "content": "Certainly! Here is a simple \"Hello World\" program in C++:\n```\n#include <iostream>\n\nint main() {\n    std::cout << \"Hello, World!\" << std::endl;\n    return 0;\n}\n```\nThis program will print \"Hello, World!\" to the console when run. Let me know if you have any questions or if you would like to learn more about C++!\n<|im_end|>\n\n",
            "role": "assistant"
          }
        }
      ],
      "created": 1714512615,
      "model": "any-model",
      "object": "chat.completion",
      "usage": {
        "completion_tokens": 104,
        "prompt_tokens": 64,
        "total_tokens": 168
      },
      "id": "chatcmpl-FlYmMwFbctdfrY7JkoL8wRO6Qka9YYd8"
    }

In the returned JSON data you see the LLM output: the generated text is in the message content field, along with metadata such as the finish reason and token usage.
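
If you only want the generated text, you can have jq extract the content field instead of printing the whole response. For example, a variation of the curl command above:

    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "any-model",
      "messages": [
        {"role": "user", "content": "Write a hello world program in C++."}
      ]
    }' 2>/dev/null | jq -r '.choices[0].message.content'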

Use Python

You can also use a Python program to access the OpenAI compatible API.

Install the OpenAI Python package:

    pip install openai
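
You can confirm the package is installed by printing its version (the version string will vary):

    python -c "import openai; print(openai.__version__)"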

Use a text editor to create a file named python-test.py with the content below:

    from openai import OpenAI

    # Point the client at the local llama.cpp server; the API key is not checked
    client = OpenAI(
        base_url="http://localhost:8080/v1",
        api_key="no-key"
    )

    completion = client.chat.completions.create(
        model="not-used",
        messages=[
            {"role": "system", "content": "You are a coding assistant, skilled in programming."},
            {"role": "user", "content": "Write a hello world program in C++."}
        ],
        stream=True,
    )

    # Print each piece of the response as it streams back from the server
    for chunk in completion:
        print(chunk.choices[0].delta.content or "", end="")

Run the Python file (make sure the server is still running):

    python ./python-test.py

You see the output generated by the LLM:

    Certainly! Here is a simple "Hello World" program in C++:

    #include <iostream>

    int main() {
        std::cout << "Hello, World!" << std::endl;
        return 0;
    }

    This program will print "Hello, World!" to the console when run. Let me know if you have any questions or if you would like to learn more about C++!
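
If you prefer to receive the complete response in one call instead of streaming it, a minimal variation of the script above omits stream=True and reads the full message from the response object:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")

    # Without stream=True, the call blocks until the full response is ready
    completion = client.chat.completions.create(
        model="not-used",
        messages=[
            {"role": "user", "content": "Write a hello world program in C++."}
        ],
    )

    print(completion.choices[0].message.content)

The response object also includes the token counts shown in the curl output, available as completion.usage.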

You can continue to experiment with different large language models and write scripts to try them.
