Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you’ll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs.

Use llama-cli for interactive text generation

The llama-cli tool provides an interactive command-line interface for text generation. This is ideal for quick testing and hands-on exploration of the model’s behavior.

Basic usage

bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color

This command starts an interactive session:

  • -m (model file path) specifies the model file to load
  • -n 256 sets the maximum number of tokens to generate per response
  • --color enables colored terminal output
  • The tool will prompt you to enter text, and the model will generate a response

In this example, llama-cli uses 16 threads, one per vCPU. You can try different thread counts with -t <number>.

Example interactive session

Once you start the interactive session, you can have conversations like this:

> Give me a brief explanation of the attention mechanism in transformer models.
In transformer models, the attention mechanism allows the model to focus on specific parts of the input sequence when computing the output. Here's a simplified explanation:

1. **Key-Query-Value (K-Q-V) computation**: For each input element, the model computes three vectors:
   - **Key (K)**: This represents the input element in a way that's useful for computing attention weights.
   - **Query (Q)**: This represents the current input element being processed and is used to compute attention weights.
   - **Value (V)**: This represents the input element in its original form, which is used to compute the output based on attention weights.

2. **Attention scores computation**: The attention mechanism computes the similarity between the Query (Q) and each Key (K) element using dot product and softmax normalization. This produces a set of attention scores, which represent how relevant each Key (K) element is to the Query (Q).

3. **Weighted sum**: The attention scores are used to compute a weighted sum of the Value (V) elements. The output is a weighted sum of the Values (V) based on the attention scores.

4. **Output**: The final output is a vector that represents the context of the input sequence, taking into account the attention scores. This output is used in the decoder to generate the next word in the output sequence.

The attention mechanism allows transformer models to selectively focus on specific parts of the input sequence, enabling them to better understand context and relationships between input elements. This is particularly useful for tasks like machine translation, where the model needs to capture long-range dependencies between input words.

To exit the session, type Ctrl+C or /bye.

You’ll then see performance metrics like this:

llama_perf_sampler_print:    sampling time =       9.47 ms /   119 runs   (    0.08 ms per token, 12569.98 tokens per second)
llama_perf_context_print:        load time =     616.69 ms
llama_perf_context_print: prompt eval time =     344.39 ms /    23 tokens (   14.97 ms per token,    66.79 tokens per second)
llama_perf_context_print:        eval time =    9289.81 ms /   352 runs   (   26.39 ms per token,    37.89 tokens per second)
llama_perf_context_print:       total time =   17446.13 ms /   375 tokens
llama_perf_context_print:    graphs reused =          0

In this example, the 8-bit model running on 16 threads processed 375 tokens in total: 23 prompt tokens plus 352 generated tokens, at roughly 38 tokens per second of eval time (352 tokens in about 9.3 seconds).

Run a non-interactive prompt

You can also use llama-cli in one-shot mode with a prompt:

bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models."

This command:

  • Loads the 4-bit model
  • Disables conversation mode using -no-cnv
  • Sends a one-time prompt using -p
  • Prints the generated response and exits

The 4-bit model delivers faster generation, around 60 tokens per second on Graviton4, showing how a more aggressive quantization recipe speeds up inference.

Use llama-server for API access

The llama-server tool runs the model as a web server compatible with the OpenAI API format, allowing you to make HTTP requests for text generation. This is useful for integrating the model into applications or for batch processing.

Start the server

bin/llama-server -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096

This starts a local server that:

  • Loads the specified model
  • Listens on all network interfaces (0.0.0.0)
  • Accepts connections on port 8080
  • Supports a 4096-token context window

Make an API request

Once the server is running, you can make requests using curl or any other HTTP client.

Open a new terminal on the AWS instance, and run:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "afm-4-5b",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in less than 100 words."
      }
    ],
    "max_tokens": 256,
    "temperature": 0.9
  }'

The response includes the model’s reply and performance metrics:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum-mechanical phenomena like superposition and entanglement to solve complex problems much faster than classical computers. Instead of binary bits (0 or 1), quantum bits (qubits) can exist in multiple states simultaneously, allowing for parallel processing of vast combinations of possibilities. This enables quantum computers to perform certain calculations exponentially faster, particularly in areas like cryptography, optimization, and drug discovery. However, quantum systems are fragile and prone to errors, requiring advanced error correction techniques. Current quantum computers are still in early stages but show promise for transformative applications."
      }
    }
  ],
  "created": 1753876147,
  "model": "afm-4-5b",
  "system_fingerprint": "b6030-1e15bfd4",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 115,
    "prompt_tokens": 20,
    "total_tokens": 135
  },
  "id": "chatcmpl-0Zwzu03zbu77MFx4ogBsqz8E4IdxHOLU",
  "timings": {
    "prompt_n": 20,
    "prompt_ms": 68.37,
    "prompt_per_token_ms": 3.4185000000000003,
    "prompt_per_second": 292.525961679099,
    "predicted_n": 115,
    "predicted_ms": 1884.943,
    "predicted_per_token_ms": 16.390808695652172,
    "predicted_per_second": 61.00980241842857
  }
}

What’s next?

You’ve now successfully:

  • Run AFM-4.5B in interactive and non-interactive modes
  • Tested performance with different quantized models
  • Served the model as an OpenAI-compatible API endpoint

You can also interact with the server from Python using the OpenAI client library, which enables streaming responses and other features.
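
As a minimal sketch, assuming the server from the previous step is still listening on localhost:8080 and the client library is installed (pip install openai), a streaming request looks like this:

from openai import OpenAI

# llama-server does not require an API key by default, but the client expects one to be set.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Request a streamed completion so tokens can be printed as they arrive.
stream = client.chat.completions.create(
    model="afm-4-5b",  # llama-server answers with the model it loaded, whatever name is passed
    messages=[
        {"role": "user", "content": "Explain quantum computing in less than 100 words."}
    ],
    max_tokens=256,
    temperature=0.9,
    stream=True,
)

# Print each token delta as it is received.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Because the response is streamed, output starts appearing as soon as the first tokens are generated, rather than after the full completion is finished.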
