Build the DGX Spark AI runtime foundation

Orchestrate a persistent local AI agent with Hermes on NVIDIA DGX Spark


Set up the base DGX Spark AI runtime

In this section, you’ll prepare the base runtime that you’ll use in the rest of the Learning Path.

You’ll install Docker, configure GPU-enabled containers, create a persistent workspace, and start the initial runtime service stack:

Ollama for local inference
Qdrant for vector memory
Open WebUI for browser-based model access

You’ll add the Hermes Agent in the next section. In this section, you’ll build the local infrastructure it depends on.

Verify the DGX Spark environment

Start by verifying that your DGX Spark system exposes the expected Arm CPU and NVIDIA GPU environment.

Check the CPU architecture:

    

        
        
uname -m

The expected output is:

aarch64

Check the Linux distribution. DGX Spark runs Ubuntu 24.04:

    

        
        
lsb_release -a

Check that the NVIDIA GPU and CUDA driver stack are visible:

    

        
        
nvidia-smi

The output is similar to:

    

        
        nvidia-smi
Wed May 20 18:12:05 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   36C    P8              4W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3565      G   /usr/lib/xorg/Xorg                      137MiB |
|    0   N/A  N/A            3776      G   /usr/bin/gnome-shell                    164MiB |
|    0   N/A  N/A            5115      G   .../8305/usr/lib/firefox/firefox        239MiB |
|    0   N/A  N/A           85940      G   ...m Performix/arm-performix-gui         54MiB |
+-----------------------------------------------------------------------------------------+

Confirm that the output shows the GPU name (NVIDIA GB10), the driver version, and the CUDA version. Make a note of the CUDA version, because you’ll use a matching container image when you verify GPU passthrough later in this section.
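The CUDA version can also be read from the output programmatically instead of by eye. The following is a minimal sketch that assumes the header layout shown in the sample output above. The function name cuda_version_from_smi is hypothetical and isn't part of any NVIDIA tool.

```shell
# Hypothetical helper: extract the CUDA version from nvidia-smi output on stdin.
# Assumes the header contains "CUDA Version: <number>", as in the sample output.
# Usage: nvidia-smi | cuda_version_from_smi
cuda_version_from_smi() {
  sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n 1
}
```

For the sample header above, the helper prints 13.0.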

Install Docker

If you haven’t installed Docker yet, see the Docker Engine install guide for detailed steps.

To install Docker with one command, run:

    

        
        
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh

Allow your user to run Docker commands without sudo, then apply the new group membership in the current shell:

    

        
        
sudo usermod -aG docker $USER
newgrp docker

Verify Docker is working:

    

        
        
docker run hello-world

You’ll see a message confirming that Docker is installed and working.
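If docker run fails with a permission error instead, the new group membership may not have reached your current shell. The sketch below shows one way to check this. The helper name has_group is hypothetical, and it only tests whether a group name appears in a space-separated list such as the one printed by id -nG.

```shell
# Hypothetical helper: succeed if the space-separated group list ($1) contains group $2.
# Usage: has_group "$(id -nG)" docker && echo "docker group is active in this shell"
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}
```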

Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows Docker to expose the GPU to containers using the --gpus flag. Without it, containers can’t access the GPU regardless of the driver version installed on the host.

Add the NVIDIA Container Toolkit GPG key:

    

        
        
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

Add the NVIDIA Container Toolkit repository:

    

        
        
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install the toolkit:

    

        
        
sudo apt update
sudo apt install -y nvidia-container-toolkit

Register the NVIDIA runtime with Docker. This adds the nvidia runtime to Docker’s daemon configuration so containers can request GPU access with --gpus:

    

        
        
sudo nvidia-ctk runtime configure --runtime=docker

Restart Docker to apply the configuration change:

    

        
        
sudo systemctl restart docker
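
To confirm that the runtime was registered, you can inspect the Docker daemon configuration. This is a minimal sketch that assumes nvidia-ctk wrote to the default path /etc/docker/daemon.json. The helper name has_nvidia_runtime is hypothetical, and it only checks that the file mentions an "nvidia" entry.

```shell
# Hypothetical helper: succeed if the daemon config file ($1) mentions an "nvidia" runtime.
# Assumes the default Docker config location when used as shown.
# Usage: has_nvidia_runtime /etc/docker/daemon.json && echo "nvidia runtime registered"
has_nvidia_runtime() {
  [ -r "$1" ] && grep -q '"nvidia"' "$1"
}
```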

Verify GPU-enabled containers

Run a CUDA validation container:

    

        
        
docker run --rm --gpus all \
nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
nvidia-smi

If you have not pulled this image before, Docker downloads it before running nvidia-smi. The download can take a few minutes depending on your network connection.

The output is similar to:

    

        
        Unable to find image 'nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04' locally
13.0.1-devel-ubuntu24.04: Pulling from nvidia/cuda
03f66a4525ea: Pull complete 
c03b8ec8dd33: Pull complete 
cae1e96ffa7d: Pull complete 
2cb956a72162: Pull complete 
817eab9d3c52: Pull complete 
cc43ec4c1381: Pull complete 
30fc8198a31e: Pull complete 
c88eadd06616: Pull complete 
c7ba38867e8d: Pull complete 
fd2e70db7702: Pull complete 
85eb6b47da08: Pull complete 
Digest: sha256:7d2f6a8c2071d911524f95061a0db363e24d27aa51ec831fcccf9e76eb72bc92
Status: Downloaded newer image for nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04

==========
== CUDA ==
==========

CUDA Version 13.0.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Sun May 24 10:13:04 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.159.03             Driver Version: 580.159.03     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   44C    P0             10W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If the command prints GPU information from inside the container, Docker GPU passthrough is working.

Docker can now run GPU-accelerated containers on DGX Spark.

Create the persistent workspace

Create the project directory:

    

        
        
mkdir -p ~/dgx-hermes-agent
cd ~/dgx-hermes-agent

Create the directory structure used by the runtime:

    

        
        
mkdir -p \
workspace/inbox \
workspace/memory \
workspace/logs \
workspace/processed \
workspace/config \
models \
compose \
qdrant

The workspace now looks like this:

    

        
        
dgx-hermes-agent/
|-- compose/
|-- models/
|-- qdrant/
|-- workspace/
|   |-- config/
|   |-- inbox/
|   |-- logs/
|   |-- memory/
|   `-- processed/

The workspace/ directory is shared across runtime services. Hermes will later monitor workspace/inbox/, write generated artifacts to workspace/memory/, and read runtime policies from workspace/config/.
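
Because later sections depend on this layout, it can help to check it before continuing. The following sketch lists any expected directory that is missing under a given project root. The helper name missing_dirs is hypothetical; the directory names come from the mkdir command above.

```shell
# Hypothetical helper: print each expected directory that is missing under the project root ($1).
# Prints nothing when the layout is complete.
# Usage: missing_dirs ~/dgx-hermes-agent
missing_dirs() {
  for d in workspace/inbox workspace/memory workspace/logs \
           workspace/processed workspace/config models compose qdrant; do
    [ -d "$1/$d" ] || printf '%s\n' "$d"
  done
}
```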

Create the runtime service stack

Create and edit the file ~/dgx-hermes-agent/compose/docker-compose.yml.

Add the following content:

    

        
        
services:

  ollama:
    image: ollama/ollama:latest
    container_name: ollama

    ports:
      - "11434:11434"

    dns:
      - 8.8.8.8
      - 1.1.1.1

    volumes:
      - ../models:/root/.ollama
      - ../workspace:/workspace

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    environment:
      - NVIDIA_VISIBLE_DEVICES=all

    restart: unless-stopped

  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant

    ports:
      - "6333:6333"
      - "6334:6334"

    volumes:
      - ../qdrant:/qdrant/storage

    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui

    ports:
      - "3000:8080"

    environment:
      - OLLAMA_BASE_URL=http://ollama:11434

    volumes:
      - open-webui:/app/backend/data

    depends_on:
      - ollama

    restart: unless-stopped

volumes:
  open-webui:

This Compose stack creates the first three runtime services. You’ll add Hermes as a fourth service later. The explicit DNS settings in the Ollama service help the container reach the model registry reliably. You’ll verify this in the networking step.
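
Before starting the stack, you can check that Compose parses the file and sees all three services. The sketch below reads service names, one per line, as printed by docker compose config --services, and reports any that are missing. The helper name missing_services is hypothetical.

```shell
# Hypothetical helper: read service names (one per line) on stdin and
# print each expected service that is not in the list.
# Usage:
#   docker compose -f ~/dgx-hermes-agent/compose/docker-compose.yml config --services | missing_services
missing_services() {
  actual=$(cat)
  for svc in ollama qdrant open-webui; do
    printf '%s\n' "$actual" | grep -qxF -- "$svc" || printf '%s\n' "$svc"
  done
}
```

No output means that all three services are defined.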

Understand the role of each runtime service

The initial stack separates model execution, memory storage, and user interaction.

Service	Role
Ollama	Runs local language and embedding models
Qdrant	Stores persistent vector memory
Open WebUI	Provides a local browser interface to Ollama

The models/ directory persists Ollama models on the host. The qdrant/ directory persists vector database storage. The workspace/ directory is mounted into Ollama now and will also be mounted into Hermes later.

Ollama doesn’t orchestrate workspace files by itself. The workspace mount verification step confirms shared storage access. Hermes will become the service that reads workspace files and decides when to call Ollama.

Start the runtime stack

If Ollama is already installed as a host service, stop it to avoid port conflicts:

    

        
        
sudo systemctl stop ollama
sudo systemctl disable ollama

Start the container stack:

    

        
        
cd ~/dgx-hermes-agent/compose
docker compose up -d

Note

The first docker compose up -d run can take several minutes, depending on your network speed, because Docker needs to pull the service images.

Verify that the containers are running:

    

        
        
docker compose ps

The output is similar to:

    

        
        NAME         IMAGE                                COMMAND               SERVICE      CREATED         STATUS                            PORTS
ollama       ollama/ollama:latest                 "/bin/ollama serve"   ollama       5 seconds ago   Up 4 seconds                      0.0.0.0:11434->11434/tcp, [::]:11434->11434/tcp
open-webui   ghcr.io/open-webui/open-webui:main   "bash start.sh"       open-webui   4 seconds ago   Up 4 seconds (health: starting)   0.0.0.0:3000->8080/tcp, [::]:3000->8080/tcp
qdrant       qdrant/qdrant:latest                 "./entrypoint.sh"     qdrant       5 seconds ago   Up 4 seconds                      0.0.0.0:6333-6334->6333-6334/tcp, [::]:6333-6334->6333-6334/tcp
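
A container that is listed as running may still need a few seconds before it answers requests. The following is a minimal sketch of a generic retry loop that you could wrap around a readiness check. The helper name retry and its argument order are hypothetical.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the command succeeds, or 1 if every attempt fails.
# Usage: retry 30 2 curl -fsS -o /dev/null http://localhost:11434/
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

The same pattern can be pointed at http://localhost:6333/ for Qdrant or http://localhost:3000/ for Open WebUI.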

Validate the base DGX Spark AI runtime

After starting the runtime, verify that it works as expected.

Verify container networking

Open a shell in the Ollama container:

    

        
        
docker exec -it ollama bash

You might see a warning such as groups: cannot find name for group ID 992. It’s a harmless warning that appears when the container’s /etc/group file has no entry for your host user’s GID. The shell opens normally and all commands work as expected.

Verify DNS resolution:

    

        
        
getent hosts registry.ollama.ai

The output is similar to:

    

        
        root@367b013fd34c:/# getent hosts registry.ollama.ai
2606:4700:3036::6815:4be3 registry.ollama.ai
2606:4700:3034::ac43:b6e5 registry.ollama.ai

Exit the container shell:

exit

The DNS settings in the Compose file help the container reach the Ollama model registry reliably.

Pull local models

Open a shell in the Ollama container:

    

        
        
docker exec -it ollama bash

Pull the language model:

    

        
        
ollama pull qwen2.5:7b

Pull the embedding model:

    

        
        
ollama pull nomic-embed-text

Exit the container:

exit

These model names are used throughout the examples, so make note of them. The architecture supports other suitable models.

Model	Purpose
qwen2.5:7b	Local chat, summarization, reasoning
nomic-embed-text	Embedding generation for semantic memory

Because the models/ directory is mounted into the container as a volume, downloaded models are stored on the host at ~/dgx-hermes-agent/models/. The models persist even if the container is removed or recreated.
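
You can confirm that both models are present without opening an interactive shell. The sketch below checks the output of ollama list for a model name. The helper name has_model is hypothetical, and it assumes that the first line of the output is a header and that the first column holds the model name.

```shell
# Hypothetical helper: succeed if model $1 appears in `ollama list` output read from stdin.
# A name given without a tag also matches "<name>:latest".
# Usage: docker exec ollama ollama list | has_model qwen2.5:7b
has_model() {
  awk -v m="$1" 'NR > 1 && ($1 == m || $1 == m ":latest") { found = 1 } END { exit !found }'
}
```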

Verify local inference

Open a shell in the Ollama container:

    

        
        
docker exec -it ollama bash

Run the local model:

    

        
        
ollama run qwen2.5:7b

Enter a short prompt, such as:

    

        
        
Summarize the role of CPU orchestration for an AI agent in one sentence.

After the model responds, type /bye to exit the interactive model session, then type exit to leave the container shell.

You can also monitor GPU activity from another terminal while the model is running, for example:

watch -n 1 nvidia-smi

During inference, you’ll see GPU utilization rise on the Blackwell GPU as the model processes the prompt and generates tokens. The change in GPU utilization shows that the model is running on the GPU rather than falling back to the CPU.

This step shows that local inference is available before Hermes begins calling Ollama programmatically.
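
To preview what a programmatic call looks like, you can send the same kind of prompt to Ollama's HTTP API from the host. The following sketch builds a request body for the /api/generate endpoint with streaming disabled. The helper name generate_payload is hypothetical, and it does no JSON escaping, so it assumes that the model name and prompt contain no double quotes or backslashes.

```shell
# Hypothetical helper: print a JSON body for Ollama's /api/generate endpoint.
# $1 is the model name, $2 is the prompt. No JSON escaping is performed.
# Usage:
#   curl -s http://localhost:11434/api/generate \
#     -d "$(generate_payload qwen2.5:7b 'Say hello in five words.')"
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}
```

With streaming disabled, Ollama returns a single JSON object whose response field holds the generated text.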

Verify Open WebUI

Open a browser and navigate to:

    

        
        
http://localhost:3000

On first launch, Open WebUI presents a setup screen asking for a name and email address to create a local admin account. The email can be a placeholder and no data leaves your system. Enter any values and continue to the main interface.

To verify that Ollama is reachable from the host, navigate to:

    

        
        
http://localhost:11434

If Ollama is running, the browser displays the message Ollama is running. The message confirms that the Ollama container is accessible on the expected port. Open WebUI connects to Ollama using the internal Docker network address http://ollama:11434, but from the host you use localhost:11434.

Use Open WebUI to confirm that the local model is listed and available for chat. Open WebUI is not used in the agent workflow in the sections that follow. Hermes calls Ollama directly through its API, so Open WebUI serves only as a convenient way to validate the inference stack before the agent takes over.

Verify Qdrant

Open the Qdrant dashboard:

    

        
        
http://localhost:6333/dashboard

Figure: Qdrant dashboard running locally before the workspace_memory collection is created.

Qdrant is running, but it doesn’t contain the workspace_memory collection yet. Hermes creates that collection later when you add persistent memory.
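
The same check can be made from the command line through Qdrant's REST API, which lists collections at /collections. The sketch below searches that JSON response for a collection name. The helper name has_collection is hypothetical, and it relies on simple pattern matching rather than a JSON parser.

```shell
# Hypothetical helper: succeed if the JSON on stdin contains a collection named $1.
# Uses a text pattern, not a JSON parser, so treat it as a quick check only.
# Usage:
#   curl -s http://localhost:6333/collections | has_collection workspace_memory \
#     || echo "workspace_memory not created yet"
has_collection() {
  grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$1\""
}
```

At this stage the check is expected to fail, because Hermes hasn't created the collection yet.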

Verify the shared workspace mount

Open another terminal on your DGX Spark system and create a test file on the host. Don’t run this command inside a container.

    

        
        
echo "Arm CPUs orchestrate persistent AI workflows." \
> ~/dgx-hermes-agent/workspace/inbox/test.txt

Verify that the shared mount is visible by opening a shell in the Ollama container:

    

        
        
docker exec -it ollama bash

Inside the container, run:

    

        
        
ls -l /workspace
cat /workspace/inbox/test.txt

The output is similar to:

    

        
        drwxrwxr-x 2 1001 1001 4096 May 20 18:16 config
drwxrwxr-x 2 1001 1001 4096 May 20 18:37 inbox
drwxrwxr-x 2 1001 1001 4096 May 20 18:16 logs
drwxrwxr-x 2 1001 1001 4096 May 20 18:16 memory
drwxrwxr-x 2 1001 1001 4096 May 20 18:16 processed

The file contains the following:

    

        
        Arm CPUs orchestrate persistent AI workflows.

Exit the container:

exit
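
The mount can also be verified from the host in one step, without an interactive shell. This sketch compares what the container sees with the file on the host. The helper name matches_file is hypothetical.

```shell
# Hypothetical helper: succeed if stdin is byte-identical to the file named by $1.
# Usage:
#   docker exec ollama cat /workspace/inbox/test.txt \
#     | matches_file ~/dgx-hermes-agent/workspace/inbox/test.txt && echo "mount OK"
matches_file() {
  cmp -s - "$1"
}
```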

What you’ve accomplished and what’s next

You’ve built the runtime foundation for the persistent local AI system. The DGX Spark environment now has Docker, Docker Compose, NVIDIA Container Toolkit, GPU-enabled containers, persistent workspace storage, and the initial Ollama, Qdrant, and Open WebUI services.

You’ve also verified shared workspace access, local inference, and the fixed model setup used by the later sections.

Next, you’ll add Hermes Agent as the persistent orchestration runtime.
