In this section, you’ll connect Hermes Agent to Ollama.
This step turns Hermes from a file watcher into an inference orchestrator. Hermes still controls the workflow, but it now sends document content to Ollama and uses the model response as part of the runtime output.
The runtime already watches workspace/inbox/ and reacts when a file is created. You’ll now extend that workflow so Hermes sends file content to a local large language model (LLM) and prints an AI-generated summary.
The workflow becomes:
workspace/inbox document
-> Hermes on_created() handler
-> Hermes calls Ollama
-> Local LLM summary
Connecting Hermes Agent to Ollama introduces the first GPU-accelerated step in the persistent runtime.
Hermes reaches Ollama through the Docker Compose network.
In the Hermes Compose service, you added this environment variable earlier:
environment:
- OLLAMA_HOST=http://ollama:11434
Inside the Docker network, the service name ollama resolves to the Ollama container. Hermes uses this URL when it creates the Ollama Python client.
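As a rough illustration of how that setting resolves, the sketch below reads OLLAMA_HOST with the same fallback and probes the URL using only the Python standard library. The helper names resolve_host and is_reachable are hypothetical and are not part of Hermes. The probe assumes the Ollama server answers a plain GET on its base URL.

```python
import os
import urllib.error
import urllib.request

DEFAULT_HOST = "http://ollama:11434"  # Compose service name, as set earlier


def resolve_host(env=None):
    """Return the Ollama base URL, falling back to the Compose service name."""
    env = os.environ if env is None else env
    return env.get("OLLAMA_HOST", DEFAULT_HOST).rstrip("/")


def is_reachable(host, timeout=2.0):
    """Hypothetical probe: True if an HTTP GET on the base URL succeeds."""
    try:
        with urllib.request.urlopen(host + "/", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL
        return False
```

Run inside the Hermes container, is_reachable(resolve_host()) should return True when both services share the Compose network. From the host shell, the name ollama usually does not resolve, so the same call returns False there.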
Verify that the Ollama container is running:
cd ~/dgx-hermes-agent/compose
docker ps
The output should list both the ollama and hermes containers.
You pulled qwen2.5:7b when you built the runtime foundation. Run an inference test to confirm that the model is still available inside the Ollama container:
docker exec -it ollama ollama run qwen2.5:7b
Enter a short prompt:
Summarize persistent AI runtimes in one sentence.
Type /bye to exit the model session.
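The interactive test above can also be approximated non-interactively. The sketch below is one possible check, assuming the Ollama HTTP API endpoint /api/tags, which lists locally installed models. The function names are hypothetical, and the host argument must be a URL that is reachable from wherever the script runs.

```python
import json
import urllib.request


def installed_models(tags_payload):
    """Extract model names from a decoded /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted="qwen2.5:7b"):
    """True if the wanted model tag appears in the payload."""
    return wanted in installed_models(tags_payload)


def fetch_tags(host, timeout=5.0):
    """GET /api/tags from an Ollama server and decode the JSON body."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

Inside the Hermes container, has_model(fetch_tags("http://ollama:11434")) would report whether qwen2.5:7b is still installed, without opening a model session.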
Open ~/dgx-hermes-agent/hermes/agent.py and replace its contents with the following version:
import os
import time

import ollama
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

WATCH_DIR = "/workspace/inbox"
OLLAMA_HOST = os.getenv(
    "OLLAMA_HOST",
    "http://ollama:11434"
)

client = ollama.Client(host=OLLAMA_HOST)


class WorkspaceHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        print("\n[Agent] New file detected:")
        print(event.src_path)
        summarize_file(event.src_path)


def summarize_file(path):
    try:
        with open(path, "r") as f:
            content = f.read()
        print("\n[Agent] Running local inference...")
        response = client.chat(
            model="qwen2.5:7b",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a local AI workspace assistant. "
                        "Summarize the document in 3 concise bullet points."
                    )
                },
                {
                    "role": "user",
                    "content": content[:4000]
                }
            ]
        )
        summary = response["message"]["content"]
        print("\n[Agent] AI Summary:")
        print(summary)
    except Exception as e:
        print(f"[Agent] Error: {e}")


if __name__ == "__main__":
    print("\n[Hermes Agent] Starting workspace watcher...")
    print(f"[Hermes Agent] Monitoring: {WATCH_DIR}")
    observer = Observer()
    observer.schedule(
        WorkspaceHandler(),
        WATCH_DIR,
        recursive=False
    )
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
The updated agent imports the ollama package and reads OLLAMA_HOST from the container environment, with http://ollama:11434 as the fallback. An ollama.Client is created at startup so the connection is ready before any files arrive.
When a new file is detected, summarize_file() sends the content to qwen2.5:7b using the chat API, with a system prompt that requests a three-bullet summary. The slice content[:4000] caps the input at the first 4000 characters, so very large files are never sent to the model in full.
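To make the truncation step easier to see in isolation, here is a minimal sketch that factors the message construction into a helper. The name build_messages and the constants are hypothetical. The prompt text and the 4000-character cap are copied from agent.py.

```python
MAX_CHARS = 4000  # same cap as content[:4000] in agent.py

SYSTEM_PROMPT = (
    "You are a local AI workspace assistant. "
    "Summarize the document in 3 concise bullet points."
)


def build_messages(content, max_chars=MAX_CHARS):
    """Return the chat messages list, truncating the document text."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # Slicing counts characters, not tokens; shorter text passes unchanged.
        {"role": "user", "content": content[:max_chars]},
    ]
```

Because the slice counts characters rather than tokens, the number of tokens the model receives still depends on the text and the tokenizer. The cap only bounds the request size from above.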
Rebuild the Hermes container:
cd ~/dgx-hermes-agent/compose
docker compose build hermes
Restart the runtime:
docker compose up -d
Follow the Hermes logs:
docker logs -f hermes
Expected startup output:
[Hermes Agent] Starting workspace watcher...
[Hermes Agent] Monitoring: /workspace/inbox
Leave this terminal open with the log stream running and open a second terminal for the next step.
After rebuilding Hermes, verify that it’s connected to Ollama as expected.
Create a new file in another terminal. Write the file outside the inbox first, then move it into workspace/inbox/ so Hermes sees a completed file.
cat > /tmp/ai-runtime-note.txt <<'EOF'
Persistent AI systems are not only prompt-response applications.
They run as long-lived local services that monitor events, coordinate
runtime workflows, store memory, and use GPU acceleration when model
inference is required.
EOF
mv /tmp/ai-runtime-note.txt \
~/dgx-hermes-agent/workspace/inbox/ai-runtime-note.txt
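The same write-then-move pattern can be sketched in Python for scripts that feed the inbox. This is an illustration only: drop_into_inbox and its staging_dir parameter are hypothetical. It assumes the staging directory is on the same filesystem as the inbox, in which case os.replace() performs a single rename and the watcher never sees a partially written file.

```python
import os
import tempfile


def drop_into_inbox(inbox_dir, name, text, staging_dir=None):
    """Write text to a staging file, then move it into the inbox in one step.

    staging_dir is a hypothetical parameter; by default the parent of the
    inbox is used so that both paths share a filesystem.
    """
    staging_dir = staging_dir or os.path.dirname(os.path.abspath(inbox_dir))
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        # mkstemp creates the file as 0600; widen it so another user can read
        os.chmod(tmp_path, 0o644)
        final_path = os.path.join(inbox_dir, name)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the staging file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

For example, drop_into_inbox(os.path.expanduser("~/dgx-hermes-agent/workspace/inbox"), "note.txt", "some text") stages the file in the workspace directory before renaming it into place.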
Return to the first terminal, where the Hermes log stream is still running.
The output is similar to:
[Agent] New file detected:
/workspace/inbox/ai-runtime-note.txt
[Agent] Running local inference...
[Agent] AI Summary:
- Persistent AI systems function beyond simple prompt-response interactions, operating as ongoing local services.
- These systems monitor events, manage workflows, and maintain stored memory for extended periods.
- They utilize GPU acceleration during model inference to enhance performance.
The generated summary text will vary because it is produced by the local model.
To observe GPU activity during inference, keep a terminal open with the Hermes log stream running. In another terminal, schedule a new file to be created after a short delay, then start nvtop immediately:
(
sleep 5
cat > /tmp/gpu-inference-test.txt <<'EOF'
DGX Spark combines Arm CPU orchestration with NVIDIA GPU acceleration.
The CPU coordinates persistent services, while the GPU accelerates local
language model inference and summarization workloads.
EOF
mv /tmp/gpu-inference-test.txt \
~/dgx-hermes-agent/workspace/inbox/gpu-inference-test.txt
) &
nvtop
The background command creates the file after five seconds, giving nvtop time to start before Ollama begins inference. During summarization, nvtop shows GPU activity from the Ollama model runtime, while the first terminal shows the Hermes log output as inference runs.
Press q to quit nvtop after reviewing the GPU activity.
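If a scriptable reading is preferred over the interactive nvtop view, GPU utilization can be sampled with nvidia-smi. The sketch below is an assumption-laden example: it presumes nvidia-smi is on the PATH and reports the utilization.gpu field on this platform, which this Learning Path does not confirm. The function names are hypothetical.

```python
import subprocess


def parse_utilization(output):
    """Parse nvidia-smi CSV output into a list of integer percentages.

    Non-numeric fields (for example "[N/A]") are skipped.
    """
    values = []
    for line in output.splitlines():
        field = line.strip()
        if field.isdigit():
            values.append(int(field))
    return values


def gpu_utilization():
    """Run nvidia-smi once and return per-GPU utilization percentages."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)
```

Calling gpu_utilization() in a loop while a file is being summarized gives a coarse numeric trace of the same activity that nvtop displays.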
The Arm Grace CPU coordinates the full workflow: watching the workspace, handling filesystem events, reading files, preparing model requests, and sending API calls to Ollama.
The Blackwell GPU accelerates the model workload: running LLM inference, generating tokens, and producing the summary. This pattern repeats throughout the Learning Path: Hermes orchestrates and Ollama executes.
You’ve now extended Hermes with local LLM inference through the Ollama Python SDK and the OLLAMA_HOST runtime setting. New files in the workspace can now trigger summarization with qwen2.5:7b, and you can validate GPU activity with nvtop.
The runtime has moved from file detection to event-driven AI summarization.
Next, you’ll add persistent semantic memory with embeddings and Qdrant.
Before moving to the next section, press Ctrl+C in the first terminal running Hermes logs to stop the Hermes log stream. In the next section, you’ll rebuild the Hermes container and run docker logs -f hermes again.