Explore persistent AI runtime architecture on NVIDIA DGX Spark
In this Learning Path, you’ll build a persistent local AI runtime on NVIDIA DGX Spark. The implementation is validated on DGX Spark, but the architecture can also apply to other Arm Cortex-A platforms that can run containerized services and local AI runtimes.
The final system isn’t a single chatbot process. It’s a set of local services that run continuously, share a workspace, react to file events, generate summaries, create embeddings, store vector memory, retrieve context, and periodically reason about the state of the workspace.
The core idea is that AI systems are orchestration systems, not just inference systems.
DGX Spark is well suited to this type of workload because it combines Arm CPU orchestration with local GPU acceleration. In the Grace Blackwell architecture, the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication.
The Blackwell GPU accelerates local LLM inference, token generation, summarization, and embedding generation.
By the end of this Learning Path, you’ll have a local runtime with these capabilities:
| Capability | Runtime component |
|---|---|
| Local LLM inference | Ollama |
| Persistent vector memory | Qdrant |
| Workspace orchestration | Hermes Agent |
| Browser-based interaction | Open WebUI |
| Semantic retrieval | Hermes Agent + Qdrant + Ollama |
| Autonomous workspace cognition | Hermes Agent + Ollama |
The runtime uses four containerized services: Open WebUI, Ollama, Qdrant, and Hermes Agent.
These services communicate over a local Docker network and share a persistent workspace on the host.
    +------------------+      HTTP API      +------------------------------+
    | Open WebUI       | -----------------> | Ollama Container             |
    | User interface   |                    | Local inference runtime      |
    +------------------+                    +--------------^---------------+
                                                           |
                                                           | inference
                                                           | embeddings
                                                           |
                                          +----------------+---------------+
                                          | Hermes Container               |
                                          | CPU-side orchestration         |
                                          +----------+-------------+-------+
                                                     |             |
                                                     | files       | vectors
                                                     | events      | metadata
                                                     v             v
                                          +----------+----+   +----+----------+
                                          | Shared        |   | Qdrant        |
                                          | workspace     |   | vector memory |
                                          +---------------+   +---------------+
The important architectural pattern is separation of responsibilities. Each service has a narrow role, and Hermes coordinates the overall workflow.
| Layer | Service | Purpose |
|---|---|---|
| Interaction layer | Open WebUI | Provides browser-based access to local models |
| Inference layer | Ollama | Runs local language and embedding models |
| Memory layer | Qdrant | Stores and searches vector memory |
| Orchestration layer | Hermes Agent | Watches files, schedules work, coordinates services |
Hermes is the orchestration runtime you’ll build.
Hermes runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, and reads documents. It sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries. Hermes doesn’t run the language model itself. Instead, it coordinates AI workflows across local services.
The Hermes runtime is the main CPU-side workload in the system. The Arm CPU keeps the runtime alive, schedules background loops, tracks file events, moves data between services, and manages runtime state.
Ollama provides the local inference runtime. It’s a convenient way to run local models and expose a simple API, but the architecture isn’t limited to Ollama.
Conceptually, Ollama is one possible inference backend. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
Hermes uses Ollama for two types of model calls:
| Model call | Model |
|---|---|
| Chat and summarization | qwen2.5:7b |
| Embedding generation | nomic-embed-text |
The chat model summarizes files, answers questions over retrieved memory, and generates workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory. Ollama doesn’t watch files, manage memory, or decide when work happens. Hermes calls it when the workflow requires inference.
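The two call types can be sketched as plain HTTP requests. This is a minimal sketch, not the Hermes implementation: it assumes Ollama listens on its default port 11434 under the hypothetical hostname `ollama` on the Docker network, and it uses Ollama's `/api/generate` and `/api/embeddings` endpoints. The function names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is reachable under this hostname on the Docker network.
OLLAMA_URL = "http://ollama:11434"


def _post_request(endpoint, body):
    """Build (without sending) a JSON POST request for an Ollama endpoint."""
    return urllib.request.Request(
        f"{OLLAMA_URL}{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_generate_request(prompt, model="qwen2.5:7b"):
    """Chat-model call: ask for one complete, non-streamed response."""
    return _post_request(
        "/api/generate", {"model": model, "prompt": prompt, "stream": False}
    )


def build_embedding_request(text, model="nomic-embed-text"):
    """Embedding-model call: convert text into a vector."""
    return _post_request("/api/embeddings", {"model": model, "prompt": text})


def send(request, timeout=120):
    """Send a prepared request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running Ollama service, `send(build_generate_request("Summarize: ..."))` returns a JSON object whose `response` field holds the generated text, and `send(build_embedding_request("some text"))` returns one whose `embedding` field holds the vector.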
Qdrant provides persistent vector memory.
Hermes stores document embeddings in a Qdrant collection named workspace_memory. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt. Qdrant stores vectors and returns semantically similar memories when Hermes performs a retrieval query. It doesn’t run inference or decide when retrieval happens.
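As an illustration of what a stored point might look like, the sketch below builds the JSON body for a Qdrant points upsert (`PUT /collections/workspace_memory/points`). The payload field names (`path`, `summary`, `excerpt`) and the choice of a deterministic UUID as the point ID are assumptions for this example, not details confirmed by the Learning Path.

```python
import uuid

COLLECTION = "workspace_memory"


def make_point(path, summary, excerpt, vector):
    """Build one Qdrant point: an ID, a vector, and payload metadata."""
    # Assumption: derive the ID from the path so that re-processing the
    # same file overwrites its earlier point instead of adding a duplicate.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, path))
    return {
        "id": point_id,
        "vector": list(vector),
        "payload": {"path": path, "summary": summary, "excerpt": excerpt},
    }


def make_upsert_body(points):
    """Wrap points in the body expected by the points upsert endpoint."""
    return {"points": list(points)}
```

Qdrant accepts point IDs that are unsigned integers or UUIDs, which is why the sketch hashes the path into a UUID rather than using the path string directly.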
Open WebUI provides a local browser interface for interacting with the Ollama runtime.
Use it to confirm that local models are available, test prompts, or explore the inference runtime in a browser. The persistent AI runtime is still coordinated by Hermes.
The services use a shared workspace mounted into the containers.
The workspace structure is:
workspace/
|-- inbox/
|-- memory/
|-- logs/
|-- processed/
`-- config/
Each directory has a specific purpose:
| Directory | Purpose |
|---|---|
| workspace/inbox/ | Input files monitored by Hermes |
| workspace/memory/ | Generated memory artifacts and workspace summaries |
| workspace/logs/ | Runtime logs and diagnostics |
| workspace/processed/ | Optional location for processed files |
| workspace/config/ | Runtime policy configuration |
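The layout above can be created with a few lines of Python. This is a minimal sketch; the helper name `create_workspace` is hypothetical, and the root location is whatever host path you choose to mount into the containers.

```python
from pathlib import Path

# Directory names taken from the workspace structure described above.
WORKSPACE_DIRS = ("inbox", "memory", "logs", "processed", "config")


def create_workspace(root):
    """Create the workspace directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in WORKSPACE_DIRS:
        path = root / name
        # exist_ok makes the call safe to repeat on an existing workspace.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `create_workspace("workspace")` creates the five directories relative to the current directory and leaves existing contents untouched.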
The shared workspace is what turns isolated containers into a coordinated local AI runtime. Hermes can observe files created on the host, use Ollama to process them, store memory in Qdrant, and write results back to persistent storage.
Persistent AI systems are long-running systems. They don’t wait for a single prompt and then exit. Instead, they monitor runtime state and react when something changes.
Hermes starts with a filesystem watcher:
[New document] -> [Filesystem event] -> [Hermes orchestration] -> [Document processing]
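The watcher stage can be sketched with a simple polling loop. The text above doesn't say whether Hermes polls the directory or subscribes to operating system notifications, so treat polling as a stand-in chosen for this example. The function names and the `max_cycles` parameter are hypothetical.

```python
import time
from pathlib import Path


def scan_new_files(inbox, seen):
    """Return files in inbox that have not been reported before."""
    new_files = []
    for path in sorted(Path(inbox).iterdir()):
        if path.is_file() and path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


def watch(inbox, handle, interval=2.0, max_cycles=None):
    """Poll inbox and call handle(path) once for each new file.

    max_cycles=None runs until the process stops; a number bounds the
    loop, which is convenient for testing.
    """
    seen = set()
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for path in scan_new_files(inbox, seen):
            handle(path)
        cycles += 1
        time.sleep(interval)
```

Calling `watch("workspace/inbox", print)` prints each new file path as it appears. In a real runtime, `handle` would pass the file to the document processing stage.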
As you add capabilities, the workflow grows:
[New document]
-> [CPU watcher]
-> [Document parsing]
-> [GPU summarization]
-> [GPU embedding]
-> [Qdrant memory]
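One way to picture the expanded workflow is as a single function that receives the inference and storage steps as arguments. This is a sketch under assumptions: the function name, the record fields, and the 200-character excerpt length are illustrative, and embedding the summary (rather than the full document text) follows the order of the flow above rather than a confirmed implementation detail.

```python
def process_document(path, text, summarize, embed, store):
    """Run one document through summarize -> embed -> store.

    summarize, embed, and store are callables supplied by the caller,
    so the orchestration logic stays independent of Ollama and Qdrant.
    """
    summary = summarize(text)      # GPU summarization via the chat model
    vector = embed(summary)        # GPU embedding via the embedding model
    record = {
        "path": path,
        "summary": summary,
        "excerpt": text[:200],     # assumption: keep a short source excerpt
        "vector": vector,
    }
    store(record)                  # persist to vector memory
    return record
```

Passing the service calls in as arguments keeps the CPU-side orchestration separate from the GPU-side model work, and lets you test the flow with stub functions before any service is running.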
This event-driven design is important because it shows how AI systems become continuous local runtimes. The model is only one part of the system. The surrounding runtime decides when to call the model, what context to provide, where to store results, and how later workflows can reuse those results.
Semantic memory gives the runtime a way to retain information over time.
| Flow | Runtime path |
|---|---|
| Store memory | [Document] -> [Summary] -> [Embedding] -> [Qdrant vector storage] |
| Retrieve memory | [Question] -> [Query embedding] -> [Similarity search] -> [Contextual response] |
This is different from storing plain text files and searching for keywords. Vector search allows the runtime to retrieve content based on semantic similarity. For example, a question about “CPU scheduling” can retrieve a document that discusses “runtime orchestration” even if the exact words are different.
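To make the idea of semantic similarity concrete, the sketch below ranks stored vectors by cosine similarity to a query vector. Qdrant performs this search internally; the distance metric configured for the collection isn't stated here, so cosine similarity is an assumption. The two-dimensional vectors in the usage note are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, memories, k=3):
    """Return the names of the k stored vectors most similar to query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in memories.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

For example, with `memories = {"runtime orchestration": [0.9, 0.1], "lunch menu": [0.0, 1.0]}`, the call `top_k([1.0, 0.0], memories, k=1)` selects `"runtime orchestration"`, because its vector points in nearly the same direction as the query even though no keywords are compared.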
In the final stage of this Learning Path, you’ll add autonomous workspace cognition.
Instead of responding only when a new file appears or when a query is submitted, Hermes periodically reviews the accumulated semantic memory and generates a workspace-level summary.
The cognition workflow is:
[Semantic memory] -> [Scheduled analysis] -> [Workspace summary] -> [Runtime insights]
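The scheduled analysis step can be sketched as a timing check plus a prompt builder. Both helpers are hypothetical, and the prompt wording is an assumption rather than the prompt Hermes uses.

```python
import time


def summary_due(last_run, interval_seconds, now=None):
    """Return True when the next workspace summary should be generated."""
    if last_run is None:
        return True  # no summary has been generated yet
    current = time.monotonic() if now is None else now
    return current - last_run >= interval_seconds


def build_summary_prompt(memory_summaries):
    """Combine stored document summaries into one workspace-level prompt."""
    bullet_list = "\n".join(f"- {item}" for item in memory_summaries)
    return (
        "Review these document summaries and describe the overall "
        "state of the workspace:\n" + bullet_list
    )
```

A background loop could call `summary_due` on each pass and, when it returns `True`, send the prompt to the chat model and write the result to the workspace.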
Runtime behavior is controlled by a configuration file:
/workspace/config/runtime.json
This file allows the runtime to adjust settings such as supported file extensions, retrieval depth, summary interval, and summary output path without hardcoding every behavior into the agent.
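A configuration loader for this file might look like the sketch below. The key names and default values are assumptions chosen to match the settings listed above. The actual schema of `runtime.json` is defined later in the Learning Path and may differ.

```python
import json
from pathlib import Path

# Assumption: illustrative keys and defaults, not the real schema.
DEFAULTS = {
    "supported_extensions": [".txt", ".md"],
    "retrieval_depth": 3,
    "summary_interval_seconds": 300,
    "summary_output_path": "/workspace/memory/workspace_summary.md",
}


def load_runtime_config(path="/workspace/config/runtime.json"):
    """Load runtime settings, falling back to defaults for missing keys."""
    config = dict(DEFAULTS)
    config_path = Path(path)
    if config_path.is_file():
        with config_path.open(encoding="utf-8") as handle:
            config.update(json.load(handle))
    return config
```

Merging the file over a set of defaults means the runtime still starts when the file is missing or lists only the settings you want to change.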
This Learning Path highlights heterogeneous AI computing. The CPU and GPU both matter, but they perform different roles.
The Arm Grace CPU handles the persistent runtime side: watching the filesystem, scheduling background loops, coordinating services, parsing documents, managing metadata, and keeping long-running processes alive. The Blackwell GPU handles model execution, including LLM inference, token generation, summarization, embedding generation, and contextual reasoning.
This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the runtime organized and continuously operating.
You’ve now learned about the persistent AI runtime architecture that you’ll be building on NVIDIA DGX Spark. You’ve explored the different components and workflows of this architecture, and how the CPU and GPU perform different roles.
Next, you’ll build the DGX Spark runtime foundation: Docker, GPU-enabled containers, the shared workspace, and the initial Ollama, Qdrant, and Open WebUI services.