About this Learning Path

Who is this for?

This is an advanced topic for developers who want to build a Retrieval-Augmented Generation (RAG) pipeline on the NVIDIA DGX Spark platform. You'll learn how Arm-based Grace CPUs handle document retrieval and orchestration, while Blackwell GPUs speed up large language model inference using the open-source llama.cpp REST server. This is a great fit if you're interested in combining Arm CPU management with GPU-accelerated AI workloads.
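The division of labor described above can be sketched roughly as follows. This is an illustrative sketch, not the Learning Path's actual implementation: the server URL, the in-memory document list, and the keyword-overlap retriever are all assumptions (a real pipeline would use an embedding index for retrieval), and the `/completion` endpoint assumes a running llama.cpp `llama-server` instance:

```python
# Sketch of the hybrid pattern: the CPU side retrieves relevant text and
# builds a prompt; generation is delegated to a llama.cpp REST server
# (llama-server) running on the GPU. Endpoint, documents, and retriever
# below are illustrative assumptions only.
import json
import urllib.request

LLAMA_SERVER_URL = "http://localhost:8080/completion"  # assumed llama-server address

# Toy in-memory "document store"; a real pipeline would use an embedding index.
DOCUMENTS = [
    "Grace CPUs are Arm-based processors in NVIDIA superchips.",
    "Blackwell GPUs accelerate large language model inference.",
    "llama.cpp provides a lightweight REST server for LLMs.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval, standing in for a vector search."""
    scored = [(sum(w in doc.lower() for w in query.lower().split()), doc)
              for doc in DOCUMENTS]
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved context and the user question into one prompt."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{ctx}\n\nQuestion: {query}\nAnswer:"

def generate(prompt: str) -> str:
    """Send the prompt to the llama.cpp completion endpoint (GPU side)."""
    req = urllib.request.Request(
        LLAMA_SERVER_URL,
        data=json.dumps({"prompt": prompt, "n_predict": 128}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

if __name__ == "__main__":
    question = "Which GPUs accelerate inference?"
    prompt = build_prompt(question, retrieve(question))
    print(prompt)  # the generation step would call generate(prompt)
```

The key design point is that retrieval and prompt assembly never touch the GPU: only the final `generate` call crosses over to the llama.cpp server, which is exactly the split the Learning Path builds out in full.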

What will you learn?

Upon completion of this Learning Path, you will be able to:

  • Describe how a RAG system combines document retrieval and language model generation
  • Deploy a hybrid CPU-GPU RAG pipeline on the GB10 platform using open-source tools
  • Use the llama.cpp REST server for GPU-accelerated inference with CPU-managed retrieval
  • Build a reproducible RAG application that demonstrates efficient hybrid computing

Prerequisites

Before starting, you will need the following:

  • An NVIDIA DGX Spark system with at least 15 GB of available disk space