Who is this for?
This is an introductory topic for AI practitioners, performance engineers, and system architects who want to learn how to deploy and optimize quantized large language models (LLMs) on NVIDIA DGX Spark systems powered by the Grace-Blackwell (GB10) architecture.
What will you learn?
Upon completion of this Learning Path, you will be able to:
- Describe the Grace-Blackwell (GB10) architecture and its support for efficient AI inference
- Build CUDA-enabled and CPU-only versions of llama.cpp for flexible deployment
- Validate the functionality of both builds on the DGX Spark platform
- Analyze how Armv9 SIMD instructions accelerate quantized LLM inference on the Grace CPU
Prerequisites
Before starting, you will need the following:
- Access to an NVIDIA DGX Spark system with at least 15 GB of available disk space
- Familiarity with command-line interfaces and basic Linux operations
- Understanding of CUDA programming basics and GPU/CPU compute concepts
- Basic knowledge of quantized large language models (LLMs) and machine learning inference
- Experience building software from source using CMake and make; a minimal llama.cpp build sketch follows this list
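As a preview of the build workflow covered later in this Learning Path, the sketch below shows how the two llama.cpp variants are typically configured with CMake. It assumes a recent upstream llama.cpp checkout; the `GGML_CUDA` option name and repository URL may differ in older versions or in the exact revision used later.

```bash
# Clone llama.cpp (assumes the upstream repository; pin a specific
# revision if the Learning Path specifies one)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU-only build: exercises ggml's Armv9 SIMD code paths on the Grace CPU
cmake -B build-cpu
cmake --build build-cpu --config Release -j

# CUDA-enabled build: offloads inference work to the Blackwell GPU
# (GGML_CUDA is the flag in recent llama.cpp versions; older releases
# used a different option name)
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
```

Keeping the two builds in separate directories (`build-cpu` and `build-cuda` here) lets you run and compare CPU-only and GPU-accelerated inference side by side on the same system.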