Who is this for?
This is an introductory topic for AI practitioners, performance engineers, and system architects who want to learn how to deploy and optimize quantized large language models (LLMs) on NVIDIA DGX Spark systems powered by the Grace-Blackwell (GB10) architecture.
What will you learn?
Upon completion of this Learning Path, you will be able to:
- Describe the Grace-Blackwell (GB10) architecture and its support for efficient AI inference
- Build CUDA-enabled and CPU-only versions of llama.cpp for flexible deployment
- Validate the functionality of both builds on the DGX Spark platform
- Analyze how Armv9 SIMD instructions accelerate quantized LLM inference on the Grace CPU
Prerequisites
Before starting, you will need the following:
- Access to an NVIDIA DGX Spark system with at least 15 GB of available disk space
- Familiarity with command-line interfaces and basic Linux operations
- Understanding of CUDA programming basics and GPU/CPU compute concepts
- Basic knowledge of quantized large language models (LLMs) and machine learning inference
- Experience building software from source using CMake and make; a minimal llama.cpp build sketch follows this list
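As a preview of the build workflow covered later in this Learning Path, the sketch below shows how the two llama.cpp variants are typically configured with CMake. It assumes a recent upstream llama.cpp checkout; the `GGML_CUDA` option name and repository URL may differ in older versions or in the exact revision used later.

```bash
# Clone llama.cpp (assumes the upstream repository; pin a specific
# revision if the Learning Path specifies one)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU-only build: exercises ggml's Armv9 SIMD code paths on the Grace CPU
cmake -B build-cpu
cmake --build build-cpu --config Release -j

# CUDA-enabled build: offloads inference work to the Blackwell GPU
# (GGML_CUDA is the flag in recent llama.cpp versions; older releases
# used a different option name)
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
```

Keeping the two builds in separate directories (`build-cpu` and `build-cuda` here) lets you run and compare CPU-only and GPU-accelerated inference side by side on the same system.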