Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm.
This section is organized into three main steps: verifying your CPU, checking your operating system, and confirming your GPU and CUDA toolkit setup. You’ll also find additional context and technical details throughout, should you wish to explore the platform’s capabilities more deeply.
Before running LLM workloads, it’s helpful to understand more about the CPU you’re working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference.
Start by checking your system’s CPU configuration:
```bash
lscpu
```
The output is similar to:
```output
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: ARM
Model name: Cortex-X925
Model: 1
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 1
Stepping: r0p1
CPU(s) scaling MHz: 89%
CPU max MHz: 4004.0000
CPU min MHz: 1378.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Model name: Cortex-A725
Model: 1
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 1
Stepping: r0p1
CPU(s) scaling MHz: 99%
CPU max MHz: 2860.0000
CPU min MHz: 338.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
Caches (sum of all):
L1d: 1.3 MiB (20 instances)
L1i: 1.3 MiB (20 instances)
L2: 25 MiB (20 instances)
L3: 24 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
```
If you see output like this, great! Your system is using Armv9 cores, which are well suited to quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it a strong fit for quantized LLM inference and tensor operations.
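If you prefer a scriptable check, the sketch below greps /proc/cpuinfo for the extensions most relevant to quantized inference. This assumes the standard aarch64 Features line, which Ubuntu's kernel exposes:

```bash
# Confirm the vector and matrix-math extensions used by quantized kernels.
for feature in sve2 i8mm bf16; do
    if grep -qm1 "\b${feature}\b" /proc/cpuinfo; then
        echo "${feature}: present"
    else
        echo "${feature}: missing"
    fi
done
```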
The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference:
| Category | Specification | Description/Impact for LLM Inference |
|---|---|---|
| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions |
| Core Configuration | 20 cores total - 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency |
| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency |
| Clock Frequency | Up to 4.0 GHz (Cortex-X925); up to 2.86 GHz (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration |
| Cache Hierarchy | L1: 1.3 MiB × 20; L2: 25 MiB × 20; L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads |
| Instruction Set Features | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations |
| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads |
| Security and Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks |
The Grace CPU's SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it well suited to quantized LLM workloads, providing a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing.
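Because the core layout is heterogeneous, thread placement can matter for CPU-only inference. As a hypothetical illustration, the sketch below pins an inference run to the performance cores. It assumes the llama-cli binary you will build in the next section, a placeholder model file, and that the ten Cortex-X925 performance cores enumerate as CPUs 0-9 (confirm the numbering with `lscpu -e` before relying on it):

```bash
# Hypothetical: pin llama.cpp inference to the ten Cortex-X925 performance
# cores, assuming they enumerate as CPUs 0-9 (check with: lscpu -e).
taskset -c 0-9 ./llama-cli -m model.gguf -t 10 -p "Hello"
```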
Next, verify the operating system running on your DGX Spark with the following command:
```bash
lsb_release -a
```
The expected output is similar to:
```output
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS
Release: 24.04
Codename: noble
```
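If `lsb_release` is not available on a system, the same information can be read from /etc/os-release, which every modern Ubuntu release provides:

```bash
# Read the distribution name and version directly from /etc/os-release.
. /etc/os-release
echo "Running ${NAME} ${VERSION}"
```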
This confirms that your DGX Spark runs Ubuntu 24.04 LTS, a developer-friendly Linux distribution that provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities, making it an ideal environment for building and deploying quantized LLM workloads.
Nice work! With the operating system verified, you can move on to the next step: checking the GPU.
After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following:
```bash
nvidia-smi
```
You will see output similar to:
```output
Wed Oct 22 09:26:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 32C P8 4W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB |
| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB |
+-----------------------------------------------------------------------------------------+
```
The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
The table below provides more explanation of the nvidia-smi output:
| Category | Specification (from `nvidia-smi`) | Description/Impact for LLM Inference |
|---|---|---|
| GPU name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace Blackwell Superchip |
| Driver version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility |
| CUDA version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads |
| Architecture / Compute capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs |
| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space |
| Power & Thermal status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle |
| GPU utilization | 0% (idle) | Indicates no active compute workloads; the GPU is ready for new inference jobs |
| Memory usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed |
| Persistence mode | On | Ensures the GPU remains initialized and ready for rapid inference startup |
Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference.
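If you want to check these values from a script instead of reading the full table, `nvidia-smi` also offers a query interface. A minimal sketch (note that the compute_cap field requires a reasonably recent driver; the 580-series qualifies):

```bash
# Print only the fields needed to confirm GPU readiness, as CSV.
nvidia-smi --query-gpu=name,driver_version,compute_cap,temperature.gpu,utilization.gpu \
           --format=csv,noheader
```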
To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. Running `nvcc --version` confirms that the CUDA compiler is available and compatible with CUDA 13, which ensures that CMake can correctly detect and compile the GPU-accelerated components. Verifying the CUDA toolkit now means you can later build GPU-enabled versions of llama.cpp for maximum performance.
```bash
nvcc --version
```
You will see output similar to:
```output
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
```
The `nvcc` compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
This output confirms that the CUDA 13 toolkit is installed and ready for GPU compilation. If the command is missing or reports an older version (for example, 12.x), update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121).
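In automation, you may prefer to fail early rather than discover a missing or outdated toolkit mid-build. A minimal sketch, assuming `nvcc` is on PATH whenever the toolkit is installed:

```bash
#!/usr/bin/env bash
# Fail early if the CUDA compiler is missing or older than release 13.
if ! command -v nvcc >/dev/null 2>&1; then
    echo "nvcc not found - install the CUDA 13 toolkit" >&2
    exit 1
fi
# Extract the major release number from a line like:
# "Cuda compilation tools, release 13.0, V13.0.88"
release=$(nvcc --version | sed -n 's/.*release \([0-9]*\)\..*/\1/p')
if [ "${release:-0}" -lt 13 ]; then
    echo "CUDA ${release}.x detected - CUDA 13.0 or later is required for sm_121" >&2
    exit 1
fi
echo "CUDA toolkit release ${release} looks good"
```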
At this point, you have verified that:

- The Grace CPU exposes 20 Armv9-A cores with the SVE2, BF16, and I8MM extensions needed for quantized inference
- The operating system is Ubuntu 24.04 LTS
- The Blackwell GPU is recognized, with driver 580.95.05 and CUDA 13.0 support
- The CUDA 13 toolkit, including the `nvcc` compiler, is installed

Your DGX Spark environment is now fully prepared for the next section, which walks you through building and configuring both the CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.