In this section, you’ll explore how the Grace CPU executes Armv9 vector instructions during quantized LLM inference.
Process Watch helps you observe Neon SIMD instruction execution on the Grace CPU and understand why the SVE and SVE2 columns remain at zero in this configuration. This demonstrates how Armv9 vector execution is used in AI workloads and illustrates the shift from traditional fixed-width SIMD pipelines toward scalable vector computation.
First, install the required packages:
sudo apt update
sudo apt install -y git cmake build-essential libncurses-dev libtinfo-dev
Now clone and build Process Watch:
cd ~
git clone --recursive https://github.com/intel/processwatch.git
cd processwatch
./build.sh
sudo ln -s ~/processwatch/processwatch /usr/local/bin/processwatch
Process Watch requires elevated privileges to access kernel performance counters and eBPF features.
Run the following commands to enable the required permissions:
sudo setcap CAP_PERFMON,CAP_BPF=+ep ./processwatch
sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl -w kernel.unprivileged_bpf_disabled=0
These commands grant Process Watch access to performance monitoring and eBPF tracing capabilities.
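You can confirm that the capabilities were applied to the binary with getcap, which is provided by the libcap2-bin package on Ubuntu:
getcap ./processwatch
The output should list cap_perfmon and cap_bpf with the effective and permitted flags set.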
Verify the installation:
./processwatch --help
You should see a usage summary similar to:
usage: processwatch [options]
options:
-h Displays this help message.
-v Displays the version.
-i <int> Prints results every <int> seconds.
-n <num> Prints results for <num> intervals.
-c Prints all results in CSV format to stdout.
-p <pid> Only profiles <pid>.
-m Displays instruction mnemonics, instead of categories.
-s <samp> Profiles instructions with a sampling period of <samp>. Defaults to 100000 instructions (1 in 100000 instructions).
-f <filter> Can be used multiple times. Defines filters for columns. Defaults to 'FPARMv8', 'NEON', 'SVE' and 'SVE2'.
-a Displays a column for each category, mnemonic, or extension. This is a lot of output!
-l Prints a list of all available categories, mnemonics, or extensions.
-d Prints only debug information.
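Before attaching to a workload, you can also use the documented -l option to list every category, mnemonic, and extension that Process Watch can report:
./processwatch -l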
Next, run a quantized TinyLlama model on the Grace CPU to generate the instruction activity that Process Watch will observe.
Use the same CPU-only llama.cpp build created in the previous section:
cd ~/llama.cpp/build-cpu/bin
./llama-cli \
-m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
-ngl 0 \
-t 20 \
-p "Explain the benefits of vector processing in modern Arm CPUs."
Keep this terminal running while the model generates text. In a second terminal, attach Process Watch to the running llama.cpp process to observe its live instruction activity.
If only one llama-cli process is running, you can launch Process Watch directly without looking up the PID by hand:
sudo processwatch -p $(pgrep llama-cli)
If multiple processes are running, first identify the correct process ID:
pgrep llama-cli
Then attach Process Watch to monitor the instruction mix of this process:
sudo processwatch -p <LLAMA-CLI-PID>
Replace <LLAMA-CLI-PID> with the actual process ID from the previous command.
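You can also combine the documented options to capture the instruction mix for later analysis. For example, the following samples every two seconds for ten intervals and writes CSV output to a file (the filename is just an example):
sudo processwatch -p <LLAMA-CLI-PID> -i 2 -n 10 -c > llama-instmix.csv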
Note that the -l option prints Process Watch's list of instruction categories, mnemonics, and extensions; it does not list system processes, so it will not show user-level tasks such as llama-cli.
Use pgrep, ps -ef | grep llama, or htop to identify process IDs before attaching.
Process Watch displays a live instruction breakdown similar to the following:
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          5.07     15.23    0.00     0.00     100.00   29272
72930    llama-cli    5.07     15.23    0.00     0.00     100.00   29272
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.57     9.95     0.00     0.00     100.00   69765
72930    llama-cli    2.57     9.95     0.00     0.00     100.00   69765
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          1.90     6.61     0.00     0.00     100.00   44249
72930    llama-cli    1.90     6.61     0.00     0.00     100.00   44249
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.60     10.16    0.00     0.00     100.00   71049
72930    llama-cli    2.60     10.16    0.00     0.00     100.00   71049
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.12     7.56     0.00     0.00     100.00   68553
72930    llama-cli    2.12     7.56     0.00     0.00     100.00   68553
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.52     9.40     0.00     0.00     100.00   65339
72930    llama-cli    2.52     9.40     0.00     0.00     100.00   65339
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.34     7.76     0.00     0.00     100.00   42015
72930    llama-cli    2.34     7.76     0.00     0.00     100.00   42015
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.66     9.77     0.00     0.00     100.00   74616
72930    llama-cli    2.66     9.77     0.00     0.00     100.00   74616
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.15     7.06     0.00     0.00     100.00   58496
72930    llama-cli    2.15     7.06     0.00     0.00     100.00   58496
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.61     9.34     0.00     0.00     100.00   73365
72930    llama-cli    2.61     9.34     0.00     0.00     100.00   73365
PID      NAME         FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL          2.52     8.37     0.00     0.00     100.00   26566
72930    llama-cli    2.52     8.37     0.00     0.00     100.00   26566
Here is an interpretation of the values:
- FPARMv8 is the share of sampled instructions that are scalar Armv8 floating-point operations, a few percent in every interval.
- NEON is the share of Neon SIMD instructions, roughly 7-15% of the sampled instructions, which represent the vectorized work in the quantized kernels.
- SVE and SVE2 stay at 0.00, meaning no scalable vector instructions were sampled.
- %TOTAL and TOTAL show the share and count of sampled instructions attributed to the process in each interval; the remaining instructions are scalar integer, load/store, and branch operations that fall outside the default filter columns.
This confirms that the Grace CPU performs quantized inference primarily using NEON.
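If you want to see which specific instructions make up the Neon share, you can rerun Process Watch with the documented -m option, which displays instruction mnemonics instead of categories:
sudo processwatch -p $(pgrep llama-cli) -m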
Although the Grace CPU supports SVE and SVE2, its vector length is 16 bytes (128 bits), the same width as a Neon register.
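You can confirm that the CPU advertises these extensions by checking the feature flags in /proc/cpuinfo; the exact set of sve* flags depends on the kernel version:
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep sve
Flags such as sve and sve2 show that the extensions are implemented, but they say nothing about the vector length.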
Next, verify the kernel's default SVE vector length:
cat /proc/sys/abi/sve_default_vector_length
The value is reported in bytes, so the output is similar to:
16
Even if you try to increase the length, it cannot be changed:
echo 256 | sudo tee /proc/sys/abi/sve_default_vector_length
The write is accepted, but the kernel clamps the value to the maximum vector length supported by the hardware, so the setting stays at 16 bytes. This behavior is expected: SVE is available on Grace, but its vector length is fixed at 128 bits.
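Read the file back to confirm that the value is unchanged:
cat /proc/sys/abi/sve_default_vector_length
The output is still 16.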
Future kernel, compiler, and library updates may make greater use of SVE2 instructions in workloads like this one.
You have completed the Learning Path for analyzing large language model inference on the DGX Spark platform with Arm-based Grace CPUs and Blackwell GPUs.
Throughout this Learning Path, you have learned how to:
- Build llama.cpp for CPU-only inference and run a quantized TinyLlama model on the Grace CPU
- Install Process Watch and grant it the permissions it needs for performance monitoring and eBPF tracing
- Attach Process Watch to a running process and read its live instruction mix
- Interpret the FPARMv8, NEON, SVE, and SVE2 columns to understand how Armv9 vector execution is used during quantized inference
- Check the SVE vector length exposed by the kernel and understand why it is fixed at 128 bits on Grace
With these skills, you are equipped to analyze the instruction-level behavior of AI workloads on Arm-based platforms.