Your numbers will vary by processor. Compare against the two baselines you recorded earlier:
| Implementation | Approx. throughput | Speedup vs. scalar |
|---|---|---|
| Scalar (original) | ~380 MB/s | 1x |
| Scalar NMAX | ~2,000 MB/s | ~5x |
| SVE | ~21,000 MB/s | ~55x |
The SVE version is roughly 10x faster than the NMAX scalar version and about 55x faster than the original. The exact ratio depends on your SVE vector length. You can also run the code on a Graviton3-based instance to try a processor with 256-bit SVE vectors and compare the results. The 256-bit vector length on Graviton3 delivers higher throughput on this loop than the 128-bit vector length on Graviton4, even though Graviton3 is slower than Graviton4 on the scalar versions.
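If you are unsure which vector length a given instance provides, you can query it at runtime with the `svcntb()` intrinsic, which returns the number of bytes in an SVE vector. This is a minimal standalone sketch, separate from the checksum code:

```c
#include <arm_sve.h>
#include <stdio.h>

int main(void)
{
    // svcntb() returns the number of bytes per SVE vector:
    // 16 on Graviton4 (128-bit SVE) and 32 on Graviton3 (256-bit SVE).
    printf("SVE vector length: %lu bits\n",
           (unsigned long)(svcntb() * 8));
    return 0;
}
```

Compile it with an SVE-enabled target, for example `gcc -O2 -march=armv8-a+sve vl.c -o vl` (the file name here is just an example).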
By understanding the generated assembly, you can verify that the compiler is producing the instructions you expect.
Ask your assistant to explain the inner loop assembly code. Your prompt can be similar to:
disassemble ~/adler32-sve/adler32-test and explain the assembly code for the inner loop.
The response explains the mapping of the C code to the assembly instructions and the intrinsics used.
A partial example response is:
Summary

The entire inner loop is just 7 instructions per vector-width of bytes:

| Instruction | Purpose |
|---|---|
| `whilelo` | Generate predicate for this iteration |
| `ld1b` | Predicated load of bytes |
| `add` (scalar) | Advance loop counter |
| `udot` (weights) | Weighted sum for `b` |
| `udot` (ones) | Simple sum for `a` |
| `add` (vector) | Decrement weights |
| `cmp` + `b.hi` | Loop control |
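To connect the table back to source code, here is a sketch of an inner loop built from the matching intrinsics. It is illustrative rather than a copy of the implementation you built earlier: the function name is made up, it regenerates the weight vector with `svindex` instead of decrementing it with a vector add, and it assumes the vector length is at most 255 bytes so the `udot` weights fit in unsigned 8-bit lanes (true for all current SVE hardware):

```c
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

#define MOD_ADLER 65521u

/* Illustrative sketch: one block of len <= NMAX bytes, so the 32-bit
   lane accumulators cannot overflow before the deferred modulo. */
static void adler32_block_sve(const uint8_t *buf, size_t len,
                              uint32_t *a, uint32_t *b)
{
    svuint32_t vs1 = svdup_u32(0);  /* lane-wise partial sums for a */
    svuint32_t vs2 = svdup_u32(0);  /* lane-wise weighted sums for b */
    size_t i = 0;
    svbool_t pg = svwhilelt_b8_u64(i, len);  /* emits whilelo */

    while (svptest_any(svptrue_b8(), pg)) {
        uint64_t k = svcntp_b8(svptrue_b8(), pg);  /* active bytes this pass */
        svuint8_t data = svld1_u8(pg, buf + i);    /* ld1b: inactive lanes read as zero */

        /* Bytes summed in earlier iterations each count k more times
           toward b, so fold k * vs1 into vs2 before this chunk. */
        vs2 = svmla_n_u32_x(svptrue_b32(), vs2, vs1, (uint32_t)k);

        /* Weights {k, k-1, ..., 1}; lanes past k wrap around, but they
           multiply the zeroed data lanes, so they contribute nothing. */
        svuint8_t w = svindex_u8((uint8_t)k, (uint8_t)-1);

        vs1 = svdot_u32(vs1, data, svdup_u8(1));  /* udot (ones): sum for a */
        vs2 = svdot_u32(vs2, data, w);            /* udot (weights): sum for b */

        i += svcntb();                            /* scalar add: advance counter */
        pg = svwhilelt_b8_u64(i, len);
    }

    uint32_t s1 = (uint32_t)svaddv_u32(svptrue_b32(), vs1);
    uint32_t s2 = (uint32_t)svaddv_u32(svptrue_b32(), vs2);
    *b = (*b + (uint32_t)len * *a + s2) % MOD_ADLER;
    *a = (*a + s1) % MOD_ADLER;
}
```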
To see the actual assembly, disassemble the binary:
`objdump -d adler32-test | grep -A 40 "<adler32>"`
Look for the WHILELO and UDOT instructions in the inner loop (the `svwhilelt` intrinsic compiles to WHILELO for unsigned counters). If you see them, the SVE code path is active.
You’ve now completed the full optimization journey for Adler-32 on Arm Neoverse using an AI assistant and the Arm MCP server. You started with a simple scalar implementation, measured its baseline performance, and used the Arm MCP server to learn SVE concepts. You then applied the NMAX modulo-deferral technique to prepare the algorithm for vectorization. From there, you built a vector-length-agnostic SVE implementation, verified its correctness, and measured the resulting performance improvement. Finally, you read the generated assembly to confirm that the compiler emitted the SVE instructions you expected.
You can apply the process you followed in this Learning Path directly to other scalar loops in your own projects.
The Arm MCP server’s intrinsics reference covers all SVE and SVE2 intrinsics. As you encounter more complex loops, you can use the same question-and-answer approach to find the right intrinsics for your specific data types and operations.
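For example, here is how the same vector-length-agnostic pattern — `whilelt` predicate, predicated load, advance by the element count — might transfer to a different scalar loop. The function is hypothetical and simply sums 32-bit integers:

```c
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: the whilelt/predicated-load pattern from the
   Adler-32 loop, reused to sum an array of 32-bit integers. */
uint64_t sum_u32(const uint32_t *src, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;
    svbool_t pg = svwhilelt_b32_u64(i, n);  /* one predicate bit per 32-bit lane */

    while (svptest_any(svptrue_b32(), pg)) {
        svuint32_t v = svld1_u32(pg, src + i);  /* inactive lanes load as zero */
        total += svaddv_u32(pg, v);             /* predicated horizontal add */
        i += svcntw();                          /* advance by lanes per vector */
        pg = svwhilelt_b32_u64(i, n);
    }
    return total;
}
```

The per-iteration horizontal reduction keeps this example short; for peak throughput you would accumulate in a vector register and reduce once after the loop, as the Adler-32 implementation does.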