Compare the results

Your numbers will vary by processor. Compare against the two baselines you recorded earlier:

| Implementation | Approx. throughput | Speedup vs. scalar |
|-------------------|--------------------|--------------------|
| Scalar (original) | ~380 MB/s | 1x |
| Scalar NMAX | ~2,000 MB/s | ~5x |
| SVE | ~21,000 MB/s | ~55x |

The SVE version is roughly 10x faster than the NMAX scalar version, and about 55x faster than the original. The exact ratio depends on your SVE vector length. You can also try a Graviton3-based instance, which has 256-bit SVE vectors, and compare the results. For this kernel, the 256-bit vector length on Graviton3 outperforms the 128-bit vector length on Graviton4, even though Graviton3 is slower than Graviton4 on the scalar versions.

Ask AI about the inner loop assembly code

By understanding the generated assembly, you can verify that the compiler is producing the instructions you expect.

Ask your assistant to explain the inner loop assembly code. Your prompt can be similar to:

```console
disassemble ~/adler32-sve/adler32-test and explain the assembly code for the inner loop.
```
The response explains the mapping of the C code to the assembly instructions and the intrinsics used.

A partial example response is:

```output
Summary

The entire inner loop is just 7 instructions per vector-width of bytes:

┌──────────────────┬───────────────────────────────────────┐
│ Instruction      │ Purpose                               │
├──────────────────┼───────────────────────────────────────┤
│ `whilelo`        │ Generate predicate for this iteration │
│ `ld1b`           │ Predicated load of bytes              │
│ `add` (scalar)   │ Advance loop counter                  │
│ `udot` (weights) │ Weighted sum for `b`                  │
│ `udot` (ones)    │ Simple sum for `a`                    │
│ `add` (vector)   │ Decrement weights                     │
│ `cmp` + `b.hi`   │ Loop control                          │
└──────────────────┴───────────────────────────────────────┘
```
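As an illustration, an inner loop matching the table above might disassemble to something like the following. This is a hand-written sketch, not output from the tutorial binary; the register assignments and branch label are hypothetical and will differ in your build.

```
.L_loop:
    whilelo p0.b, x3, x2            // predicate covering the bytes left this iteration
    ld1b    z0.b, p0/z, [x1, x3]    // predicated load of one vector of input bytes
    add     x3, x3, x4              // advance the byte counter by the vector length
    udot    z2.s, z0.b, z1.b        // weighted dot product accumulated into b
    udot    z3.s, z0.b, z6.b        // dot product with all-ones accumulated into a
    add     z1.b, z1.b, z5.b        // step the weights down for the next vector
    cmp     x2, x3                  // any bytes remaining?
    b.hi    .L_loop                 // loop while len > counter
```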

To see the actual assembly, disassemble the binary:

```bash
objdump -d adler32-test | grep -A 40 "<adler32>"
```
Look for the WHILELO and UDOT instructions in the inner loop. If you see them, the SVE code path is active.

Note: You can use your AI assistant to debug issues or clarify performance results. However, it's easy to fall into an endless loop of trial and error, as today's assistants can just as easily make things worse.

What you’ve accomplished and what’s next

You’ve now completed the full optimization journey for Adler-32 on Arm Neoverse using an AI assistant and the Arm MCP server. You started with a simple scalar implementation, measured its baseline performance, and used the Arm MCP server to learn SVE concepts. You then applied the NMAX modulo-deferral technique to prepare the algorithm for vectorization. From there, you built a vector-length-agnostic SVE implementation, verified its correctness, measured the resulting performance improvement, and confirmed the vectorization by reading the generated assembly.

You can apply the process you followed in this Learning Path directly to other scalar loops in your own projects:

  1. Establish a correctness test and a performance baseline before changing anything
  2. Ask your AI assistant to guide you and keep explaining along the way
  3. Use the Arm MCP server to look up the specific intrinsics you need, one concept at a time
  4. Validate correctness before measuring performance
  5. Compare against each intermediate version to understand where the speedup comes from

The Arm MCP server’s intrinsics reference covers all SVE and SVE2 intrinsics. As you encounter more complex loops, you can use the same question-and-answer approach to find the right intrinsics for your specific data types and operations.
