Optimize SIMD code with vectorization-friendly data layout: Increase complexity

What if you have more complex calculations to perform?

For example, you want to add a bounding box to your simulation so the particles bounce on the borders.

You can modify the previous code to check the boundaries.

Copy your simulation1.c to a new file named simulation2.c so you can continue modifications and save the original.

    

        
        
cp simulation1.c simulation2.c

Edit simulation2.c and replace the simulate_objects() function with the one below. Also add the new ctr4 struct and box constant below.

    

        
        
typedef struct ctr4 {
  uint32_t x, y, z, t;
} ctr4_t;

const vec4_t box = { 10.0f, 10.0f, 10.0f, 10.0f };

void simulate_objects(object_t *objects, float duration, float step) {

  float current_time = 0;
  ctr4_t collisions = { 0, 0, 0, 0 };

  while (current_time < duration) {
    for (size_t i=0; i < N; i++) {
      objects[i].position.x += objects[i].velocity.x * step;
      objects[i].position.y += objects[i].velocity.y * step;
      objects[i].position.z += objects[i].velocity.z * step;
      objects[i].position.t += objects[i].velocity.t * step;

      // check the boundaries
      if (objects[i].position.x < -box.x || objects[i].position.x > box.x) {
        objects[i].velocity.x = -objects[i].velocity.x;
        collisions.x++;
      }
      if (objects[i].position.y < -box.y || objects[i].position.y > box.y) {
        objects[i].velocity.y = -objects[i].velocity.y;
        collisions.y++;
      }
      if (objects[i].position.z < -box.z || objects[i].position.z > box.z) {
        objects[i].velocity.z = -objects[i].velocity.z;
        collisions.z++;
      }
      if (objects[i].position.t < -box.t || objects[i].position.t > box.t) {
        objects[i].velocity.t = -objects[i].velocity.t;
        collisions.t++;
      }
    }
    current_time += step;
  }
  // the counters are uint32_t, so print them as unsigned; the t counter is tracked but not reported
  printf("Total border collisions: x: %u, y: %u, z: %u\n", collisions.x, collisions.y, collisions.z);
}

This is still a scalar approach, but the code keeps the vec4 struct, and the boundary check also runs on the fourth element, t. That extra work is not necessarily a waste of cycles: handling four elements per object gives the compiler an opportunity to autovectorize the loop. You should verify for yourself whether it does.
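To see what a single boundary check does in isolation, here is a minimal sketch of the per-axis logic from the loop above. The helper name step_axis() is hypothetical and is not part of simulation2.c; it only mirrors the update, the comparison against the box, and the collision count for one coordinate.

```c
#include <stdint.h>

/* Hypothetical helper (not in simulation2.c): advance one coordinate by one
   time step, then reflect the velocity if the new position lies outside
   [-limit, limit]. Returns 1 if a collision was counted, 0 otherwise.
   As in the original loop, the position itself is not clamped. */
uint32_t step_axis(float *position, float *velocity, float step, float limit) {
  *position += *velocity * step;
  if (*position < -limit || *position > limit) {
    *velocity = -*velocity;
    return 1;
  }
  return 0;
}
```

For example, a particle at 9.5 moving at 1.0 with a step of 1.0 and a limit of 10.0 overshoots to 10.5, has its velocity flipped to -1.0, and returns to 9.5 on the following step. The overshoot is a property of the original check: the particle is turned around after it has crossed the border, not placed back on it.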

First, compile the code with -O2:

    

        
        
gcc -O2 -Wall simulation2.c -o simulation2

Run the new executable:

    

        
        
./simulation2

The resulting output is:

    

        
Total border collisions: x: 250123, y: 249711, z: 249844
elapsed time: 36.700929

Next, compile with -O3:

    

        
        
gcc -O3 -Wall simulation2.c -o simulation2

Run again:

    

        
        
./simulation2

The collision counts are the same as before, but the elapsed time is lower:

    

        
Total border collisions: x: 250123, y: 249711, z: 249844
elapsed time: 28.926221

Using -O3 reduces the execution time by about 21%, from 36.70 to 28.93. This is an improvement, but you might expect more from vectorized code. The assembly output shows why the gain is limited.
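The percentage follows directly from the two elapsed times reported above. As a small sketch of the arithmetic, with the hypothetical helper name percent_reduction():

```c
/* Hypothetical helper: relative reduction in run time, in percent,
   when going from `before` to `after` (both in the same time unit). */
double percent_reduction(double before, double after) {
  return (before - after) / before * 100.0;
}
```

Calling percent_reduction(36.700929, 28.926221) gives roughly 21.2, which is the figure quoted for -O3 relative to -O2. Your own timings will differ depending on the machine and compiler version.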

Check the assembly output for -O2. It doesn’t show any SIMD instructions, which is expected with -O2.

Now observe the assembly output for -O3 and in particular these lines:

    

        
        
     c48:       6ea0e4ea        fcmgt   v10.4s, v7.4s, v0.4s
     c4c:       6eb0e400        fcmgt   v0.4s, v0.4s, v16.4s
     c50:       6ea2e4e9        fcmgt   v9.4s, v7.4s, v2.4s
     c54:       6ea5e4eb        fcmgt   v11.4s, v7.4s, v5.4s
     c58:       6ea4e4ec        fcmgt   v12.4s, v7.4s, v4.4s
     c5c:       6eb0e442        fcmgt   v2.4s, v2.4s, v16.4s
     c60:       6eb0e4a5        fcmgt   v5.4s, v5.4s, v16.4s
     c64:       6eb0e484        fcmgt   v4.4s, v4.4s, v16.4s
     c68:       6ea1e4ed        fcmgt   v13.4s, v7.4s, v1.4s
     c6c:       6ea3e4ee        fcmgt   v14.4s, v7.4s, v3.4s
     c70:       6ea6e4ef        fcmgt   v15.4s, v7.4s, v6.4s

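These are all fcmgt instructions, the vector floating-point "compare greater than" operation. Even without looking at the rest of the code, a run of eleven consecutive compares indicates that something is wrong with this optimization.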
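If you are not familiar with fcmgt, the following scalar model may help when reading the listing. It is an illustration written for this explanation, not compiler output, and the name fcmgt_model() is hypothetical. It mimics one fcmgt on four single-precision lanes (the .4s arrangement): each result lane becomes all ones when the first operand is greater than the second, and zero otherwise.

```c
#include <stdint.h>

/* Scalar model of `fcmgt vd.4s, vn.4s, vm.4s` (illustration only):
   result[i] is a 32-bit mask, all ones if a[i] > b[i], else zero.
   A comparison involving NaN yields zero, as with the C `>` operator. */
void fcmgt_model(uint32_t result[4], const float a[4], const float b[4]) {
  for (int i = 0; i < 4; i++) {
    result[i] = (a[i] > b[i]) ? UINT32_MAX : 0u;
  }
}
```

Comparing a position vector against the box limits this way produces one mask per comparison. The boundary checks in simulate_objects() need two such comparisons per object (one against the lower limit and one against the upper limit), and the masks still have to be turned into velocity updates and collision counts afterwards.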

This code is one of the cases that compilers cannot yet autovectorize well. Both GCC and Clang fall short of the quality that hand-written SIMD code can reach here.

You should not blame the compilers for failing to autovectorize a piece of code. Autovectorization is still improving, and compilers are already quite proficient at vectorizing many loops. It is always worth investigating what the compiler can do for you, and resorting to hand-written code only when you are not satisfied with the result.

In the next section you will see a manual optimization approach.
