Optimize SIMD code with vectorization-friendly data layout: What exactly is data layout?

Optimize SIMD code with vectorization-friendly data layout

Log an issue

Fork and edit

Discuss on Discord

Optimize SIMD code with vectorization-friendly data layout

Data layout basics

Data layout refers to the memory representation of the variables that belong to a data structure or an object. It describes the way the variables are stored in memory.

For example, assume you are creating a 3D simulation program or a game where you want to model the movement of thousands of particles and you have the following data structures for the 3D objects:

    

        
        
// Helper struct of a 3D vector with x, y, z, coordinates
typedef struct vec3 {
  float x, y, z;
} vec3_t;

// The object struct
typedef struct object {
  uint32_t id;
  float weight;
  vec3_t position;
  vec3_t velocity;
  void *model_data;
} object_t;

These are example data structures. Actual simulations have more information per object. Extra information may include additional identifiers, volume information, a pointer to a 3D model for display purposes, boundary information, or even representations of quantum wavefunctions. The details are beyond the scope of this Learning Path, but most software uses the same principle, regardless of whether it’s a game or a molecular dynamics simulation.

The core of the program is the main for loop, which changes the positions and velocities of the objects in tiny amounts and updates them multiple times per second, thereby giving the notion of movement.

In this Learning Path, you will compile and run loops and measure performance.

Use a text editor to copy the code below and save it in a file named simulation1.c

    

        
        
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#include <unistd.h>

// Helper struct of a 3D vector with x, y, z, coordinates
typedef struct vec3 {
  float x, y, z;
} vec3_t;

// The object struct
typedef struct object {
  uint32_t id;
  float weight;
  vec3_t position;
  vec3_t velocity;
  void *model_data;
} object_t;

// Helper function to return a random float number from 0.0f to a.
float randf(float a) {
  return ((float)rand()/(float)(RAND_MAX)) * a;
}

#define N           100000 // number of particles
#define SECONDS     100    // duration in seconds
#define STEPSPERSEC 1000   // steps per second or time resolution

void init_objects(object_t *objects) {
  // Give initial speeds and positions
  for (size_t i=0; i < N; i++) {
    objects[i].id = i;
    objects[i].weight = randf(2.0f);
    // initial positions in box [-10, 10], velocity in [-1, 1]
    objects[i].position.x = randf(20.0) - 10.0f;
    objects[i].position.y = randf(20.0) - 10.0f;
    objects[i].position.z = randf(20.0) - 10.0f;
    objects[i].velocity.x = randf(2.0) - 1.0f;
    objects[i].velocity.y = randf(2.0) - 1.0f;
    objects[i].velocity.z = randf(2.0) - 1.0f;

  }
}

void simulate_objects(object_t *objects, float duration, float step) {

  float current_time = 0;
  while (current_time < duration) {
    // Move the object in steps
    for (size_t i=0; i < N; i++) {
      objects[i].position.x += objects[i].velocity.x * step;
      objects[i].position.y += objects[i].velocity.y * step;
      objects[i].position.z += objects[i].velocity.z * step;
    }
    current_time += step;
  }
}

int main() {
  object_t objects[N];

  init_objects(objects);

  const float duration = SECONDS;
  const float step = 1.0f/STEPSPERSEC;
  struct timeval th_time_start, th_time_end;

  gettimeofday(&th_time_start, NULL);
  simulate_objects(objects, duration, step);
  gettimeofday(&th_time_end, NULL);

  double elapsed;
  elapsed = (th_time_end.tv_sec - th_time_start.tv_sec); // sec
  elapsed += (th_time_end.tv_usec - th_time_start.tv_usec) / 1000000.0; // us to sec

  printf("elapsed time: %f\n", elapsed);
}

Compile the code with GCC:

    

        
        
gcc -O3 simulation1.c -o simulation1

Note

Unless stated, the examples are compiled with gcc version 12. Your results may vary based on your exact compiler version.

Run the program:

    

        
        
./simulation1

The elapsed time is printed. Depending on your hardware, the printed value may may be different.

    

        
        elapsed time: 14.605558

This is not a realistic simulation as there is no collision detection, no bounds checking, no gravity or other forces between the particles. But once you have finished reading it, you will be able to start learning how to change the data layout to help the compiler perform auto vectorization and increase the performance of a loop.

Refer to Learn about Autovectorization for a good introduction to autovectorization.

Use objdump to view the assembly code for the program, look for thesimulate_objects()

    

        
        
objdump -S simulation1 | less

The disassembly for simulate_objects() is:

    

        
        0000000000000b54 <simulate_objects>:
 b54:   1e202018        fcmpe   s0, #0.0
 b58:   1e204006        fmov    s6, s0
 b5c:   1e204023        fmov    s3, s1
 b60:   0e040424        dup     v4.2s, v1.s[0]
 b64:   5400004c        b.gt    b6c <simulate_objects+0x18>
 b68:   d65f03c0        ret
 b6c:   914f4001        add     x1, x0, #0x3d0, lsl #12
 b70:   0f000405        movi    v5.2s, #0x0
 b74:   91002002        add     x2, x0, #0x8
 b78:   91242021        add     x1, x1, #0x908
 b7c:   aa0203e0        mov     x0, x2
 b80:   bd401402        ldr     s2, [x0, #20]
 b84:   9100a000        add     x0, x0, #0x28
 b88:   bc5e0000        ldur    s0, [x0, #-32]
 b8c:   fc5d8001        ldur    d1, [x0, #-40]
 b90:   1f030040        fmadd   s0, s2, s3, s0
 b94:   fc5e4002        ldur    d2, [x0, #-28]
 b98:   0e22cc81        fmla    v1.2s, v4.2s, v2.2s
 b9c:   fc1d8001        stur    d1, [x0, #-40]
 ba0:   bc1e0000        stur    s0, [x0, #-32]
 ba4:   eb00003f        cmp     x1, x0
 ba8:   54fffec1        b.ne    b80 <simulate_objects+0x2c>  // b.any
 bac:   1e2328a5        fadd    s5, s5, s3
 bb0:   1e2520d0        fcmpe   s6, s5
 bb4:   54fffe4c        b.gt    b7c <simulate_objects+0x28>
 bb8:   d65f03c0        ret

You may have spotted a few SIMD instructions (especially fmla v1.2s, v4.2s, v2.2s). This is better than no SIMD instructions, but it looks like the performance is not optimal.

Continue to the next section to learn how to optimize the code.

Back

Optimize SIMD code with vectorization-friendly data layout

Introduction

What exactly is data layout?

Improve data alignment

Increase complexity

Write hand optimized SIMD code

Structure of arrays

Migrate to the Scalable Vector Extension (SVE)

Next Steps

Optimize SIMD code with vectorization-friendly data layout

Data layout basics