Burst/Jobs best practices for data layout

Discussion in 'Burst' started by Robber33, Nov 22, 2019.

  1. Robber33

    Robber33

    Joined:
    Feb 22, 2015
    Posts:
    52
    As a simple example, say I have a ParallelFor job that adds velocities to positions.

    Would it be advisable to feed the job a NativeArray of float3 positions and a NativeArray of float3 velocities, or perhaps a NativeArray of structs with the following setup?

    struct PosVel {
        float x;
        float vx;
        float y;
        float vy;
        float z;
        float vz;
    }

    Or is this something the Burst compiler will optimise away for me?
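
    For reference, here is a minimal sketch of what I mean by the two-separate-arrays version (the job name and deltaTime field are just illustrative):

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [BurstCompile]
    struct IntegrateJob : IJobParallelFor
    {
        // Separate arrays: positions are read/write, velocities are read-only.
        public NativeArray<float3> positions;
        [ReadOnly] public NativeArray<float3> velocities;
        public float deltaTime;

        public void Execute(int i)
        {
            positions[i] += velocities[i] * deltaTime;
        }
    }
    Scheduled with something like new IntegrateJob { ... }.Schedule(positions.Length, 64).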
     
  2. Radu392

    Radu392

    Joined:
    Jan 6, 2016
    Posts:
    210
    As a best practice, you should use dynamic buffers so you don't have to sync with the main thread too much. You should also keep your data as separated as possible, unlike the struct you just posted: some jobs might need only positions, while other jobs might need both positions and velocities. A sketch of what that separation buys you is below.
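
    A rough sketch (job and field names made up for illustration): with separate arrays, a job that only needs positions never pulls velocity data into the cache at all.

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    // Reads only positions; velocity data never enters this job.
    [BurstCompile]
    struct ComputeBoundsJob : IJob
    {
        [ReadOnly] public NativeArray<float3> positions;
        public NativeArray<float3> minMax; // [0] = min, [1] = max (assumes positions is non-empty)

        public void Execute()
        {
            float3 mn = positions[0], mx = positions[0];
            for (int i = 1; i < positions.Length; i++)
            {
                mn = math.min(mn, positions[i]);
                mx = math.max(mx, positions[i]);
            }
            minMax[0] = mn;
            minMax[1] = mx;
        }
    }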
     
  3. thebanjomatic

    thebanjomatic

    Joined:
    Nov 13, 2016
    Posts:
    36
    The other thing to keep in mind is how this all maps to SIMD. If you are just component-wise doing this:

    Code (CSharp):
    pos[i].x = pos[i].x + vel[i].x;
    pos[i].y = pos[i].y + vel[i].y;
    pos[i].z = pos[i].z + vel[i].z;
    Then the best performance would be achieved with two separate NativeArrays of float4 (instead of float3, for alignment/size reasons), by essentially doing:
    Code (CSharp):
    pos[i] = pos[i] + vel[i];
    While you are using roughly 33% more memory with a float4 vs a float3, the processing should be faster, as you can operate on the whole vector without having to shuffle the data around to take advantage of the 128-bit SIMD instructions.
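
    A compact sketch of that float4 version (names illustrative, and assuming the w component is just unused padding):

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [BurstCompile]
    struct IntegrateFloat4Job : IJobParallelFor
    {
        // xyz hold the data, w is padding so each element is a 16-byte, SIMD-friendly float4.
        public NativeArray<float4> pos;
        [ReadOnly] public NativeArray<float4> vel;

        public void Execute(int i)
        {
            // One vector add per element; Burst can map this to a single 128-bit add.
            pos[i] += vel[i];
        }
    }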
     
    vildauget likes this.
  4. Robber33

    Robber33

    Joined:
    Feb 22, 2015
    Posts:
    52
    Thanks for the responses. I'm trying to understand how caching actually works.

    If it has to get a position from the position array and then a velocity from the velocity array, won't this cause cache misses as it has to switch between the arrays?

    That's why I'm wondering whether putting them together in one array, say by interleaving the position and velocity data, would give better performance.
     
  5. thebanjomatic

    thebanjomatic

    Joined:
    Nov 13, 2016
    Posts:
    36
    @Robber33
    Unfortunately I'm a bit unclear on this point myself. I believe the processor can detect that you are reading from two sets of contiguous memory and will prefetch from both arrays, but honestly I'm not really sure about that. There is also the AoSoA data layout, which is kind of the best of both worlds with regard to SIMD friendliness and cache locality.

    In AoSoA, you would use the float4 not as a drop-in replacement for a point/vector (x, y, z, w), but instead as a way of representing a generic group of 4 floats. In memory, an array of these objects would look like:
    x x x x vx vx vx vx y y y y vy vy vy vy z z z z vz vz vz vz

    More concretely, you get a structure like this:

    Code (CSharp):
    struct PosAndVelocity4 {
        float4 x;
        float4 vx;
        float4 y;
        float4 vy;
        float4 z;
        float4 vz;
    }
    ...

    points[i].x = points[i].x + points[i].vx;  // Adds 4 x's in one instruction
    points[i].y = points[i].y + points[i].vy;  // Adds 4 y's in one instruction
    points[i].z = points[i].z + points[i].vz;  // Adds 4 z's in one instruction
    Working with memory in this format is a bit more complicated, and it's not clear to me whether it is a good idea to store the data in that format permanently, or just to have a preprocessing step that takes a NativeArray<float4> of positions and a NativeArray<float4> of velocities and combines them into a NativeArray<PosAndVelocity4> when you need to do SIMD-heavy processing on it.
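
    As a rough sketch of what that packing step could look like (reusing the PosAndVelocity4 struct above with its fields made public, assuming each input float4 is one point/velocity with w unused, the counts are multiples of 4, and the job name is made up), it's essentially a 4x4 transpose per block:

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    // Packs 4 consecutive points into one AoSoA block.
    [BurstCompile]
    struct PackAoSoAJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<float4> positions;   // one point per element (x, y, z, w unused)
        [ReadOnly] public NativeArray<float4> velocities;  // one velocity per element
        [WriteOnly] public NativeArray<PosAndVelocity4> blocks;

        public void Execute(int block)  // one Execute per group of 4 points
        {
            int i = block * 4;
            float4 p0 = positions[i], p1 = positions[i + 1], p2 = positions[i + 2], p3 = positions[i + 3];
            float4 v0 = velocities[i], v1 = velocities[i + 1], v2 = velocities[i + 2], v3 = velocities[i + 3];

            PosAndVelocity4 b;
            b.x  = new float4(p0.x, p1.x, p2.x, p3.x);
            b.y  = new float4(p0.y, p1.y, p2.y, p3.y);
            b.z  = new float4(p0.z, p1.z, p2.z, p3.z);
            b.vx = new float4(v0.x, v1.x, v2.x, v3.x);
            b.vy = new float4(v0.y, v1.y, v2.y, v3.y);
            b.vz = new float4(v0.z, v1.z, v2.z, v3.z);
            blocks[block] = b;
        }
    }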

    I found the following presentation very insightful: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf
    In particular, the section starting on page 38
     
    Last edited: Nov 26, 2019
  6. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,993
    That's TCM (tightly coupled memory, which is becoming much less common), not cache. Cache works in small segments of memory, and there are many cache lines mapped to many different spots in memory at once. Look up 4/8-way set associative mapping to get a better feel for how this works.
     
    Robber33 likes this.