
Burst, SIMD and float3 / float4 - best practices

Discussion in 'Burst' started by korzen303, Apr 19, 2018.

  1. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    Hi, I am preparing a tutorial on the Job System and Burst. I did some simple Burst performance measurements using different approaches to float3/float4 handling in terms of SIMD. I was always confused when writing algorithms that mix these two types; I wasn't sure whether extra operations get added due to swizzles, data loading, etc.

    Although it is a simple synthetic test, maybe some of you will find it useful.

    The table below shows timings in milliseconds for 100k, 1M and 10M elements in the data arrays, with the instruction count in brackets:

    Approach (instruction count)   100k      1M        10M
    Float3 (18)                    0.28 ms   2.19 ms   21.8 ms
    Float4to3 (15)                 0.28 ms   2.62 ms   26.1 ms
    Float4 (14)                    0.28 ms   2.64 ms   26.1 ms

    This is also interesting in the context of the GPU, but I guess it is valid for SIMD too:
    https://www.gamedev.net/forums/topi...etimes-float3/?do=findComment&comment=3188515


    The question is whether the Burst compiler can also combine different operations together and, in general, what the best practices are regarding this topic. If someone from Unity (@Joachim_Ante?) could comment on this, that would be much appreciated.



    First, just for reference, a scalar version: 14 assembly instructions.

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float1Job : IJob
    {
        public int dataSize;
        [ReadOnly] public NativeArray<float> dataA;
        [ReadOnly] public NativeArray<float> dataB;
        [WriteOnly] public NativeArray<float> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float a = dataA[i];
                float b = dataB[i];
                float sum = a + b;
                float mul = a * b;
                float res = (sum - mul) / 10.0f;
                dataOut[i] = res;
            }
        }
    }
    Code (ASM):
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movss   xmm1, dword ptr [r8 + rax]
    movss   xmm2, dword ptr [r9 + rax]
    movaps  xmm3, xmm2
    addss   xmm3, xmm1
    mulss   xmm2, xmm1
    subss   xmm3, xmm2
    mulss   xmm3, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movss   dword ptr [rdx + rax], xmm3
    inc     r10d
    add     eax, 4
    cmp     r10d, dword ptr [rcx]

    Next, both the data and the calculations use float3s. We get 18 instructions, with a few extra insertps and extractps instructions to prepare the data for SIMD:

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float3Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float3> dataA;
        [ReadOnly] public NativeArray<float3> dataB;
        [WriteOnly] public NativeArray<float3> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float3 a = dataA[i];
                float3 b = dataB[i];
                float3 sum = a + b;
                float3 mul = a * b;
                float3 res = (sum - mul) / 10.0f;
                dataOut[i] = res;
            }
        }
    }
    Code (ASM):
    cdqe
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movsd   xmm1, qword ptr [r8 + rax]
    insertps        xmm1, dword ptr [r8 + rax + 8], 32
    movsd   xmm2, qword ptr [r9 + rax]
    insertps        xmm2, dword ptr [r9 + rax + 8], 32
    movaps  xmm3, xmm2
    addps   xmm3, xmm1
    mulps   xmm2, xmm1
    subps   xmm3, xmm2
    mulps   xmm3, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movss   dword ptr [rdx + rax], xmm3
    extractps       dword ptr [rdx + rax + 4], xmm3, 1
    extractps       dword ptr [rdx + rax + 8], xmm3, 2
    inc     r10d
    add     eax, 12
    cmp     r10d, dword ptr [rcx]

    Next, I used float4 arrays but swizzled them down to float3s for the computations. We get 15 instructions, with only one extra blendps:

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float4to3Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float3 a = dataA[i].xyz;
                float3 b = dataB[i].xyz;
                float3 sum = a + b;
                float3 mul = a * b;
                float3 res = (sum - mul) / 10.0f;
                dataOut[i] = new float4(res, 0);
            }
        }
    }
    Code (ASM):
    cdqe
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movups  xmm2, xmmword ptr [r8 + rax]
    movups  xmm3, xmmword ptr [r9 + rax]
    movaps  xmm4, xmm3
    addps   xmm4, xmm2
    mulps   xmm2, xmm3
    subps   xmm4, xmm2
    mulps   xmm4, xmm0
    blendps xmm4, xmm1, 8
    mov     rdx, qword ptr [rcx + 120]
    movups  xmmword ptr [rdx + rax], xmm4
    inc     r10d
    add     eax, 16
    cmp     r10d, dword ptr [rcx]
    Finally, both the data and the computations use float4s. We get 14 instructions, just like the reference scalar code.

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float4Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float4 a = dataA[i];
                float4 b = dataB[i];
                float4 sum = a + b;
                float4 mul = a * b;
                float4 res = (sum - mul) / 10.0f;
                dataOut[i] = res;
            }
        }
    }
    Code (ASM):
    cdqe
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movups  xmm1, xmmword ptr [r8 + rax]
    movups  xmm2, xmmword ptr [r9 + rax]
    movaps  xmm3, xmm2
    addps   xmm3, xmm1
    mulps   xmm1, xmm2
    subps   xmm3, xmm1
    mulps   xmm3, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movups  xmmword ptr [rdx + rax], xmm3
    inc     r10d
    add     eax, 16
    cmp     r10d, dword ptr [rcx]
     
  2. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    So I have made an additional test, and it seems that Burst won't combine float3 and float operations into a single SIMD instruction. In other words, it will load the float3 as a float4 to do the SIMD addps and then do a scalar addss on the float.

    In this case, I think the conclusion is: if we are not memory bound, it is fastest to store the data as float4s. We get all the basic arithmetic operations (addps, mulps, etc.) with no insertps and extractps overhead, and using float3 ops such as math.cross via the float4.xyz swizzle costs only a single extra blendps.
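
    For illustration, a minimal sketch of this "store as float4, compute as float3 via .xyz" pattern with math.cross (not one of the measured jobs above; the job and field names are made up, and it assumes the same [ComputeJobOptimization] job skeleton as the examples in this thread):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [ComputeJobOptimization]
    struct CrossJob : IJob
    {
        public int dataSize;

        // xyz holds the payload, w is unused padding
        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                // float3 math on the xyz swizzles; only the float4 write-back
                // should need the single extra blend mentioned above
                float3 c = math.cross(dataA[i].xyz, dataB[i].xyz);
                dataOut[i] = new float4(c, 0.0f);
            }
        }
    }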

    Also, the Burst compiler won't unroll loops (https://forum.unity.com/threads/loop-vectorization-unroll.527355/) nor combine statements like the ones below into SIMD. Long story short, you need to explicitly ensure that vectorization is actually being used. Or am I missing something?

    Code (CSharp):
    for (int i = 0; i < dataSize; i += 4)
    {
        dataOut[i + 0] = dataA[i + 0] + dataB[i + 0];
        dataOut[i + 1] = dataA[i + 1] + dataB[i + 1];
        dataOut[i + 2] = dataA[i + 2] + dataB[i + 2];
        dataOut[i + 3] = dataA[i + 3] + dataB[i + 3];
    }


    Code (CSharp):
    [ComputeJobOptimization]
    struct FloatAndFloat3Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float> dataS;
        [ReadOnly] public NativeArray<float3> dataA;
        [ReadOnly] public NativeArray<float3> dataB;
        [WriteOnly] public NativeArray<float3> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float3 a = dataA[i].xyz;
                float3 b = dataB[i].xyz;
                float s = dataS[i];
                float3 sum = a + b;
                float sumS = s + s;
                float3 mul = a * b;
                float3 res = (sum - mul);
                dataOut[i] = res * sumS;
            }
        }
    }
    Code (ASM):
    mov     r9, qword ptr [rcx + 8]
    mov     r10, qword ptr [rcx + 64]
    cdqe
    movsd   xmm0, qword ptr [r10 + rax]
    insertps        xmm0, dword ptr [r10 + rax + 8], 32
    mov     r10, qword ptr [rcx + 120]
    movsd   xmm1, qword ptr [r10 + rax]
    insertps        xmm1, dword ptr [r10 + rax + 8], 32
    movsxd  rdx, edx
    movss   xmm2, dword ptr [r9 + rdx]
    movaps  xmm3, xmm1
    addps   xmm3, xmm0
    addss   xmm2, xmm2
    mulps   xmm1, xmm0
    subps   xmm3, xmm1
    shufps  xmm2, xmm2, 192
    mulps   xmm2, xmm3
    mov     r9, qword ptr [rcx + 176]
    movss   dword ptr [r9 + rax], xmm2
    extractps       dword ptr [r9 + rax + 4], xmm2, 1
    extractps       dword ptr [r9 + rax + 8], xmm2, 2
    inc     r8d
    add     edx, 4
    add     eax, 12
    cmp     r8d, dword ptr [rcx]
    jl      .LBB0_2
     
    Last edited: Apr 19, 2018
  3. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    Due to the above, simply converting the data from a float1 array to float4 for SIMD requires more than double the instructions (10 vs 23)! Is there any better way of doing this? Like some casts of variables, or maybe even of whole arrays when setting up the jobs?


    Code (CSharp):
    for (int i = 0; i < dataSize; i++)
    {
        // data arrays are float4
        float4 a = dataA[i];
        float4 b = dataB[i];

        float4 sum = a + b;

        dataOut[i] = sum;
    }
    10 instructions
    Code (ASM):
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movups  xmm0, xmmword ptr [r8 + rax]
    movups  xmm1, xmmword ptr [r9 + rax]
    addps   xmm1, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movups  xmmword ptr [rdx + rax], xmm1
    inc     r10d
    add     eax, 16
    cmp     r10d, dword ptr [rcx]


    Code (CSharp):
    for (int i = 0; i < dataSize; i += 4)
    {
        // data arrays are float1
        float4 a = new float4(dataA[i + 0], dataA[i + 1], dataA[i + 2], dataA[i + 3]);
        float4 b = new float4(dataB[i + 0], dataB[i + 1], dataB[i + 2], dataB[i + 3]);

        float4 sum = a + b;

        dataOut[i + 0] = sum.x;
        dataOut[i + 1] = sum.y;
        dataOut[i + 2] = sum.z;
        dataOut[i + 3] = sum.w;
    }
    23 instructions
    Code (ASM):
    lea     eax, [rdx - 12]
    movsxd  r9, eax
    lea     eax, [rdx - 8]
    movsxd  r10, eax
    lea     eax, [rdx - 4]
    movsxd  r11, eax
    movsxd  rdx, edx
    mov     rax, qword ptr [rcx + 8]
    mov     rsi, qword ptr [rcx + 64]
    movups  xmm0, xmmword ptr [rax + r9]
    movups  xmm1, xmmword ptr [rsi + r9]
    addps   xmm1, xmm0
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + r9], xmm1
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + r10], xmm1, 1
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + r11], xmm1, 2
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + rdx], xmm1, 3
    add     r8d, 4
    add     edx, 16
    cmp     r8d, dword ptr [rcx]
     
    Last edited: Apr 19, 2018
  4. M_R

    Joined: Apr 15, 2015
    Posts: 559

    If you have an array of float1 you need to convert them somewhere.
    Or you can do the equivalent of the C++ "reinterpret_cast<float4*>(float1Array)" using unsafe code.
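
    A minimal sketch of that reinterpret idea (not from the thread; it assumes unsafe code is enabled, the float count is a multiple of 4, and the NativeArrayUnsafeUtility pointer accessors; the job and field names are made up):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using Unity.Jobs;
    using Unity.Mathematics;

    [ComputeJobOptimization]
    unsafe struct ReinterpretAddJob : IJob
    {
        public int dataSize;   // number of floats, assumed to be a multiple of 4

        [ReadOnly] public NativeArray<float> dataA;
        [ReadOnly] public NativeArray<float> dataB;
        [WriteOnly] public NativeArray<float> dataOut;

        public void Execute()
        {
            // View the float buffers as float4 buffers without copying
            float4* a = (float4*)dataA.GetUnsafeReadOnlyPtr();
            float4* b = (float4*)dataB.GetUnsafeReadOnlyPtr();
            float4* o = (float4*)dataOut.GetUnsafePtr();

            for (int i = 0; i < dataSize / 4; i++)
                o[i] = a[i] + b[i];
        }
    }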
     
  5. Joachim_Ante (Unity Technologies)

    Joined: Mar 16, 2005
    Posts: 5,203

    If you want to ensure that your code is 100% vectorised, keep everything as float4 / int4 and use manual SoA form.

    Over time we will keep improving Burst, and particularly the auto-vectorisation, to make sure we produce the best possible code when it is written as scalar code.
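
    For illustration, a sketch of what manual SoA form could look like here (not an official sample; the job and field names are made up): each component is kept in its own float4 array, so every add below maps directly to one packed SIMD instruction.

    Code (CSharp):
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [ComputeJobOptimization]
    struct SoAAddJob : IJob
    {
        public int packedCount;   // number of float4 packets, i.e. pointCount / 4

        // Each component lives in its own array (SoA), so one float4 add
        // processes that component for four points at a time.
        [ReadOnly] public NativeArray<float4> xA, yA, zA;
        [ReadOnly] public NativeArray<float4> xB, yB, zB;
        [WriteOnly] public NativeArray<float4> xOut, yOut, zOut;

        public void Execute()
        {
            for (int i = 0; i < packedCount; i++)
            {
                xOut[i] = xA[i] + xB[i];
                yOut[i] = yA[i] + yB[i];
                zOut[i] = zA[i] + zB[i];
            }
        }
    }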
     
  6. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    Noisecrime likes this.
  7. xoofx (Unity Technologies)

    Joined: Nov 5, 2016
    Posts: 417

    Sorry for the late feedback.

    Currently, there is an issue with the released version of Burst (0.2.3): a regression with noalias. As you have experienced, your scalar loops are not auto-vectorized. It has been fixed on our side, and we will hopefully publish a new version in the coming days.

    Note that there will still be cases where the auto-vectorizer won't be able to do its job. For the 2018.2 beta we will give more details about these cases (some of them are in this thread), and we will work to improve auto-vectorization in subsequent releases.
     
    optimise likes this.
  8. xoofx (Unity Technologies)

    Joined: Nov 5, 2016
    Posts: 417

    FYI, we just released a new version of Burst that fixes the regression when auto-vectorizing scalar loops. You can update to the latest Burst package `0.2.4-preview.4` (you can follow this post on how to update your manifest.json).

    This should correctly vectorize the scalar loop in your example above and make it equivalent to the manual float4 loop.