Loop Vectorization (unroll)

Discussion in 'Entity Component System' started by korzen303, Apr 18, 2018.

  1. korzen303

    Joined:
    Oct 2, 2012
    Posts:
    223
    I have been preparing a tutorial on the Job System and Burst and wanted to demonstrate simple SIMD optimizations. I manually unrolled a loop, yet the generated code is still not vectorized (the same happens with IJobParallelFor).

    EDIT: Here is a more in-depth analysis: https://forum.unity.com/threads/burst-simd-and-float3-float4-best-practices.527504/

    Code (CSharp):
    [ComputeJobOptimization]
    struct SumUnrollJob : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float> dataA;

        [ReadOnly] public NativeArray<float> dataB;

        [WriteOnly] public NativeArray<float> dataOut;

        public void Execute()
        {
            // Manually unrolled by 4. Assumes dataSize is a multiple of 4;
            // otherwise the last iteration indexes past the end of the arrays.
            for (int i = 0; i < dataSize; i += 4)
            {
                dataOut[i + 0] = dataA[i + 0] + dataB[i + 0];
                dataOut[i + 1] = dataA[i + 1] + dataB[i + 1];
                dataOut[i + 2] = dataA[i + 2] + dataB[i + 2];
                dataOut[i + 3] = dataA[i + 3] + dataB[i + 3];
            }
        }
    }
    Code (x86-64 assembly):
    lea     eax, [rdx - 12]
    movsxd  r10, eax
    mov     r9, qword ptr [rcx + 8]
    mov     rax, qword ptr [rcx + 64]
    movss   xmm0, dword ptr [rax + r10]
    addss   xmm0, dword ptr [r9 + r10]
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + r10], xmm0
    lea     eax, [rdx - 8]
    movsxd  r10, eax
    mov     r9, qword ptr [rcx + 8]
    mov     rax, qword ptr [rcx + 64]
    movss   xmm0, dword ptr [rax + r10]
    addss   xmm0, dword ptr [r9 + r10]
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + r10], xmm0
    lea     eax, [rdx - 4]
    movsxd  r10, eax
    mov     r9, qword ptr [rcx + 8]
    mov     rax, qword ptr [rcx + 64]
    movss   xmm0, dword ptr [rax + r10]
    addss   xmm0, dword ptr [r9 + r10]
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + r10], xmm0
    movsxd  rdx, edx
    mov     r9, qword ptr [rcx + 8]
    mov     rax, qword ptr [rcx + 64]
    movss   xmm0, dword ptr [rax + rdx]
    addss   xmm0, dword ptr [r9 + rdx]
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + rdx], xmm0
    add     r8d, 4
    add     edx, 16
    cmp     r8d, dword ptr [rcx]
    jl      .LBB0_2

    My next approach was to use float4 explicitly. Although the generated assembly now contains a vectorized addps, I didn't notice any speed-up (around 20 ms for 1M elements). So I am wondering whether this code is optimal in terms of memory reads and writes, since there still seems to be a lot of load/store traffic.

    Code (CSharp):
    [ComputeJobOptimization]
    struct SumUnroll2Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float> dataA;
        [ReadOnly] public NativeArray<float> dataB;
        [WriteOnly] public NativeArray<float> dataOut;

        public void Execute()
        {
            //UNROLL
            for (int i = 0; i < dataSize; i += 4)
            {
                float4 a = new float4(dataA[i + 0], dataA[i + 1], dataA[i + 2], dataA[i + 3]);
                float4 b = new float4(dataB[i + 0], dataB[i + 1], dataB[i + 2], dataB[i + 3]);

                float4 sum = a + b;

                dataOut[i + 0] = sum.x;
                dataOut[i + 1] = sum.y;
                dataOut[i + 2] = sum.z;
                dataOut[i + 3] = sum.w;
            }
        }
    }
    Code (x86-64 assembly):
    lea     eax, [rdx - 12]
    movsxd  r9, eax
    lea     eax, [rdx - 8]
    movsxd  r10, eax
    lea     eax, [rdx - 4]
    movsxd  r11, eax
    movsxd  rdx, edx
    mov     rax, qword ptr [rcx + 8]
    mov     rsi, qword ptr [rcx + 64]
    movups  xmm0, xmmword ptr [rax + r9]
    movups  xmm1, xmmword ptr [rsi + r9]
    addps   xmm1, xmm0
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + r9], xmm1
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + r10], xmm1, 1
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + r11], xmm1, 2
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + rdx], xmm1, 3
    add     r8d, 4
    add     edx, 16
    cmp     r8d, dword ptr [rcx]
    jl      .LBB0_2

     
    Last edited: Apr 19, 2018
  2. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    One thing to check for in terms of performance...
    "Jobs -> Enable Burst Safety Checks"

    Make sure that's disabled when entering play mode. The safety checks have a big impact on performance for simple loops.
     
  3. deplinenoise

    Unity Technologies

    Joined:
    Dec 20, 2017
    Posts:
    33
    If you change your NativeArrays to contain float4 and let the inner loop just be dataOut[i] = dataA[i] + dataB[i], you should see much better codegen.
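
    A minimal sketch of what that suggestion could look like, assuming the same three arrays as in the original job but with float4 elements. The name SumFloat4Job is hypothetical, and [BurstCompile] is the name later Burst releases use for the attribute that this thread calls [ComputeJobOptimization]. This is an illustration of the idea, not code from the Burst team.

    ```csharp
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    // Hypothetical job name; the field names follow the original post.
    [BurstCompile] // called [ComputeJobOptimization] in the Burst preview used in this thread
    struct SumFloat4Job : IJob
    {
        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            // Each element is already a 4-wide vector, so one iteration
            // adds four floats and no per-lane packing or unpacking is
            // written in the C# source.
            for (int i = 0; i < dataOut.Length; i++)
            {
                dataOut[i] = dataA[i] + dataB[i];
            }
        }
    }
    ```

    With this layout each array holds a quarter as many elements as the float version, so 1M floats become 250,000 float4 values, and the float count has to be padded to a multiple of four. Whether the generated code actually improves depends on the Burst version in use.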
     
  4. deplinenoise

    Unity Technologies

    Joined:
    Dec 20, 2017
    Posts:
    33
    There is a problem with aliasing analysis in the current 0.2.3 version of Burst that is causing your second loop to look much worse than it should. We're tracking this issue internally.
     
  5. GabrieleUnity

    Unity Technologies

    Joined:
    Sep 4, 2012
    Posts:
    116
    @korzen303 We fixed the aliasing issues in the latest Burst release. We will update everything soon.
     