Search Unity

[Bug] Burst performance issue with specific code

Discussion in 'Burst' started by tertle, Sep 24, 2019.

  1. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    This is a very weird, very specific burst performance issue.

    Was doing some benchmarks between a fixed point library and float and turned off a bunch of checks and the fixed point library suddenly performed much worse, even though it was executing less code.

    I simplified this down to a basic performance test and it pretty much comes down to this.


    This

    Code (CSharp):
    1.                         var product = (long)this.Input[i] * this.Input[j];
    2.                         var result = product >> 12;
    3.                         this.Output[i] = (int)result;
    executes 8x slower than

    Code (CSharp):
    1.                         var product = (long)this.Input[i] * this.Input[j];
    2.                         var result = product >> 12;
    3.  
    4.                         // 1 extra line of code which speeds up job 8x
    5.                         result += (product & IntSignBit) >> (12 - 1);
    6.  
    7.                         this.Output[i] = (int)result;
    Which makes no sense to me as the code is the same exact it has extra operations (add, sub, shift, and). I'd expect it to run slightly slower, not nearly an order of magnitude faster.

    .I have repeated this dozens of times, tweaking stuff, run the test in different orders, upgraded burst (this is running on latest package now same result), forced recompiles and the result are the same.

    Results:

    upload_2019-9-24_15-26-29.png

    Full Source:

    The burst jobs

    Code (CSharp):
    1.  
    2. [BurstCompile]
    3. private struct Test1Job : IJob
    4. {
    5.     public NativeArray<int> Input;
    6.     public NativeArray<int> Output;
    7.  
    8.     public void Execute()
    9.     {
    10.         for (var i = 0; i < Count; i++)
    11.         {
    12.             for (var j = 0; j < Count; j++)
    13.             {
    14.                 var product = (long)this.Input[i] * this.Input[j];
    15.                 var result = product >> 12;
    16.                 this.Output[i] = (int)result;
    17.             }
    18.         }
    19.     }
    20. }
    21.  
    22. [BurstCompile]
    23. private struct Test2Job : IJob
    24. {
    25.     public NativeArray<int> Input;
    26.     public NativeArray<int> Output;
    27.  
    28.     private const int IntSignBit = 1 << (12 - 1);
    29.  
    30.     public void Execute()
    31.     {
    32.         for (var i = 0; i < Count; i++)
    33.         {
    34.             for (var j = 0; j < Count; j++)
    35.             {
    36.                 var product = (long)this.Input[i] * this.Input[j];
    37.                 var result = product >> 12;
    38.  
    39.                 // 1 extra line of code which speeds up job 8x
    40.                 result += (product & IntSignBit) >> (12 - 1);
    41.  
    42.                 this.Output[i] = (int)result;
    43.             }
    44.         }
    45.     }
    46. }
    47.  
    The tests

    Code (CSharp):
    1.  
    2. [Test]
    3. [Performance]
    4. public void Test1()
    5. {
    6.     NativeArray<int> input = default;
    7.     NativeArray<int> output = default;
    8.  
    9.     input = new NativeArray<int>(Count, Allocator.TempJob);
    10.     output = new NativeArray<int>(Count, Allocator.TempJob);
    11.  
    12.     for (var i = 0; i < Count; i++)
    13.     {
    14.         input[i] = Random.Range((int)MultiMin, (int)MultiMax);
    15.     }
    16.  
    17.     Measure.Method(() =>
    18.         {
    19.             new Test1Job
    20.                 {
    21.                     Input = input,
    22.                     Output = output,
    23.                 }
    24.                 .Schedule().Complete();
    25.         })
    26.         .Run();
    27.  
    28.     input.Dispose();
    29.     output.Dispose();
    30. }
    31.  
    32. [Test]
    33. [Performance]
    34. public void Test2()
    35. {
    36.     NativeArray<int> input = default;
    37.     NativeArray<int> output = default;
    38.  
    39.     input = new NativeArray<int>(Count, Allocator.TempJob);
    40.     output = new NativeArray<int>(Count, Allocator.TempJob);
    41.  
    42.     for (var i = 0; i < Count; i++)
    43.     {
    44.         input[i] = Random.Range((int)MultiMin, (int)MultiMax);
    45.     }
    46.  
    47.     Measure.Method(() =>
    48.         {
    49.             new Test2Job
    50.                 {
    51.                     Input = input,
    52.                     Output = output,
    53.                 }
    54.                 .Schedule().Complete();
    55.         })
    56.         .Run();
    57.  
    58.     input.Dispose();
    59.     output.Dispose();
    60. }
    61.  
    Please tell me I've done something stupid. I can't see it though.

    -edit-

    safety system off
    leak detection off
    job debugger off
    burst compilation on
    synchronous compilation on
     
    Last edited: Sep 24, 2019
    Deleted User likes this.
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Submitted a bug report: 1186429 with this test.
     
    Grizmu likes this.
  3. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    I can reproduce - basically the second job is getting vectorized while the first isn't - don't have a handle on why yet though!
     
  4. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    So our compiler backend LLVM decides that for the first job that 'Vectorization is possible but not beneficial' (which obviously isn't the case!).

    This is an annoying issue that we don't have a good solution for yet - we want to allow our users to get guaranteed loop vectorization when they want it (even if it caused degraded performance in actuality!), but we don't have a handle on exactly how to expose that yet.
     
    JesOb and Grizmu like this.
  5. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Good quick response, thanks! I'm glad I'm not going insane. I'm pretty certain I checked around 15 times that I had scheduled Job1 in Test1, Job2 in Test2.
     
    Last edited: Sep 24, 2019
    sheredom likes this.
  6. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    So I started to look at this again while trying to work out why vectorization wasn't possible, and I suddenly realised that the inner loop can basically be folded away to nothing if the compiler was smart enough, because each loop iteration is overwriting the same location in Output:

    Code (CSharp):
    1. for (var i = 0; i < Count; i++)
    2. {
    3.     for (var j = 0; j < Count; j++)
    4.     {
    5.         var product = (long)this.Input[i] * this.Input[j];
    6.         var result = product >> 12;
    7.  
    8.         // All loop iterations will replace the contents of this variable with a new value, so the loop is redundant!
    9.         this.Output[i] = (int)result;
    10.     }
    11. }
    Were you intending this? If I make Output sized to Count * Count. and store into this.Output[i * Count + j] the code is vectorized like I would expect.
     
    tertle, GilCat and florianhanke like this.
  7. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    So this was mostly just a simple repo and I just used a couple of loops so I could actually benchmark it. Sorry if my repo was a bit deceptive, but it was actually just an issue with a method and it popped up in all sorts of circumstances.

    I stopped using this library but pulling it from repos I think it was from this method.

    Code (CSharp):
    1. public static fix operator *(fix x, fix y)
    2.         {
    3.             var product = (long)x.rawValue * y.rawValue;
    4.  
    5. #if !FIXMATH_NO_OVERFLOW
    6.             // The upper X bits should all be the same (the sign).
    7.             var upper = (uint)(product >> MultiSign);
    8. #endif
    9.  
    10. #if !FIXMATH_NO_OVERFLOW || !FIXMATH_NO_ROUNDING
    11.             if (product < 0)
    12.             {
    13. #if !FIXMATH_NO_OVERFLOW
    14.                 if (~upper != 0)
    15.                 {
    16.                     return Overflow;
    17.                 }
    18. #endif
    19.  
    20.                 // This adjustment is required in order to round -1/2 correctly
    21. #if !FIXMATH_NO_ROUNDING
    22.                 product--;
    23. #endif
    24.             }
    25.             else
    26.             {
    27. #if !FIXMATH_NO_OVERFLOW
    28.                 if (upper != 0)
    29.                 {
    30.                     return Overflow;
    31.                 }
    32. #endif
    33.             }
    34. #endif
    35.  
    36. #if FIXMATH_NO_ROUNDING
    37.             return new fix((int)(product >> Shift));
    38. #else
    39.             var result = product >> Shift;
    40.             result += (product & IntSignBit) >> (Shift - 1);
    41.  
    42.             return new fix((int)result);
    43. #endif
    44.         }
    45.  
    After stripping it all back and testing it independently I found it was just this code right at the bottom, which matches the work done in the above repo.

    Code (CSharp):
    1. #if FIXMATH_NO_ROUNDING
    2.             return new fix((int)(product >> Shift));
    3. #else
    4.             var result = product >> Shift;
    5.             result += (product & IntSignBit) >> (Shift - 1);
    6.  
    7.             return new fix((int)result);
    8. #endif
    Basically my jobs always executed noticeably faster if FIXMATH_NO_ROUNDING was not defined even though it was more operations.
     
    sheredom likes this.