Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

Discussion in 'Burst' started by nxrighthere, Jul 23, 2019.

Thread Status:
Not open for further replies.
  1. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I was curious how well Burst/IL2CPP optimizes C# code compared with GCC/Clang compiling C, so I've ported five famous benchmarks, plus a raytracer, a minified flocking simulation, particle kinematics, a stream cipher, and radix sort, with different workloads, and made them identical between the two languages. The C code is compiled with aggressive optimizations using the -DNDEBUG -Ofast -march=native -flto compiler options. Benchmarks were run on Windows 10 with a Ryzen 5 1400 using a standalone build. Mono JIT and RyuJIT are included for fun.

    Source code and benchmark results are available on GitHub.

    This project is sponsored by JetBrains.

     
    Last edited: Mar 17, 2020
    bit-master, Krajca, PutridEx and 17 others like this.
  2. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It's just easier to read the numbers with thousands separators.
     
    tigerleapgorge likes this.
  3. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Ah, thanks. I was a bit too lazy to format the numbers manually.
     
  4. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Code (CSharp):
    private struct MandelbrotBurst : IJob {
        public uint width;
        public uint height;
        public uint iterations;
        public float result;

        public void Execute() {
            result = Mandelbrot(width, height, iterations);
        }

        private float Mandelbrot(uint width, uint height, uint iterations) {
            float data = 0.0f;

            for (int i = 0; i < iterations; i++) { // ideally you should have for loops broken up into jobs
                float  // should declare variables outside of loops (*1)
                    left = -2.1f,
                    right = 1.0f,
                    top = -1.3f,
                    bottom = 1.3f,
                    deltaX = (right - left) / width,  // invert this, e.g. invWidth = 1f / width; outside of the loop and multiply.
                    deltaY = (bottom - top) / height, // ditto for height.
                    coordinateX = left;

                for (int x = 0; x < width; x++) { // ideally you should have for loops broken up into jobs
                    float coordinateY = top; // (*1)

                    for (int y = 0; y < height; y++) { // ideally you should have for loops broken up into jobs
                        float workX = 0;  // (*1)
                        float workY = 0;
                        int counter = 0;

                        // should use Burst Mathematics.Sqrt not Math.Sqrt
                        while (counter < 255 && Math.Sqrt((workX * workX) + (workY * workY)) < 2.0f) {
                            counter++;

                            // recalculating workX * workX and workY * workY multiple times in the loop and conditional test.
                            float newX = (workX * workX) - (workY * workY) + coordinateX; // (*1)

                            workY = 2 * workX * workY + coordinateY;
                            workX = newX;
                        }

                        data = workX + workY;
                        coordinateY += deltaY;
                    }

                    coordinateX += deltaX;
                }
            }

            return data;
        }
    }
    Just some thoughts on how you can optimise this code.

    Note you really should break down the for loops into jobs for maximum throughput.
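One of the suggestions above (cache the squares and avoid recomputing them in the loop condition) can be sketched in C as follows. This is only an illustration, not code from the benchmark: the helper name `escape_count` is hypothetical, and it additionally replaces the `sqrt(...) < 2` test with the mathematically equivalent `... < 4` comparison, which may differ from the original for values within rounding error of the boundary.

```c
/* Hypothetical helper (not part of the benchmark): counts iterations until
   the point (cx, cy) escapes, capped at 255 as in the code above. */
int escape_count(float cx, float cy) {
    float x = 0.0f, y = 0.0f;
    float x2 = 0.0f, y2 = 0.0f; /* cached squares, shared by test and update */
    int counter = 0;

    /* sqrt(x2 + y2) < 2 is replaced by x2 + y2 < 4 (no square root needed) */
    while (counter < 255 && x2 + y2 < 4.0f) {
        counter++;
        y = 2.0f * x * y + cy; /* uses the old x, so update y first */
        x = x2 - y2 + cx;
        x2 = x * x;
        y2 = y * y;
    }
    return counter;
}
```

A point that never escapes, such as the origin, runs the full 255 iterations, while a point far outside the set leaves the loop after one or two.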
     
    tigerleapgorge likes this.
  5. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Yes, such changes will have an impact. My main goal was to keep the code the same as the C version, using only plain methods and loops with standard math, to see how well the compiler itself can optimize it without my spending additional effort. Burst is optimized for tight loops in general. As far as I know, @xoofx has already tried the Mandelbrot test; I would like to hear his thoughts on this, by the way.
     
  6. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I just checked the source code of Unity.Mathematics, and math.sqrt() is the same as the standard System.Math.Sqrt().
     
  7. elcionap

    elcionap

    Joined:
    Jan 11, 2016
    Posts:
    138
    Just to add more information:
    https://docs.unity3d.com/Packages/com.unity.burst@1.1/manual/index.html

    TBH, I think that even if it weren't implemented as an intrinsic, it would be great to show how Burst deals with already existing code/libs.

    Thanks for benchmarking it.

    []'s
     
    nxrighthere likes this.
  8. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Thanks, good to know this.

    After inspecting the assembly code produced by GCC, I found that we can make it faster than Burst in the Fibonacci test by using -fno-asynchronous-unwind-tables, which suppresses the generation of static unwind tables for exception handling: 103,578,985 ticks drop to 84,983,484. Other tests remain unaffected. I will add this as a note.
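For readers who want to try the flag, here is a minimal sketch. It assumes a naive recursive Fibonacci in the spirit of the benchmark; the actual benchmark source is on GitHub and may differ. The file name `fib.c` and the function name are hypothetical.

```c
#include <stdint.h>

/* Possible build line, combining the options quoted in this thread:
     gcc -DNDEBUG -Ofast -march=native -flto -fno-asynchronous-unwind-tables fib.c
   (fib.c is a hypothetical file name). */

/* Naive doubly recursive Fibonacci: deliberately expensive, so the cost of
   each call and return dominates the run time. */
uint32_t fibonacci(uint32_t n) {
    if (n < 2) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}
```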
     
    Last edited: Jul 23, 2019
  9. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Can Burst optimise Sqrt?
     
  10. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    A -> (Burst) Mandelbrot: 33,005,292 ticks
    B -> (Burst) Mandelbrot: 6,495,352 ticks

    Version B
    Code (CSharp):
    private float Mandelbrot(uint width, uint height, uint iterations)
    {
        float data = 0.0f;

        // Note: unlike version A, coordinateX is initialised once here and is
        // never reset to left at the start of each iteration.
        float
            left = -2.1f,
            right = 1.0f,
            top = -1.3f,
            bottom = 1.3f,
            deltaX = (right - left) / width,
            deltaY = (bottom - top) / height,
            coordinateX = left;

        float coordinateY;  // moved variables outside of loops
        float workX = 0;
        float workY = 0;
        int counter = 0;

        int x;
        int y;

        float newX;

        float workX2 = 0; // added variable for square values that are reused.
        float workY2 = 0;

        for (int i = 0; i < iterations; i++)
        {
            for (x = 0; x < width; x++)
            {
                coordinateY = top;

                for (y = 0; y < height; y++)
                {
                    workX = 0;
                    workY = 0;
                    counter = 0;

                    workX2 = 0;
                    workY2 = 0;

                    while (counter < 255 && Math.Sqrt((workX2 + workY2)) < 2.0f)
                    {
                        counter++;

                        newX = (workX2) - (workY2) + coordinateX;

                        workY = 2 * workX * workY + coordinateY;
                        workX = newX;

                        workX2 = workX * workX;
                        workY2 = workY * workY;
                    }

                    data = workX + workY;
                    coordinateY += deltaY;
                }

                coordinateX += deltaX;
            }
        }

        return data;
    }
     
    webik150 and tigerleapgorge like this.
  11. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Actually, I wonder if the Mandelbrot and NBody tests could gain from working with float2/double3 types, as that might give the Burst compiler more SIMD options than just using basic variables.
     
    Last edited: Jul 24, 2019
  12. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Since it's implemented as an intrinsic, Burst uses the compiler's built-in function for the square root.

    Interesting, so nested stack allocations make it significantly slower. I'm going to check how it works on my machine and compare it to GCC.

    Absolutely, this will have an impact as well.
     
  13. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
  14. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
  15. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    Doing fair benchmarks is hard. The first thing to be clear about is: what is the benchmark trying to measure?

    1. You take completely unoptimised scalar code and see how well the auto-vectoriser and code optimisations work out.
    2. You want to write general-purpose, cross-platform, performant code and are willing to spend the time to force the compiler to generate good code. As much as possible, you use explicit float4 SOA vectorised code. You want control.
    3. You are a SIMD performance expert: you understand the target platform and its assembly instructions, and you want to manually lay out exactly the instructions you want to run.


    #1: Burst is competitive against C. There is still lots we can and will do. As you found, some benchmarks are slower and some are faster. Generally it's not a very predictable model, and it is very hard to find out why something is slow or fast. This is also the domain that most optimisation research in the C/C++ compiler space goes into. The biggest issue in this space for games is that a tiny change to the code can result in massive differences in performance, due to the unpredictable nature of this approach.

    #2 is what Burst is best at. That's where we currently focus. To a large extent this comes down to the usability of the math library: making it easy to write SOA SIMD code.

    #3 is where we want to add architecture-specific intrinsics to make this possible.

    It seems like the benchmark is all about #1. So essentially it's a benchmark for code that is not written to be optimised. Don't get me wrong, most code out there is written exactly like that, so there is value in a compiler making it as fast as it can. But if you care about performance, that's not how you write code. So a benchmark purely focused on #1 doesn't seem right.
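To make the "SOA" term in point 2 concrete, here is a small sketch in plain C (not Burst or Unity.Mathematics code) contrasting an array-of-structures layout with a structure-of-arrays layout. All type and function names are hypothetical, and whether a given compiler actually vectorises either loop is something to verify in its output rather than assume.

```c
#include <stddef.h>

/* Array-of-structures: the x, y, z of one particle are adjacent in memory. */
typedef struct {
    float x, y, z;
} ParticleAoS;

/* Structure-of-arrays: all x values are contiguous, then all y, then all z. */
typedef struct {
    float *x;
    float *y;
    float *z;
} ParticlesSoA;

/* SoA version: consecutive iterations read consecutive floats from each
   array, which maps naturally onto loading several lanes at once. */
void length_sq_soa(const ParticlesSoA *p, float *out, size_t count) {
    for (size_t i = 0; i < count; i++) {
        out[i] = p->x[i] * p->x[i] + p->y[i] * p->y[i] + p->z[i] * p->z[i];
    }
}

/* AoS version: same arithmetic, but each component is strided by the
   size of the whole struct. */
void length_sq_aos(const ParticleAoS *p, float *out, size_t count) {
    for (size_t i = 0; i < count; i++) {
        out[i] = p[i].x * p[i].x + p[i].y * p[i].y + p[i].z * p[i].z;
    }
}
```

Both functions compute the same values; only the memory layout of the input differs.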
     
    Last edited: Jul 24, 2019
  16. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    @Arowx I intentionally used expensive algorithms for the tests, similar to how compiler engineers do it. The goal is to get the bare numbers that a compiler can give you without spending human effort on optimizations.

    @Joachim_Ante Yes, I absolutely agree, since Burst itself is not a general-purpose compiler; at the tip of the iceberg, it's essentially a transpiler designed for specific use cases. I'll add more benchmarks that cover various cases very soon, just as experiments. Thanks.
     
  17. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Added Sieve of Eratosthenes; GCC is about 1% faster than Burst:

    (Burst) Sieve of Eratosthenes: 43,449,732 ticks
    (GCC) Sieve of Eratosthenes: 42,965,656 ticks
    (Mono JIT) Sieve of Eratosthenes: 55,741,659 ticks
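For reference, a textbook Sieve of Eratosthenes looks roughly like the sketch below. This is not the benchmark's implementation (that lives in the GitHub repository and may be structured differently); the function name `count_primes` is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Counts the primes <= limit with a textbook sieve.
   Returns 0 when limit < 2 or when the allocation fails. */
uint32_t count_primes(uint32_t limit) {
    if (limit < 2) {
        return 0;
    }

    /* composite[n] != 0 means n has been marked as a multiple of a prime */
    uint8_t *composite = calloc((size_t)limit + 1, 1);
    if (composite == NULL) {
        return 0;
    }

    uint32_t count = 0;
    for (uint64_t i = 2; i <= limit; i++) {
        if (composite[i]) {
            continue;
        }
        count++;
        /* Start at i * i: smaller multiples were marked by smaller primes. */
        for (uint64_t j = i * i; j <= limit; j += i) {
            composite[j] = 1;
        }
    }

    free(composite);
    return count;
}
```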
     
  18. slime73

    slime73

    Joined:
    May 14, 2017
    Posts:
    107
    IL2CPP versions of those benchmarks might be interesting. I guess you'd probably just need to prevent burst compilation of the job code, and build an il2cpp version of the player.
     
  19. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Yes, I was thinking about this. I'm going to provide IL2CPP results very soon, after adding the tiny Pixar raytracer.
     
  20. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Attempt at vectorising the NBody simulation...
    Code (CSharp):
    // NBody

    private struct NBody
    {
        public double3 xyz, vxyz;
        public double mass;
    }

    [BurstCompile]
    private unsafe struct NBodyBurst : IJob
    {
        public uint advancements;
        public double result;

        public void Execute()
        {
            result = NBody(advancements);
        }

        private double NBody(uint advancements)
        {
            NBody* sun = stackalloc NBody[5];
            NBody* end = sun + 4;

            InitializeBodies(sun, end);
            Energy(sun, end);

            while (advancements-- > 0)
            {
                Advance(sun, end, 0.01d);
            }

            Energy(sun, end);

            return sun[0].xyz.x + sun[0].xyz.y;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private void InitializeBodies(NBody* sun, NBody* end)
        {
            const double pi = 3.141592653589793;
            const double solarmass = 4 * pi * pi;
            const double daysPerYear = 365.24;

            unchecked
            {
                sun[1] = new NBody
                { // Jupiter
                    xyz = new double3(4.84143144246472090e+00,
                                      -1.16032004402742839e+00,
                                      -1.03622044471123109e-01),
                    vxyz = new double3(1.66007664274403694e-03 * daysPerYear,
                                       7.69901118419740425e-03 * daysPerYear,
                                       -6.90460016972063023e-05 * daysPerYear),
                    mass = 9.54791938424326609e-04 * solarmass
                };

                sun[2] = new NBody
                { // Saturn
                    xyz = new double3(8.34336671824457987e+00,
                                      4.12479856412430479e+00,
                                      -4.03523417114321381e-01),
                    vxyz = new double3(-2.76742510726862411e-03 * daysPerYear,
                                       4.99852801234917238e-03 * daysPerYear,
                                       2.30417297573763929e-05 * daysPerYear),
                    mass = 2.85885980666130812e-04 * solarmass
                };

                sun[3] = new NBody
                { // Uranus
                    xyz = new double3(1.28943695621391310e+01,
                                      -1.51111514016986312e+01,
                                      -2.23307578892655734e-01),
                    vxyz = new double3(2.96460137564761618e-03 * daysPerYear,
                                       2.37847173959480950e-03 * daysPerYear,
                                       -2.96589568540237556e-05 * daysPerYear),
                    mass = 4.36624404335156298e-05 * solarmass
                };

                sun[4] = new NBody
                { // Neptune
                    xyz = new double3(1.53796971148509165e+01,
                                      -2.59193146099879641e+01,
                                      1.79258772950371181e-01),
                    vxyz = new double3(2.68067772490389322e-03 * daysPerYear,
                                       1.62824170038242295e-03 * daysPerYear,
                                       -9.51592254519715870e-05 * daysPerYear),
                    mass = 5.15138902046611451e-05 * solarmass
                };

                double3 v = new double3();

                for (NBody* planet = sun + 1; planet <= end; ++planet)
                {
                    double mass = planet->mass;

                    v += planet->vxyz * mass;
                }

                sun->mass = solarmass;
                sun->vxyz = v / -solarmass;
            }
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private void Energy(NBody* sun, NBody* end)
        {
            unchecked
            {
                double e = 0.0;
                double imass;
                double3 ixyz, ivxyz;
                NBody* bj;
                NBody* bi;
                double jmass;
                double3 dxyz;

                for (bi = sun; bi <= end; ++bi)
                {
                    imass = bi->mass;
                    ixyz = bi->xyz;
                    ivxyz = bi->vxyz;

                    e += 0.5 * imass * (math.length(ivxyz) * math.length(ivxyz));

                    for (bj = bi + 1; bj <= end; ++bj)
                    {
                        jmass = bj->mass;

                        dxyz = ixyz - bj->xyz;

                        e -= imass * jmass / Math.Sqrt(math.dot(dxyz, dxyz));
                    }
                }
            }
        }

        // Note: this helper returns d2 * sqrt(d2) but is never called in this version.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private double GetD2(double dx, double dy, double dz)
        {
            double d2 = dx * dx + dy * dy + dz * dz;

            return d2 * Math.Sqrt(d2);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private void Advance(NBody* sun, NBody* end, double distance)
        {
            unchecked
            {
                double3 ixyz, ivxyz, dxyz;
                double jmass, imass, mag;
                NBody* bi, bj;

                for (bi = sun; bi < end; ++bi)
                {
                    ixyz = bi->xyz;
                    ivxyz = bi->vxyz;

                    imass = bi->mass;

                    for (bj = bi + 1; bj <= end; ++bj)
                    {
                        dxyz = bj->xyz - ixyz;
                        jmass = bj->mass;
                        // Note: divides by the squared length only, not by GetD2's d2 * sqrt(d2).
                        mag = distance / math.lengthsq(dxyz);

                        bj->vxyz = bj->vxyz - dxyz * imass * mag;
                        ivxyz = ivxyz + dxyz * jmass * mag;
                    }

                    bi->vxyz = ivxyz;
                    bi->xyz = ixyz + ivxyz * distance;
                }

                end->xyz = end->xyz + end->vxyz * distance;
            }
        }
    }
    My theory is that the use of double3 should allow Burst to vectorise more operations?
     
  21. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    It's not guaranteed; vectorization may fail in various circumstances. Usually, a programmer has to follow particular criteria to achieve auto-vectorization within the capabilities of a compiler (nested functions, data dependencies, conditional statements), and that goes beyond simply replacing a data type. Burst has its own rules for this as well, and it should give a verbose indication of failures.
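The data-dependency criterion mentioned above can be illustrated with two small C loops. This is a generic sketch with hypothetical function names, not Burst-specific code; whether a particular compiler vectorises the first loop still depends on its own rules and flags.

```c
#include <stddef.h>

/* Every iteration is independent of the others, and `restrict` promises the
   arrays do not overlap: a typical candidate for auto-vectorization. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] * k + b[i];
    }
}

/* Each iteration reads the result of the previous one (a loop-carried
   dependency), so iterations cannot simply run side by side in SIMD lanes. */
void prefix_sum(float *data, size_t n) {
    for (size_t i = 1; i < n; i++) {
        data[i] += data[i - 1];
    }
}
```

Swapping `float` for a wider vector type would not remove the dependency in the second loop; the algorithm itself would have to change.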
     
  22. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    (Burst) NBodyV: 69,718,514 ticks -- double3 and moving variables out of loops.
    (Burst) NBody: 84,495,822 ticks
    (Mono JIT) NBody: 457,873,937 ticks
     
  23. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Can you post numbers without moving variables out of loops?
     
  24. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I'm only moving the variable declarations; it's a basic optimisation all programmers should learn and adopt.
     
    MadeFromPolygons likes this.
  25. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Sure, so what about numbers?
     
    JesOb likes this.
  26. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Added Pixar Raytracer:

    (Burst) Pixar Raytracer: 245,079,625 ticks
    (GCC) Pixar Raytracer: 71,429,484 ticks
    (Mono JIT) Pixar Raytracer: 1,388,610,887 ticks

     
  27. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    (Burst) NBodyV: 69,718,514 ticks -- double3 and moving variables out of loops.
    (Burst) NBody: 84,495,822 ticks
    (Mono JIT) NBody: 457,873,937 ticks
     
  28. Wokky

    Wokky

    Joined:
    Apr 3, 2014
    Posts:
    11
    I was very skeptical about this, so I put together a simple comparison based on the Mandelbrot benchmark.

    Here's a version that moves unnecessary work out of the loop but keeps variable declarations within their relevant scopes:
    Code (CSharp):
    float MandelbrotDeclareWithinLoops(uint width, uint height, uint iterations) {
        float data = 0.0f;

        const float LEFT = -2.1f;
        const float RIGHT = 1.0f;
        const float TOP = -1.3f;
        const float BOTTOM = 1.3f;

        float deltaX = (RIGHT - LEFT) / width;
        float deltaY = (BOTTOM - TOP) / height;

        for (int i = 0; i < iterations; i++) {
            // Declaring within loops
            float coordinateX = LEFT;

            for (int x = 0; x < width; x++) {
                float coordinateY = TOP;

                for (int y = 0; y < height; y++) {
                    float workX = 0;
                    float workY = 0;
                    float counter = 0;

                    while (counter < 255 && math.sqrt((workX * workX) + (workY * workY)) < 2.0f) {
                        counter++;
                        float workXSquared = workX * workX;
                        workX = workXSquared - (workY * workY) + coordinateX;
                        // Note: the benchmark code earlier in the thread computes 2 * workX * workY here.
                        workY = 2 * workXSquared + coordinateY;
                    }

                    data = workX + workY;
                    coordinateY += deltaY;
                }

                coordinateX += deltaX;
            }
        }

        return data;
    }
    And here's a version where all variables are hoisted out of the loop:
    Code (CSharp):
    float MandelbrotHoistedDeclarations(uint width, uint height, uint iterations) {
        float data = 0.0f;

        const float LEFT = -2.1f;
        const float RIGHT = 1.0f;
        const float TOP = -1.3f;
        const float BOTTOM = 1.3f;

        float deltaX = (RIGHT - LEFT) / width;
        float deltaY = (BOTTOM - TOP) / height;

        // Variable declarations hoisted out of loop
        float coordinateX;
        float coordinateY;
        float workX;
        float workY;
        float counter;
        float workXSquared;

        for (int i = 0; i < iterations; i++) {
            coordinateX = LEFT;

            for (int x = 0; x < width; x++) {
                coordinateY = TOP;

                for (int y = 0; y < height; y++) {
                    workX = 0;
                    workY = 0;
                    counter = 0;

                    while (counter < 255 && math.sqrt((workX * workX) + (workY * workY)) < 2.0f) {
                        counter++;
                        workXSquared = workX * workX;
                        workX = workXSquared - (workY * workY) + coordinateX;
                        // Note: the benchmark code earlier in the thread computes 2 * workX * workY here.
                        workY = 2 * workXSquared + coordinateY;
                    }

                    data = workX + workY;
                    coordinateY += deltaY;
                }

                coordinateX += deltaX;
            }
        }

        return data;
    }
    Burst produces functionally identical assembly output for both versions, which is what I would expect from the majority of compilers in use.
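The same experiment is easy to repeat with a C compiler. The sketch below uses two hypothetical functions that differ only in where the temporaries are declared; compiling with optimizations and comparing the generated assembly (for example with `gcc -O2 -S`) is one way to check whether the compiler treats them identically. No particular output is claimed here.

```c
/* Temporary declared inside the loop body. */
float sum_squares_inner(const float *v, int n) {
    float total = 0.0f;
    for (int i = 0; i < n; i++) {
        float sq = v[i] * v[i];
        total += sq;
    }
    return total;
}

/* Same computation with the declarations hoisted to function scope. */
float sum_squares_hoisted(const float *v, int n) {
    float total = 0.0f;
    float sq;
    int i;
    for (i = 0; i < n; i++) {
        sq = v[i] * v[i];
        total += sq;
    }
    return total;
}
```

A declaration of a plain local variable is not an instruction by itself; what matters to the optimizer is where values are written and read, which is the same in both functions.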
     
  29. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Could you please post the assembly outputs and benchmark the two for comparison?
     
    Last edited: Jul 25, 2019
  30. Wokky

    Wokky

    Joined:
    Apr 3, 2014
    Posts:
    11
    On an additional note, shouldn't you be using CompileSynchronously = true with these benchmarks?
     
    nxrighthere likes this.
  31. Wokky

    Wokky

    Joined:
    Apr 3, 2014
    Posts:
    11
    It's a lot of text to embed in a forum post, but if you don't believe me it's pretty easy to copy it into a project, open the Burst Inspector and see for yourself.
     
    eizenhorn likes this.
  32. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    That's what the spoiler forum tags are for.
     
    Nothke likes this.
  33. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    CompileSynchronously = true needs to be there for sure. On top of that, there should be a warm-up iteration that does not get measured. There is a real cost to running something for the first time: we have to extract and cache job reflection data on the first run. CompileSynchronously = true plus one warm-up iteration fixes that, and it is what we use in our own benchmarks.

    Why are you using doubles instead of floats in the NBody simulation? If you care about performance, double is used very rarely in games, and only in places that are not performance sensitive.
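The warm-up idea is independent of Unity and can be sketched in plain C. The harness below is only an illustration under stated assumptions: it is not Unity's benchmarking code, all names are hypothetical, and it uses the standard `clock()` function, which measures CPU time rather than the tick counts quoted in this thread.

```c
#include <time.h>

typedef void (*benchmark_fn)(void);

/* Hypothetical workload used for demonstration only. */
unsigned long workload_calls = 0;
volatile unsigned long workload_sink = 0;

void sample_workload(void) {
    workload_calls++;
    for (unsigned long i = 0; i < 100000UL; i++) {
        workload_sink += i;
    }
}

/* Calls fn once without timing it (the warm-up), then returns the mean CPU
   time in seconds over `runs` timed calls. Returns -1.0 if runs <= 0. */
double time_with_warmup(benchmark_fn fn, int runs) {
    if (runs <= 0) {
        return -1.0;
    }

    fn(); /* warm-up: absorbs one-time costs such as cold caches */

    clock_t start = clock();
    for (int i = 0; i < runs; i++) {
        fn();
    }
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Calling `time_with_warmup(sample_workload, 3)` runs the workload four times in total and reports the mean of the last three.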
     
    Last edited: Jul 25, 2019
  34. Wokky

    Wokky

    Joined:
    Apr 3, 2014
    Posts:
    11
    .text
    .intel_syntax noprefix
    .file "main"
    .section .rodata.cst4,"aM",@progbits,4
    .p2align 2
    .LCPI0_0:
    .long 1078355558
    .LCPI0_1:
    .long 1076258406
    .LCPI0_2:
    .long 3221644902
    .LCPI0_3:
    .long 3215353446
    .LCPI0_4:
    .long 3204448256
    .LCPI0_5:
    .long 3225419776
    .LCPI0_6:
    .long 1073741824
    .text
    .globl "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB"
    .p2align 4, 0x90
    .type "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB",@function
    "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB":
    push rsi
    sub rsp, 176
    movaps xmmword ptr [rsp + 160], xmm15
    movaps xmmword ptr [rsp + 144], xmm14
    movaps xmmword ptr [rsp + 128], xmm13
    movaps xmmword ptr [rsp + 112], xmm12
    movaps xmmword ptr [rsp + 96], xmm11
    movaps xmmword ptr [rsp + 80], xmm10
    movaps xmmword ptr [rsp + 64], xmm9
    movaps xmmword ptr [rsp + 48], xmm8
    movaps xmmword ptr [rsp + 32], xmm7
    movaps xmmword ptr [rsp + 16], xmm6
    mov r8d, dword ptr [rcx + 8]
    test r8, r8
    xorps xmm1, xmm1
    je .LBB0_19
    mov r10d, dword ptr [rcx]
    test r10d, r10d
    je .LBB0_2
    mov eax, dword ptr [rcx + 4]
    test eax, eax
    je .LBB0_15
    cvtsi2ss xmm0, r10
    movabs rdx, offset .LCPI0_0
    movss xmm1, dword ptr [rdx]
    divss xmm1, xmm0
    movss dword ptr [rsp + 12], xmm1
    xorps xmm0, xmm0
    cvtsi2ss xmm0, rax
    movabs rdx, offset .LCPI0_1
    movss xmm11, dword ptr [rdx]
    divss xmm11, xmm0
    xor r9d, r9d
    movabs rdx, offset .LCPI0_2
    movss xmm0, dword ptr [rdx]
    movss dword ptr [rsp + 8], xmm0
    movabs rdx, offset .LCPI0_3
    movss xmm10, dword ptr [rdx]
    movabs rdx, offset .LCPI0_4
    movss xmm13, dword ptr [rdx]
    movabs rdx, offset .LCPI0_5
    movss xmm14, dword ptr [rdx]
    xorps xmm12, xmm12
    movabs rdx, offset .LCPI0_6
    movss xmm15, dword ptr [rdx]
    .p2align 4, 0x90
    .LBB0_6:
    movss xmm7, dword ptr [rsp + 8]
    xor r11d, r11d
    .p2align 4, 0x90
    .LBB0_7:
    xor edx, edx
    movaps xmm4, xmm10
    .p2align 4, 0x90
    .LBB0_8:
    xorps xmm1, xmm1
    mov esi, -1
    xorps xmm8, xmm8
    .p2align 4, 0x90
    .LBB0_9:
    movaps xmm0, xmm1
    mulss xmm0, xmm0
    movaps xmm6, xmm8
    mulss xmm6, xmm6
    movaps xmm5, xmm6
    addss xmm5, xmm0
    xorps xmm3, xmm3
    rsqrtss xmm3, xmm5
    movaps xmm2, xmm5
    mulss xmm2, xmm3
    movaps xmm9, xmm2
    mulss xmm9, xmm13
    mulss xmm2, xmm3
    addss xmm2, xmm14
    mulss xmm2, xmm9
    cmpeqss xmm5, xmm12
    andnps xmm5, xmm2
    ucomiss xmm5, xmm15
    jae .LBB0_11
    movaps xmm1, xmm7
    subss xmm1, xmm6
    addss xmm1, xmm0
    addss xmm0, xmm0
    addss xmm0, xmm4
    inc esi
    cmp esi, 253
    movaps xmm8, xmm0
    jbe .LBB0_9
    .LBB0_11:
    addss xmm4, xmm11
    inc edx
    movsxd rsi, edx
    cmp rsi, rax
    jl .LBB0_8
    addss xmm7, dword ptr [rsp + 12]
    inc r11d
    movsxd rdx, r11d
    cmp rdx, r10
    jl .LBB0_7
    inc r9d
    movsxd rdx, r9d
    cmp rdx, r8
    jl .LBB0_6
    addss xmm1, xmm8
    jmp .LBB0_19
    .LBB0_2:
    mov eax, 1
    .p2align 4, 0x90
    .LBB0_3:
    movsxd rdx, eax
    lea eax, [rdx + 1]
    cmp rdx, r8
    jl .LBB0_3
    jmp .LBB0_19
    .LBB0_15:
    xor eax, eax
    .p2align 4, 0x90
    .LBB0_16:
    mov edx, 1
    .p2align 4, 0x90
    .LBB0_17:
    movsxd rsi, edx
    lea edx, [rsi + 1]
    cmp rsi, r10
    jl .LBB0_17
    inc eax
    movsxd rdx, eax
    cmp rdx, r8
    jl .LBB0_16
    .LBB0_19:
    movss dword ptr [rcx + 12], xmm1
    movaps xmm6, xmmword ptr [rsp + 16]
    movaps xmm7, xmmword ptr [rsp + 32]
    movaps xmm8, xmmword ptr [rsp + 48]
    movaps xmm9, xmmword ptr [rsp + 64]
    movaps xmm10, xmmword ptr [rsp + 80]
    movaps xmm11, xmmword ptr [rsp + 96]
    movaps xmm12, xmmword ptr [rsp + 112]
    movaps xmm13, xmmword ptr [rsp + 128]
    movaps xmm14, xmmword ptr [rsp + 144]
    movaps xmm15, xmmword ptr [rsp + 160]
    add rsp, 176
    pop rsi
    ret
    .Lfunc_end0:
    .size "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB", .Lfunc_end0-"Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB"

    .globl burst.initialize
    .p2align 4, 0x90
    .type burst.initialize,@function
    burst.initialize:
    ret
    .Lfunc_end1:
    .size burst.initialize, .Lfunc_end1-burst.initialize


    .section ".note.GNU-stack","",@progbits

    .text
    .intel_syntax noprefix
    .file "main"
    .section .rodata.cst4,"aM",@progbits,4
    .p2align 2
    .LCPI0_0:
    .long 1078355558
    .LCPI0_1:
    .long 1076258406
    .LCPI0_2:
    .long 3221644902
    .LCPI0_3:
    .long 3215353446
    .LCPI0_4:
    .long 3204448256
    .LCPI0_5:
    .long 3225419776
    .LCPI0_6:
    .long 1073741824
    .text
    .globl "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184"
    .p2align 4, 0x90
    .type "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184",@function
    "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184":
    push rsi
    sub rsp, 176
    movaps xmmword ptr [rsp + 160], xmm15
    movaps xmmword ptr [rsp + 144], xmm14
    movaps xmmword ptr [rsp + 128], xmm13
    movaps xmmword ptr [rsp + 112], xmm12
    movaps xmmword ptr [rsp + 96], xmm11
    movaps xmmword ptr [rsp + 80], xmm10
    movaps xmmword ptr [rsp + 64], xmm9
    movaps xmmword ptr [rsp + 48], xmm8
    movaps xmmword ptr [rsp + 32], xmm7
    movaps xmmword ptr [rsp + 16], xmm6
    mov r8d, dword ptr [rcx + 8]
    test r8, r8
    xorps xmm1, xmm1
    je .LBB0_19
    mov r10d, dword ptr [rcx]
    test r10d, r10d
    je .LBB0_2
    mov eax, dword ptr [rcx + 4]
    test eax, eax
    je .LBB0_15
    cvtsi2ss xmm0, r10
    movabs rdx, offset .LCPI0_0
    movss xmm1, dword ptr [rdx]
    divss xmm1, xmm0
    movss dword ptr [rsp + 12], xmm1
    xorps xmm0, xmm0
    cvtsi2ss xmm0, rax
    movabs rdx, offset .LCPI0_1
    movss xmm11, dword ptr [rdx]
    divss xmm11, xmm0
    xor r9d, r9d
    movabs rdx, offset .LCPI0_2
    movss xmm0, dword ptr [rdx]
    movss dword ptr [rsp + 8], xmm0
    movabs rdx, offset .LCPI0_3
    movss xmm10, dword ptr [rdx]
    movabs rdx, offset .LCPI0_4
    movss xmm13, dword ptr [rdx]
    movabs rdx, offset .LCPI0_5
    movss xmm14, dword ptr [rdx]
    xorps xmm12, xmm12
    movabs rdx, offset .LCPI0_6
    movss xmm15, dword ptr [rdx]
    .p2align 4, 0x90
    .LBB0_6:
    movss xmm7, dword ptr [rsp + 8]
    xor r11d, r11d
    .p2align 4, 0x90
    .LBB0_7:
    xor edx, edx
    movaps xmm2, xmm10
    .p2align 4, 0x90
    .LBB0_8:
    xorps xmm1, xmm1
    mov esi, -1
    xorps xmm8, xmm8
    .p2align 4, 0x90
    .LBB0_9:
    movaps xmm0, xmm1
    mulss xmm0, xmm0
    movaps xmm6, xmm8
    mulss xmm6, xmm6
    movaps xmm5, xmm6
    addss xmm5, xmm0
    xorps xmm3, xmm3
    rsqrtss xmm3, xmm5
    movaps xmm4, xmm5
    mulss xmm4, xmm3
    movaps xmm9, xmm4
    mulss xmm9, xmm13
    mulss xmm4, xmm3
    addss xmm4, xmm14
    mulss xmm4, xmm9
    cmpeqss xmm5, xmm12
    andnps xmm5, xmm4
    ucomiss xmm5, xmm15
    jae .LBB0_11
    movaps xmm1, xmm7
    subss xmm1, xmm6
    addss xmm1, xmm0
    addss xmm0, xmm0
    addss xmm0, xmm2
    inc esi
    cmp esi, 253
    movaps xmm8, xmm0
    jbe .LBB0_9
    .LBB0_11:
    addss xmm2, xmm11
    inc edx
    movsxd rsi, edx
    cmp rsi, rax
    jl .LBB0_8
    addss xmm7, dword ptr [rsp + 12]
    inc r11d
    movsxd rdx, r11d
    cmp rdx, r10
    jl .LBB0_7
    inc r9d
    movsxd rdx, r9d
    cmp rdx, r8
    jl .LBB0_6
    addss xmm1, xmm8
    jmp .LBB0_19
    .LBB0_2:
    mov eax, 1
    .p2align 4, 0x90
    .LBB0_3:
    movsxd rdx, eax
    lea eax, [rdx + 1]
    cmp rdx, r8
    jl .LBB0_3
    jmp .LBB0_19
    .LBB0_15:
    xor eax, eax
    .p2align 4, 0x90
    .LBB0_16:
    mov edx, 1
    .p2align 4, 0x90
    .LBB0_17:
    movsxd rsi, edx
    lea edx, [rsi + 1]
    cmp rsi, r10
    jl .LBB0_17
    inc eax
    movsxd rdx, eax
    cmp rdx, r8
    jl .LBB0_16
    .LBB0_19:
    movss dword ptr [rcx + 12], xmm1
    movaps xmm6, xmmword ptr [rsp + 16]
    movaps xmm7, xmmword ptr [rsp + 32]
    movaps xmm8, xmmword ptr [rsp + 48]
    movaps xmm9, xmmword ptr [rsp + 64]
    movaps xmm10, xmmword ptr [rsp + 80]
    movaps xmm11, xmmword ptr [rsp + 96]
    movaps xmm12, xmmword ptr [rsp + 112]
    movaps xmm13, xmmword ptr [rsp + 128]
    movaps xmm14, xmmword ptr [rsp + 144]
    movaps xmm15, xmmword ptr [rsp + 160]
    add rsp, 176
    pop rsi
    ret
    .Lfunc_end0:
    .size "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184", .Lfunc_end0-"Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184"

    .globl burst.initialize
    .p2align 4, 0x90
    .type burst.initialize,@function
    burst.initialize:
    ret
    .Lfunc_end1:
    .size burst.initialize, .Lfunc_end1-burst.initialize


    .section ".note.GNU-stack","",@progbits

    Spot the difference :)
     
  35. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    416
    Thanks for this benchmark @nxrighthere, that's cool!

    So without looking deeply into the code, I have two preliminary remarks:

    1. Make sure that your job in Burst uses fast floating-point math (as you set it up for GCC); otherwise I believe Burst is at a significant disadvantage codegen-wise. Set it up via:
    Code (CSharp):
    [BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]
    2. Comparing against GCC 8.1.0 (a relatively recent version) can also skew the results a bit, because Burst uses Clang 6.0 (an old Clang, but we plan to upgrade to 8.x in the coming weeks).
     
  36. xoofx

    (Related to remark 2: it would be interesting to see how Clang performs on the same C code, for example.)
     
  37. xoofx

    Also, as @Joachim_Ante stated above, you should do a warm-up run of your Burst job without measuring it, so that the test doesn't count Burst compilation time. So usually prefer compiling synchronously, plus a warm-up run.
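
    A minimal sketch of that measuring pattern in plain C#, assuming a hypothetical MeasureTicks helper (it is not part of Burst or Unity): the work runs once untimed, so one-time costs such as compilation land outside the measurement, and only the second run is timed. For a Burst job, the delegate would wrap a call such as job.Run().

```csharp
using System;
using System.Diagnostics;

public static class WarmupTiming
{
    // Hypothetical helper: runs `work` once untimed, then times a second run.
    public static long MeasureTicks(Action work)
    {
        work(); // warm-up run: absorbs one-time costs such as compilation

        var stopwatch = Stopwatch.StartNew();
        work(); // measured run
        stopwatch.Stop();

        return stopwatch.ElapsedTicks;
    }

    public static void Main()
    {
        long ticks = MeasureTicks(() =>
        {
            // Placeholder workload standing in for a benchmark job.
            double sum = 0.0;
            for (int i = 1; i <= 1000000; i++) sum += Math.Sqrt(i);
            GC.KeepAlive(sum);
        });

        Console.WriteLine("Measured run: " + ticks + " ticks");
    }
}
```

    The same shape works for any workload: whatever must not be counted happens in the first call, and the stopwatch only ever sees the second one.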
     
  38. xoofx

    Just to emphasize why "fast" mode (remark 1 above) can make a huge difference:

    Fast allows the compiler to reorder floating-point instructions, loosening floating-point strictness in favor of speed. This alone, in Burst/LLVM or GCC, opens up many more optimization opportunities for the compiler: collapsing many instructions, vectorization across instructions (called the SLP vectorizer in LLVM), and potentially loop vectorization (but I believe that in the case of the raytracer, without [NoAlias] hints on pointer arguments in the code, you should not get any).
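
    To illustrate why reordering matters: floating-point addition is not associative, so regrouping a sum can change the result, and a strict compiler is not allowed to do it on its own. Below is a minimal sketch in plain C# (no Burst involved; the class and method names are made up for illustration). The explicit (float) casts are there to force single-precision rounding of the intermediate values.

```csharp
using System;

public static class FloatReorderingDemo
{
    // Adds left to right, exactly as the source is written.
    public static float SumInOrder(float a, float b, float c)
    {
        return (float)((float)(a + b) + c);
    }

    // Same three values, but regrouped the way a reordering compiler might.
    public static float SumRegrouped(float a, float b, float c)
    {
        return (float)(a + (float)(b + c));
    }

    public static void Main()
    {
        float a = 1e8f, b = -1e8f, c = 1f;

        Console.WriteLine(SumInOrder(a, b, c));
        Console.WriteLine(SumRegrouped(a, b, c));
    }
}
```

    With 1e8f, -1e8f and 1f, the in-order sum gives 1 while the regrouped sum gives 0, because -1e8f + 1f rounds back to -1e8f in single precision. A fast mode accepts this kind of discrepancy in exchange for the freedom to reorder and vectorize.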
     
    Last edited: Jul 25, 2019
  39. nxrighthere

    @Wokky @Joachim_Ante @xoofx Ah, thanks for all the notes! Indeed, CompileSynchronously = true, warm-up, and float mode are things I wasn't aware of (I'm not very familiar with Burst options at the moment); I will address this quickly.

    As for using double in NBody: yes, I agree. I just did that out of curiosity; performance-wise, float should be used instead for sure.

    Clang and IL2CPP will be added very soon for the sake of completeness.
     
  40. nxrighthere

    Done, you can find the diff with numbers here. Notable changes: FloatPrecision.Standard with FloatMode.Fast improved Mandelbrot performance by 52%; it's now very close to GCC. Fast float mode makes the raytracer up to 91% slower than the default mode on my machine, so I haven't used it there. Everything else remains almost the same. NBody, which uses double-precision floating point, is unaffected.
     
    Last edited: Jul 25, 2019
  41. Arowx

    I'm seeing if I can improve the raytracing results, as its code base is designed to be minimal, which is the opposite of a good benchmark, IMHO.

    Initial results
    (Burst) Pixar Raytracer: 167,786,775 ticks
    After some tweaks
    (Burst) Pixar Raytracer: 160,050,339 ticks

    Tried converting to float4 for maximum Burst but actually got a surprise...
    (Burst) Pixar Raytracer: 161,807,136 ticks

    It was slower than float3...?
     
  42. Arowx

    Code (CSharp):
    [BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]
    produces a big drop in performance for me, e.g.:
    (Burst) Pixar Raytracer
    149,860,110 ticks [BurstCompile]
    >222,000,000 ticks [BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]

    (Burst) Pixar Raytracer: 138,261,278 ticks [Pre-warmed]
     
    Last edited: Jul 25, 2019
  43. Arowx

    It looks like there is a bug in your code: it attempts to access index -1 of the letters array in the raytracer...
    Code (CSharp):
    Vector begin = MultiplyFloat(new Vector { x = letters[i] - 79.0f, y = letters[i - 1] - 79.0f, z = 0.0f }, 0.5f);
     
  44. nxrighthere

  45. nxrighthere

    Added Clang (LLVM 8) and IL2CPP (Unity 2019.1.11f1).

    So Burst's LLVM sibling, Clang, is slower than Burst in some tests but significantly faster in the raytracer, since fast-math optimizations work properly there; it is almost as fast as GCC.

    @xoofx Are you guys using the original SLEEF for vectorized math? I wonder whether it's worth linking it natively and testing it against Unity's built-in math.
     
    Last edited: Jul 26, 2019
  46. nxrighthere

    I think it would be better to sort benchmarks by integer/float/double categories and make 5 different algorithms/simulations for each (except double-precision). :rolleyes:
     
  47. Arowx

    My attempt at optimising the PixarRaytracer...

    Code (CSharp):
    [BurstCompile(CompileSynchronously = true)]
    //[BurstCompile(FloatPrecision.Standard, FloatMode.Fast)] // this is slower for me than just BurstCompile
    private unsafe struct PixarRaytracerBurst : IJob
    {
        public uint width;
        public uint height;
        public uint samples;
        public float result;

        public NativeArray<float4> lettersBegin;
        public NativeArray<float4> lettersE;

        public void Execute()
        {
            result = PixarRaytracer(width, height, samples, lettersBegin, lettersE);
        }

        private uint marsagliaZ, marsagliaW;

        private float PixarRaytracer(uint width, uint height, uint samples, NativeArray<float4> lb, NativeArray<float4> le)
        {
            lettersBegin = lb;
            lettersE = le;

            marsagliaZ = 666;
            marsagliaW = 999;

            float4 position = new float4 { x = -22.0f, y = 5.0f, z = 25.0f };
            float4 goal = new float4 { x = -3.0f, y = 4.0f, z = 0.0f };

            goal = Inverse(goal) + (position * -1.0f);

            float4 left = new float4 { x = goal.z, y = 0, z = goal.x };

            left = Inverse(left) * (1.0f / width);

            float4 up = new float4(math.cross(goal.xyz, left.xyz), 0f);
            float4 color = new float4();
            float4 o = new float4();

            uint y;
            uint x;
            uint p;

            float colorPlus = 14.0f / 241.0f;
            float invSamples = 1.0f / samples;

            float4 goalleft = goal + left;
            float xwidth;
            float halfwidth = width / 2f;
            float yheight;
            float halfheight = height / 2f;

            for (y = height; y > 0; y--)
            {
                yheight = y - halfheight;

                for (x = width; x > 0; x--)
                {
                    xwidth = x - halfwidth;

                    for (p = samples; p > 0; p--)
                    {
                        color = (color + Trace(position, (Inverse(((goalleft) * (xwidth + Random()))) + (up * (yheight + Random())))));
                    }

                    color = (color * invSamples + colorPlus);
                    o = color + 1.0f;
                    color = color / o;

                    color = color * 255.0f;
                }
            }

            return color.x + color.y + color.z;
        }

        /*
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float3 Multiply(float3 left, float3 right)
        {
            left.x *= right.x;
            left.y *= right.y;
            left.z *= right.z;

            return left;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float3 MultiplyFloat(float3 vector, float value)
        {
            vector.x *= value;
            vector.y *= value;
            vector.z *= value;

            return vector;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float Modulus(float3 left, float3 right)
        {
            return left.x * right.x + left.y * right.y + left.z * right.z;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float ModulusSelf(float3 vector)
        {
            return vector.x * vector.x + vector.y * vector.y + vector.z * vector.z;
        }
        */

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float4 Inverse(float4 vector)
        {
            return vector * 1f / math.length(vector);
        }

        /*
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float3 Add(float3 left, float3 right)
        {
            left.x += right.x;
            left.y += right.y;
            left.z += right.z;

            return left;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float3 AddFloat(float3 vector, float value)
        {
            vector.x += value;
            vector.y += value;
            vector.z += value;

            return vector;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float3 Cross(float3 to, float3 from)
        {
            to.x *= from.z - to.z * from.y;
            to.y *= from.x - to.x * from.z;
            to.z *= from.y - to.y * from.x;

            return to;
        }
        */

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float Min(float left, float right)
        {
            return left < right ? left : right;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float BoxTest(float4 position, float4 lowerLeft, float4 upperRight)
        {
            lowerLeft = (position + lowerLeft) * -1;
            upperRight = ((upperRight + position) * -1);

            return -Min(Min(Min(lowerLeft.x, upperRight.x), Min(lowerLeft.y, upperRight.y)), Min(lowerLeft.z, upperRight.z));
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float Random()
        {
            marsagliaZ = 36969 * (marsagliaZ & 65535) + (marsagliaZ >> 16);
            marsagliaW = 18000 * (marsagliaW & 65535) + (marsagliaW >> 16);

            return (((marsagliaZ << 16) + marsagliaW) + 1.0f) * 3.141592653589793f;
        }


        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float Sample(float4 position, int* hitType)
        {
            const int size = 15 * 4;

            float distance = 1e9f;
            float4 f = position;

            f.z = 0.0f;

            float4 begin;
            float4 e;
            float4 o;

            for (int i = 0; i < size; i += 4)
            {
                begin = lettersBegin[i];
                e = lettersE[i];
                o = ((f + ((begin + e) * Min(-Min(math.dot(((begin + f) * -1.0f), e) / math.dot(e, e), 0.0f), 1.0f))) * -1.0f);

                //Vector o = MultiplyFloat(Add(f, MultiplyFloat(Add(begin, e), Min(-Min(Modulus(MultiplyFloat(Add(begin, f), -1.0f), e) / ModulusSelf(e), 0.0f), 1.0f))), -1.0f);

                distance = Min(distance, math.dot(o, o));
            }

            distance = math.sqrt(distance);

            float4* curves = stackalloc float4[2];

            curves[0] = new float4 { x = -11.0f, y = 6.0f, z = 0.0f };
            curves[1] = new float4 { x = 11.0f, y = 6.0f, z = 0.0f };

            float m;

            // curves holds two elements, so the valid indices are 1 and 0
            // (starting at 2 would read past the end of the stackalloc buffer).
            for (int i = 1; i >= 0; i--)
            {
                o = (f + (curves[i] * -1.0f));
                m = 0.0f;

                if (o.x > 0.0f)
                {
                    m = math.abs(math.length(o) - 2.0f);
                }
                else
                {
                    if (o.y > 0.0f)
                        o.y += -2.0f;
                    else
                        o.y += 2.0f;

                    o.y += math.length(o);
                }

                distance = Min(distance, m);
            }

            distance = math.pow(math.pow(distance, 8.0f) + math.pow(position.z, 8.0f), 0.125f) - 0.5f;
            *hitType = (int)Hit.Letter;



            float roomDistance = Min(-Min(
                BoxTest(position,
                    new float4 { x = -30.0f, y = -0.5f, z = -30.0f },
                    new float4 { x = 30.0f, y = 18.0f, z = 30.0f }),
                BoxTest(position,
                    new float4 { x = -25.0f, y = -17.5f, z = -25.0f },
                    new float4 { x = 25.0f, y = 20.0f, z = 25.0f })),
                BoxTest(new float4 { x = math.fmod(math.abs(position.x), 8), y = position.y, z = position.z },
                    new float4 { x = 1.5f, y = 18.5f, z = -25.0f },
                    new float4 { x = 6.5f, y = 20.0f, z = 25.0f }));

            if (roomDistance < distance)
            {
                distance = roomDistance;
                *hitType = (int)Hit.Wall;
            }

            float sun = 19.9f - position.y;

            if (sun < distance)
            {
                distance = sun;
                *hitType = (int)Hit.Sun;
            }

            return distance;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private int RayMarching(float4 origin, float4 direction, float4* hitPosition, float4* hitNormal)
        {
            int hitType = (int)Hit.None;
            int noHitCount = 0;
            float distance = 0.0f;

            float4 offsetX = new float4 { x = 0.01f, y = 0.0f, z = 0.0f };
            float4 offsetY = new float4 { x = 0.0f, y = 0.01f, z = 0.0f };
            float4 offsetZ = new float4 { x = 0.0f, y = 0.0f, z = 0.01f };


            for (float i = 0; i < 100; i += distance)
            {
                *hitPosition = (origin + direction) * i;
                distance = Sample(*hitPosition, &hitType);

                if (distance < 0.01f || ++noHitCount > 99)
                {
                    *hitNormal = Inverse(new float4 {
                        x = Sample((*hitPosition + offsetX), &noHitCount) - distance,
                        y = Sample((*hitPosition + offsetY), &noHitCount) - distance,
                        z = Sample((*hitPosition + offsetZ), &noHitCount) - distance });

                    return hitType;
                }
            }

            return (int)Hit.None;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private float4 Trace(float4 origin, float4 direction)
        {
            float4 sampledPosition = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
            float4 normal = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
            float4 color = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
            float4 attenuation = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
            float4 lightDirection = Inverse(new float4 { x = 0.6f, y = 0.6f, z = 1.0f });

            float incidence;
            float p;
            float c;
            float s;
            float g;
            float u;
            float v;

            Hit hitType;

            for (int bounceCount = 3; bounceCount > 0; bounceCount--)
            {
                hitType = (Hit)RayMarching(origin, direction, &sampledPosition, &normal);

                switch (hitType)
                {
                    case Hit.None:
                        break;

                    case Hit.Letter:
                        {
                            direction = ((direction + normal) * math.dot(normal, direction) * -2.0f);
                            origin = ((sampledPosition + direction) * 0.1f);
                            attenuation = (attenuation * 0.2f);

                            break;
                        }

                    case Hit.Wall:
                        {
                            incidence = math.dot(normal, lightDirection);
                            p = 6.283185f * Random();
                            c = Random();
                            s = math.sqrt(1.0f - c);
                            g = normal.z < 0 ? -1.0f : 1.0f;
                            u = -1.0f / (g + normal.z);
                            v = normal.x * normal.y * u;

                            direction = ((new float4 { x = v, y = g + normal.y * normal.y * u, z = -normal.y * (math.cos(p) * s) } +
                                          new float4 { x = 1.0f + g * normal.x * normal.x * u, y = g * v, z = -g * normal.x })
                                          + (normal * math.sqrt(c)));

                            origin = ((sampledPosition + direction) * 0.1f);
                            attenuation = (attenuation * 0.2f);

                            if (incidence > 0 && RayMarching(((sampledPosition + normal) * 0.1f), lightDirection, &sampledPosition, &normal) == (int)Hit.Sun)
                                color = (((color + attenuation) * new float4 { x = 500.0f, y = 400.0f, z = 100.0f }) * incidence);

                            break;
                        }

                    case Hit.Sun:
                        {
                            color = ((color + attenuation) * new float4 { x = 50.0f, y = 80.0f, z = 100.0f });

                            goto escape;
                        }
                }
            }

        escape:

            return color;
        }
    }
    You now need to launch it like this:

    Code (CSharp):
    {
        NativeArray<float4> lb;
        NativeArray<float4> le;

        SetupLetters(out lb, out le); // pre-calculating letter data to vector information.

        var pixarRaytracerBurst = new PixarRaytracerBurst
        {
            width = 720,
            height = 480,
            samples = pixarRaytracer,
            lettersBegin = lb,
            lettersE = le
        };

        pixarRaytracerBurst.Run();

        stopwatch.Restart();
        pixarRaytracerBurst.Run();

        time = stopwatch.ElapsedTicks;

        Debug.Log("(Burst) Pixar Raytracer: " + time + " ticks");
    }
    Note the changes: Vector is swapped out for float4, many of the inline functions are converted to math.<functions>, and, probably the largest change, NativeArrays are added to store the pre-calculated letter vectors used by the Sample function.

    Score before changes:

    (Burst) Pixar Raytracer: 167,867,050 ticks

    Score after changes:

    (Burst) Pixar Raytracer: 138,667,282 ticks
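
    To put the two scores in relative terms, here is a small sketch in plain C# (the TickComparison class and its method names are made up for this example) that turns before/after tick counts into a percentage reduction and a speed-up factor.

```csharp
using System;

public static class TickComparison
{
    // Share of the original ticks that the change saved, in percent.
    public static double PercentReduction(long before, long after)
    {
        return 100.0 * (before - after) / before;
    }

    // How many times faster the new run is than the old one.
    public static double Speedup(long before, long after)
    {
        return (double)before / after;
    }

    public static void Main()
    {
        long before = 167867050; // score before changes
        long after = 138667282;  // score after changes

        Console.WriteLine(PercentReduction(before, after).ToString("F1") + "% fewer ticks");
        Console.WriteLine(Speedup(before, after).ToString("F2") + "x speed-up");
    }
}
```

    For the scores above this works out to roughly a 17.4% reduction in ticks, or about a 1.21x speed-up.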

    Even with these changes I still think this is a poor benchmarking algorithm due to the Sample function constantly re-calculating 'ray' to letter proximity without any spatial optimisation.
     
  48. xoofx

    This is not important for comparing codegen between compilers on the same codebase. The point here is less about writing the fastest ray calculation code than about comparing how simple code behaves in Burst and other compilers. I would like to compare (almost) apples to apples here, where we don't change the Burst C# code relative to the C version. That way, we can track anything that differs between our codegen for C# and C++.

    This is surprising. We will have a look. It could be that we have a bug here in Burst.

    We have a benchmark suite in the Burst codebase for comparing C# and C++ code, but we never had a chance to work on practical cases, so @nxrighthere, if you don't mind, we will gladly integrate your benchmarks into our suite. This is exactly the kind of benchmark that was meant to go into it.
     
  49. nxrighthere

    It would be a pleasure, I'm happy to provide even more tests.
     
  50. Shinyclef

    Shinyclef

    Joined:
    Nov 20, 2013
    Posts:
    502
    These tests are fantastic. I'm going to have to find some time at some point to read through more carefully and understand the key lessons about what Burst likes and doesn't like.
     