ECS Jobs are surprisingly slow without Burst

Discussion in 'Data Oriented Technology Stack' started by Mike37, Aug 17, 2019.

  1. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    For some reason, I'm finding ECS jobs without Burst are nearly 20x slower than plain C#. I have the following very simple system:
    Code (CSharp):
    using Unity.Burst;
    using Unity.Entities;
    using Unity.Jobs;
    using Unity.Mathematics;

    struct BenchmarkComponent : IComponentData
    {
        public float3 Value;
    }

    class BenchmarkSystem : JobComponentSystem
    {
        [BurstCompile]
        struct BenchmarkEditComponentJob : IJobForEach<BenchmarkComponent>
        {
            public void Execute(ref BenchmarkComponent component)
            {
                component.Value = math.sqrt(5 + component.Value);
            }
        }

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            // ScheduleSingle runs the whole job on a single worker thread.
            return new BenchmarkEditComponentJob().ScheduleSingle(this, inputDeps);
        }
    }
    And for comparison, this regular C# code (run as a command line program, not using Unity):
    Code (CSharp):
    using System;
    using System.Diagnostics;
    using System.Numerics;

    Vector3[] values = new Vector3[1_000_000];
    const int iterations = 100;

    var watch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        for (int j = 0; j < values.Length; j++)
        {
            values[j].X = (float)Math.Sqrt(5 + values[j].X);
            values[j].Y = (float)Math.Sqrt(5 + values[j].Y);
            values[j].Z = (float)Math.Sqrt(5 + values[j].Z);
        }
    }

    // Average time per pass over the array, in milliseconds.
    Console.WriteLine(watch.Elapsed.TotalMilliseconds / iterations);
    Below are the times I measured. Unity times were measured with the profiler in a standalone development build with 1 million entities.

    Regular C#: 5.5 ms
    Unity Burst: 2 ms
    Unity IL2CPP (no Burst): 20 ms
    Unity Mono (no Burst): 103 ms

    So Burst gives a nice 2.75x speedup over plain C#. But IL2CPP, which I'd expect to be faster than C#, is much slower, and Mono is nearly 20x slower. Why is this?
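For reference, the ratios quoted here follow directly from the four timings listed above. A minimal sketch of the arithmetic (the class and variable names are only illustrative):

```csharp
using System;

static class TimingRatios
{
    static void Main()
    {
        // Per-update timings from the measurements above, in milliseconds.
        const double plainCSharp = 5.5;
        const double burst = 2.0;
        const double il2cpp = 20.0;
        const double mono = 103.0;

        // Each ratio is simply slower time / faster time.
        Console.WriteLine($"Burst vs plain C#:  {plainCSharp / burst:F2}x faster");
        Console.WriteLine($"IL2CPP vs plain C#: {il2cpp / plainCSharp:F2}x slower");
        Console.WriteLine($"Mono vs plain C#:   {mono / plainCSharp:F2}x slower");
        Console.WriteLine($"Mono vs Burst:      {mono / burst:F2}x slower");
    }
}
```

That works out to roughly 2.75x, 3.6x, 18.7x and 51.5x respectively.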
     
  2. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    This isn't a fair comparison. The Unity code is running a system, iterating through chunks, scheduling a job, and then waiting on a semaphore for it to complete, then recording the time (I think? I have no idea how you are profiling 100 iterations) while the bottom code is iterating through a flat array.

    I forget whether the safety system is included in development builds. If it is, that's another factor.
     
  3. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    That might explain a 10% difference in time. It doesn't come close to explaining an 1,870% difference.

    Moreover, all that overhead would apply to Burst as well. If iterating through chunks and scheduling the job accounted for 97 ms of the time it took, the Burst version would be taking over 97 ms. All I did was remove the BurstCompile attribute for the non-Burst versions.

    The 100 iterations just keeps the timing consistent (and keeps the cache warm). Unity iterates on the same data every update and the timings in the profiler are already consistent.
     
  4. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    I've seen crazy overheads in C# before. And it is totally possible that Burst is eating that overhead and that's why it is only 2.75x faster and not 50x faster like I usually see. You really need to have an apples to apples comparison especially when comparing Burst to non-Burst. There's a lot of code in chunk iteration that is optimized for Burst but is probably a lot slower with normal C# compilers because it isn't really optimized for that.
     
  5. GilCat

    Joined:
    Sep 21, 2013
    Posts:
    417
    Interesting tests!
    I tried them on my computer, and just changing math.sqrt to Math.Sqrt gives a 2x boost, but only when compiled without Burst.
    Using IJobChunk also gives another 2x boost without Burst, especially due to IJobForEach<> passing the array elements by ref.
    So like @DreamingImLatios said there is a lot of overhead that burst will eat up.
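A rough sketch of what an IJobChunk version of the benchmark could look like. This is not the code that was actually measured; it assumes the 2019-era preview Entities API (IJobChunk, ArchetypeChunkComponentType, EntityQuery) and reuses the BenchmarkComponent struct from the first post. The system and job names are made up for illustration.

```csharp
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical system name; BenchmarkComponent is the struct from the first post.
class BenchmarkChunkSystem : JobComponentSystem
{
    EntityQuery m_Query;

    // No [BurstCompile] here, to match the non-Burst measurements.
    struct BenchmarkChunkJob : IJobChunk
    {
        public ArchetypeChunkComponentType<BenchmarkComponent> ComponentType;

        public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
        {
            // One array per chunk; the loop over elements is written by hand
            // instead of receiving each element by ref from IJobForEach.
            NativeArray<BenchmarkComponent> components = chunk.GetNativeArray(ComponentType);
            for (int i = 0; i < components.Length; i++)
            {
                BenchmarkComponent c = components[i];
                c.Value = math.sqrt(5 + c.Value);
                components[i] = c;
            }
        }
    }

    protected override void OnCreate()
    {
        m_Query = GetEntityQuery(typeof(BenchmarkComponent));
    }

    protected override JobHandle OnUpdate(JobHandle inputDeps)
    {
        var job = new BenchmarkChunkJob
        {
            ComponentType = GetArchetypeChunkComponentType<BenchmarkComponent>()
        };
        // Schedule() spreads chunks across worker threads, unlike the
        // ScheduleSingle() call in the original benchmark.
        return job.Schedule(m_Query, inputDeps);
    }
}
```

Swapping math.sqrt for Math.Sqrt on each component inside the loop would cover the other change mentioned above.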
     
  6. Creepgin

    Joined:
    Dec 14, 2010
    Posts:
    268
    Well, the overhead of the Job System will be very apparent if you are just using ScheduleSingle (which I assume you are doing on purpose for this test). You only have massive gains when you can run things in parallel.
     
    Last edited: Aug 17, 2019
  7. tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    Odd, considering that without Burst it's just calling Math.Sqrt with a simple cast, so I don't understand why that would halve performance.

    Code (CSharp):
    /// <summary>Returns the square root of a float value.</summary>
    public static float sqrt(float x) { return (float)System.Math.Sqrt((float)x); }
     
  8. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    It's not possible that Burst is eating 97 ms of overhead. The entire Burst job is only 2 ms. The entire frame is only about 4 ms.

    I think you have your analysis backwards: you usually see Burst 50x faster because Unity Jobs compiled with Mono are exceptionally slow. Burst shouldn't be 50x faster than regular C# because hand-optimized, vectorized C++ isn't 50x faster. It's 2-3x faster: https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-gpp.html

    Yeah, that's really the crux of this question. It looks like something isn't optimized well when Burst is disabled. What is it and can it be improved?

    I have a few Jobs in another project that don't currently work with Burst. It would suck to know I'm taking a 20x performance hit by using ECS instead of just sticking the data in a plain array.

    Just tried that, and I'm also getting the boost. Odd. Maybe it's not getting inlined with Mono?
     
  9. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    The total time (summed) is about the same if I change ScheduleSingle() to Schedule(), just divided among 4 threads. It improves frame time of course, but so would sticking the regular C# code in a Parallel.For() loop (brings it down from 5.5 ms to about 2 ms).

    Edit: The total times with Schedule() are about the same for the non-Burst versions. For the Burst version, the total time increases a bit to 3-4 ms (about 1 ms per thread). Probably because overhead is noticeable on a 1 ms job but not so much on a 25 ms job.
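For anyone who wants to try the Parallel.For() comparison mentioned above, here is a minimal sketch of the console benchmark with the inner loop parallelized. It assumes System.Numerics.Vector3 as in the original console program; the class name is only illustrative, and timings will vary with core count.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;
using System.Threading.Tasks;

static class ParallelBenchmark
{
    static void Main()
    {
        var values = new Vector3[1_000_000];
        const int iterations = 100;

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // Parallel.For blocks until every index has been processed,
            // so each outer iteration is one full pass over the array.
            Parallel.For(0, values.Length, j =>
            {
                values[j].X = (float)Math.Sqrt(5 + values[j].X);
                values[j].Y = (float)Math.Sqrt(5 + values[j].Y);
                values[j].Z = (float)Math.Sqrt(5 + values[j].Z);
            });
        }

        // Average wall-clock time per pass, in milliseconds.
        Console.WriteLine(watch.Elapsed.TotalMilliseconds / iterations);
    }
}
```

Note that this measures wall-clock time per pass, not the summed time across threads.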
     
    Last edited: Aug 17, 2019
  10. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    It's this concept that makes the comparison completely unfair. Mono can optimize basic math about as well as C++ can minus the vectorization. But it loses out on a lot of that optimization the more complicated your code becomes. More function calls, even if they are just wrappers? Less optimizations. More generics? Less optimizations. More unique classes? Less optimizations. Passing things by refs? Less optimizations. Complex data structures that start to mix unsafe contexts? Way less optimizations. And Unity's ECS codebase is heavily affected by this. And because of the way Burst works, it can see all the code involved in a job at once and aggressively optimize. So as the codebase becomes more complex, Burst's speedup over Mono becomes larger.

    Unity isn't really asking the question "how do we make non-burst code faster in jobs?" as much as they are asking "how do we make non-burstable job code burstable?"

    I'm not saying you are wrong. I'm just saying your comparison is so apples vs oranges that I am skeptical of your extrapolated conclusion.
     
  11. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    No, that would just mean they need to add a compiler hint to inline it. Although I don't think that's the problem because a) C# would inline that function, and b) if I apply the NoInlining attribute in the non-Unity version, it only increases the time from 5.2 ms to 7 ms. We'd still have to explain what the CPU is doing for the other 96 ms.

    But the Mono build is 19x slower than C#, not C++. Even the C++ build, using IL2CPP, is 4x slower than C#.
    Burst is not globally disabled. It's only disabled for that one, single-line benchmark job. Everything else is the same.

    It's a completely fair comparison. Yes, ECS has a bit of extra work to do in iterating through the chunks and scheduling a job, but that should account for less than 1 ms. Yes, C# is slower than optimized, vectorized C++, which is why regular C# takes 5.5 ms to do this instead of 2 ms. But the Mono and IL2CPP versions should be taking somewhere in the ballpark of the C# version (5.5 ms). Not 100 ms. 2-3x slower than Burst makes sense. 50x slower does not. Nothing so far explains why math.sqrt() makes it twice as slow as Math.Sqrt(), or why iterating the elements yourself using IJobChunk is so much faster than IJobForEach (passing a parameter in by ref should be free if the method is inlined, and still very cheap otherwise).
     
  12. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    So I think I understand a bit of why math.sqrt() is slow without Burst. They define the method like this:
    Code (CSharp):
    /// <summary>Returns the componentwise square root of a float3 vector.</summary>
    public static float3 sqrt(float3 x) { return new float3(sqrt(x.x), sqrt(x.y), sqrt(x.z)); }
    There's no AggressiveInlining attribute, and there's no "in" or "ref" on the parameter. float3 is a struct, so the whole thing gets copied whenever it's passed to a function. Conceptually, with no inlining or other optimizations, our one-line benchmarking method might look like this:
    Code (CSharp):
    // equivalent to:
    // component.Value = math.sqrt(5 + component.Value);
    var five = new float3(5);
    var lhs = five;    // copy
    var rhs = component.Value;    // copy
    var sum = new float3(lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z);
    var arg = sum;    // copy
    // Math.Sqrt returns double, so each component is cast back to float.
    var sqrt = new float3((float)Math.Sqrt(arg.x), (float)Math.Sqrt(arg.y), (float)Math.Sqrt(arg.z));
    component.Value = sqrt;    // copy
    Notice every line has to make a new float3 or a copy, copying 3 floats each time.

    Now the compiled Mono code shouldn't actually be that bad, because it will inline some of the function calls and optimize away a bunch of redundant copies. But in general, passing large structs by value to non-inlined functions will hurt performance. The issue is compounded if methods in the math class in turn call other methods that are also not inlined and take parameters by value.

    Anyway, that's my guess for why math.sqrt() is slow. It should be fixable either by adding an AggressiveInlining attribute or making overloads that take parameters by reference.
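To make the two proposed fixes concrete, here is a small standalone sketch. Float3 and MathSketch are hypothetical stand-ins, not the real Unity.Mathematics types, and whether either change actually helps under Mono is still just a guess.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for Unity.Mathematics.float3.
struct Float3
{
    public float x, y, z;
    public Float3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

static class MathSketch
{
    // Option 1: keep the by-value signature but ask the JIT to inline the call.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Float3 Sqrt(Float3 v)
    {
        return new Float3((float)Math.Sqrt(v.x), (float)Math.Sqrt(v.y), (float)Math.Sqrt(v.z));
    }

    // Option 2: take the argument by read-only reference (C# 7.2 "in"),
    // so a non-inlined call passes a pointer instead of copying three floats.
    public static Float3 SqrtByRef(in Float3 v)
    {
        return new Float3((float)Math.Sqrt(v.x), (float)Math.Sqrt(v.y), (float)Math.Sqrt(v.z));
    }
}

static class Program
{
    static void Main()
    {
        var value = new Float3(4f, 11f, 20f);
        var sum = new Float3(5 + value.x, 5 + value.y, 5 + value.z); // (9, 16, 25)

        Float3 a = MathSketch.Sqrt(sum);
        Float3 b = MathSketch.SqrtByRef(in sum);

        Console.WriteLine($"{a.x} {a.y} {a.z}"); // prints "3 4 5"
        Console.WriteLine($"{b.x} {b.y} {b.z}"); // prints "3 4 5"
    }
}
```

Both variants return the same result; the only difference is how the argument reaches the method.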
     
  13. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    Is the mono code you are measuring ECS / NativeArray code or the C# version?

    ECS / NativeArray code without burst jumps through many hoops. (Safety checks, extracting data in array via method call etc) In mono a normal array access is faster than NativeArray access for example. (Arrays have special optimization in the JIT that can't be beaten with a container like NativeArray out of the box).

    The main thing to remember is that when using DOTS, you don't write code that actually runs in Mono or IL2CPP when you care about performance. All of that code generally runs in Burst, where the codegen can take advantage of NativeArray and get massive optimizations.

    So while it is surprisingly slow, it is also meaningless to measure it.

    It is possible we could optimise mono & il2cpp for the NativeArray case further and make it as fast, but we do not see a reason for spending the time on it because it's not how any of the deployed code should run anyway. We spend our time on optimising Burst even further...
     
  14. Creepgin

    Joined:
    Dec 14, 2010
    Posts:
    268
    So basically, don't bother writing non-burstable jobs when you care about absolute total cycles (i.e. server environment).
     
  15. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    Hi Joachim, the mono code is exactly this:
    Code (CSharp):
    struct BenchmarkComponent : IComponentData
    {
        public float3 Value;
    }
    class BenchmarkSystem : JobComponentSystem
    {
        //[BurstCompile]  - disabled Burst
        struct BenchmarkEditComponentJob : IJobForEach<BenchmarkComponent>
        {
            public void Execute(ref BenchmarkComponent component)
            {
                component.Value = math.sqrt(5 + component.Value);
            }
        }
        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            return new BenchmarkEditComponentJob().ScheduleSingle(this, inputDeps);
        }
    }
    Time was measured in a standalone development build with the profiler attached with 1 million entities. It took 103 ms. It's about 50x slower than with Burst and about 20x slower than the equivalent C# console application I posted.

    Right now I suspect UnsafeUtilityEx.ArrayElementAsRef() in ExecuteChunk() might be responsible for much of the slowness, but I'm not sure why.
     
  16. tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    If you're trying to compare wouldn't this

    Code (CSharp):
    var watch = Stopwatch.StartNew();
    new BenchmarkEditComponentJob().ScheduleSingle(this, inputDeps).Complete();
    // One scheduled job is one pass over the data, so log the elapsed time directly.
    Debug.Log(watch.Elapsed.TotalMilliseconds);
    be a fairer comparison. I'm not sure it'd make a huge difference but I think consistency is important.

    That said it's still not a fair test as you're comparing the entire entity system to just a single loop.
    A simple IJob with the identical code seems much fairer testing performance.

    Code (CSharp):
    struct BenchmarkEditComponentJob : IJob
    {
        // Fields added so the snippet compiles; NativeArray<float3> is an assumed data layout.
        public NativeArray<float3> Values;
        public int Iterations;

        public void Execute()
        {
            for (int i = 0; i < Iterations; i++)
            {
                for (int j = 0; j < Values.Length; j++)
                {
                    // NativeArray's indexer returns a copy, so modify a local and write it back.
                    float3 v = Values[j];
                    v.x = (float)Math.Sqrt(5 + v.x);
                    v.y = (float)Math.Sqrt(5 + v.y);
                    v.z = (float)Math.Sqrt(5 + v.z);
                    Values[j] = v;
                }
            }
        }
    }
     
  17. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    What I am saying is that almost all code should be written in Burst, including server environment code. We are building our netcode library to allow for that.
     
  18. tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    So yeah I don't feel your tests were that fair for what you seem to be trying to compare so I wrote my own and I have very different results.

    TLDR: mono slow, IL2CPP slightly faster than regular c#, burst definitely shows a speed up

    All tested on a 6-7 year old 3570K on a windows machine.
    I included both a regular test outside of jobs and a job equivalent just to compare, but as you'd expect they are near identical. Burst is definitely a bit faster.
    The work I did in test is calculate the first 1000 prime numbers.

    Using Performance Test

    Firstly I wrote it as a performance test and this is the result.

    [Image: performance test results, upload_2019-8-18_9-4-2.png]

    Then I wrote a similar test at runtime.

    2019.2.0f1, windows build.

    Windows - Mono

    Code (CSharp):
    Regular: 1082.1713
    Job: 1065.561
    Burst: 339.9621
    Windows - IL2CPP

    Code (CSharp):
    Regular: 443.1264
    Job: 458.5947
    Burst: 379.9856
    Similar result, regular and job is nearly exactly the same.
    Burst being a bit faster.
    I think the big take away from this is how much better IL2CPP is than mono

    Regular C#
    OK, so now to compare to a regular c# console app.
    .net framework 4.7.2 windows console application.

    [Image: console application results, upload_2019-8-18_8-54-57.png]

    That is very similar to the speed of IL2CPP.
    So my results definitely do not match the original.

    Full source code.

    Performance Test

    Code (CSharp):
    namespace BovineLabs.PerformanceTests
    {
        using NUnit.Framework;
        using Unity.Burst;
        using Unity.Collections;
        using Unity.Jobs;
        using Unity.PerformanceTesting;

        /// <summary>
        /// The JobPerformanceTests.
        /// </summary>
        public class JobPerformanceTests
        {
            private const int Count = 1000;

            [Test]
            [Performance]
            public void Test()
            {
                var output = new NativeArray<long>(Count, Allocator.TempJob);

                Measure.Method(() =>
                    {
                        for (var i = 0; i < Count; i++)
                        {
                            output[i] = FindPrimeNumber(i);
                        }
                    })
                    .Run();

                output.Dispose();
            }

            [Test]
            [Performance]
            public void JobTest()
            {
                var output = new NativeArray<long>(Count, Allocator.TempJob);

                Measure.Method(() =>
                    {
                        new Job
                        {
                            Count = Count,
                            Output = output,
                        }.Schedule().Complete();
                    })
                    .Run();

                output.Dispose();
            }

            [Test]
            [Performance]
            public void BurstTest()
            {
                var output = new NativeArray<long>(Count, Allocator.TempJob);

                Measure.Method(() =>
                    {
                        new BurstJob
                        {
                            Count = Count,
                            Output = output,
                        }.Schedule().Complete();
                    })
                    .Run();

                output.Dispose();
            }

            private struct Job : IJob
            {
                public NativeArray<long> Output;

                public int Count;

                public void Execute()
                {
                    for (var i = 0; i < this.Count; i++)
                    {
                        Output[i] = FindPrimeNumber(i);
                    }
                }
            }

            [BurstCompile]
            private struct BurstJob : IJob
            {
                public NativeArray<long> Output;

                public int Count;

                public void Execute()
                {
                    for (var i = 0; i < this.Count; i++)
                    {
                        Output[i] = FindPrimeNumber(i);
                    }
                }
            }

            private static long FindPrimeNumber(int n)
            {
                int count = 0;
                long a = 2;
                while (count < n)
                {
                    long b = 2;
                    int prime = 1; // to check if found a prime
                    while (b * b <= a)
                    {
                        if (a % b == 0)
                        {
                            prime = 0;
                            break;
                        }

                        b++;
                    }

                    if (prime > 0)
                    {
                        count++;
                    }

                    a++;
                }

                return --a;
            }
        }
    }
    Runtime Test

    Code (CSharp):
    namespace BovineLabs.Common
    {
        using System.Diagnostics;
        using Unity.Burst;
        using Unity.Collections;
        using Unity.Jobs;
        using UnityEngine;
        using Debug = UnityEngine.Debug;

        /// <summary>
        /// The PerformanceTest.
        /// </summary>
        public class PerformanceTest : MonoBehaviour
        {
            private const int Count = 1000;

            private void Start()
            {
                // Run once and ignore output
                RegularTest(false);
                JobTest(false);
                BurstTest(false);

                RegularTest(true);
                JobTest(true);
                BurstTest(true);
            }

            private static void RegularTest(bool outputLog)
            {
                var output = new NativeArray<long>(Count, Allocator.TempJob);

                var watch = Stopwatch.StartNew();
                for (var i = 0; i < Count; i++)
                {
                    output[i] = FindPrimeNumber(i);
                }

                if (outputLog)
                {
                    Debug.Log($"Regular: {watch.Elapsed.TotalMilliseconds}");
                }

                output.Dispose();
            }

            private static void JobTest(bool outputLog)
            {
                var output = new NativeArray<long>(Count, Allocator.TempJob);

                var watch = Stopwatch.StartNew();
                new Job
                {
                    Count = Count,
                    Output = output,
                }.Schedule().Complete();

                if (outputLog)
                {
                    Debug.Log($"Job: {watch.Elapsed.TotalMilliseconds}");
                }

                output.Dispose();
            }

            private static void BurstTest(bool outputLog)
            {
                var output = new NativeArray<long>(Count, Allocator.TempJob);

                var watch = Stopwatch.StartNew();

                new BurstJob
                {
                    Count = Count,
                    Output = output,
                }.Schedule().Complete();

                if (outputLog)
                {
                    Debug.Log($"Burst: {watch.Elapsed.TotalMilliseconds}");
                }

                output.Dispose();
            }

            private struct Job : IJob
            {
                public NativeArray<long> Output;

                public int Count;

                public void Execute()
                {
                    for (var i = 0; i < this.Count; i++)
                    {
                        Output[i] = FindPrimeNumber(i);
                    }
                }
            }

            [BurstCompile]
            private struct BurstJob : IJob
            {
                public NativeArray<long> Output;

                public int Count;

                public void Execute()
                {
                    for (var i = 0; i < this.Count; i++)
                    {
                        Output[i] = FindPrimeNumber(i);
                    }
                }
            }

            private static long FindPrimeNumber(int n)
            {
                int count = 0;
                long a = 2;
                while (count < n)
                {
                    long b = 2;
                    int prime = 1; // to check if found a prime
                    while (b * b <= a)
                    {
                        if (a % b == 0)
                        {
                            prime = 0;
                            break;
                        }

                        b++;
                    }

                    if (prime > 0)
                    {
                        count++;
                    }

                    a++;
                }

                return --a;
            }
        }
    }
    Regular C#

    Code (CSharp):
    using System;
    using System.Diagnostics;

    namespace PerformanceTest
    {
        class Program
        {
            private const int Count = 1000;

            static void Main(string[] args)
            {
                RegularTest(false);
                RegularTest(true);
                RegularTest(true);
                RegularTest(true);
            }

            private static void RegularTest(bool outputLog)
            {
                var output = new long[1000];

                var watch = Stopwatch.StartNew();
                for (var i = 0; i < Count; i++)
                {
                    output[i] = FindPrimeNumber(i);
                }

                if (outputLog)
                {
                    Console.WriteLine($"Regular: {watch.Elapsed.TotalMilliseconds}");
                }
            }

            private static long FindPrimeNumber(int n)
            {
                int count = 0;
                long a = 2;
                while (count < n)
                {
                    long b = 2;
                    int prime = 1; // to check if found a prime
                    while (b * b <= a)
                    {
                        if (a % b == 0)
                        {
                            prime = 0;
                            break;
                        }

                        b++;
                    }

                    if (prime > 0)
                    {
                        count++;
                    }

                    a++;
                }

                return --a;
            }
        }
    }
     
    Last edited: Aug 18, 2019
  19. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    I think you're misunderstanding the purpose here. I'd like to find out why ECS is so slow and if/how it could be improved. You completely eliminated ECS from your test, so you're no longer benchmarking the thing we're interested in.

    I'm not sure what the goal of your benchmark there is. Are you trying to measure job scheduling overhead? Compare regular .NET to Mono?

    That's interesting. It looks like either RyuJIT is faster than Mono or writing to a regular array is faster than a NativeArray, or some combination of the two. That could explain some of the slowness of ECS. It doesn't fully explain the 20x difference though.
     
  20. Creepgin

    Joined:
    Dec 14, 2010
    Posts:
    268
    @Mike37 I totally get your original purpose. But there's a lot of internal stuff going on in IJobForEach and its counterparts, or just with ECS in general. I think it's a little bit moot now trying to pinpoint the root cause because:

    1) There are probably a lot of contributing factors.
    2) @Joachim_Ante just said they are not going to spend time optimizing this aspect of ECS due to higher priorities with Burst. And frankly, I agree with that decision because most things should be Burstable in the long run.

    @Joachim_Ante Yes, I wasn't implying that you shouldn't use DOTS for server code, just non-burstable jobs. We are on the same page here. Though currently, I do have some non-burstable jobs in my project and they all involve EntityCommandBuffers.

    @tertle I find it funny that the number you get for Burst + Mono is better than Burst + IL2CPP
     
    Last edited: Aug 18, 2019
  21. Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    So the jitted output from Mono looks rather silly. My knowledge of assembly is limited. Maybe someone can clear this up.
    Code (CSharp):
    // Disassembly for: component.Value.y = (float)Math.Sqrt(5 + component.Value.y);
    // The next 6 instructions seem pointless. Store 0 in r11, check if r11 equals itself, and then always skip the next two instructions. Why is this done?
    20575DDB7E0 - mov r11d,00000000 { 0 }
    20575DDB7E6 - test r11,r11
    20575DDB7E9 - je 20575DDB7F3           // should always jump.
    20575DDB7EB - mov r11,[rsp+08]
    20575DDB7F0 - call qword ptr [r11]
    20575DDB7F3 - nop                      // jumps to here

    20575DDB7F4 - mov rax,rsi               // rax already equals rsi.
    20575DDB7F7 - add rax,00 { 0 }          // another seemingly pointless instruction (rax += 0)
    20575DDB7FB - mov [rsp+30],rax
    20575DDB800 - movss xmm0,[20575DDB940] { (5.00) }  // xmm0 = 5f
    20575DDB808 - cvtss2sd xmm0,xmm0        // xmm0 = (double)xmm0;
    20575DDB80C - mov rax,rsi               // rax still has the same value. It doesn't matter how many times you add zero, Mono...
    20575DDB80F - add rax,00 { 0 }          // this again
    20575DDB813 - movss xmm1,[rax+04]       // xmm1 = component.Value.y
    20575DDB818 - cvtss2sd xmm1,xmm1        // xmm1 = (double)component.Value.y
    20575DDB81C - addsd xmm0,xmm1           // xmm0 += xmm1. So far we completed (5 + component.Value.y).

    // 5 of the next 6 instructions are just moving the value in xmm0 to different locations. The next 5 instructions could probably be replaced by a single 'sqrtss' instruction.
    20575DDB820 - movsd [rsp-08],xmm0       // store result on the stack
    20575DDB826 - fld qword ptr [rsp-08]    // push to FPU register stack
    20575DDB82A - fsqrt
    20575DDB82C - fstp qword ptr [rsp-08]   // pop back to stack
    20575DDB830 - movsd xmm0,[rsp-08]       // off the stack, back into xmm0
    20575DDB836 - movsd [rsp+38],xmm0       // and back on the stack again

    // This strange seemingly pointless pattern again.
    20575DDB83C - mov r11d,00000000 { 0 }   // r11 is already 0
    20575DDB842 - test r11,r11
    20575DDB845 - je 20575DDB84F
    20575DDB847 - mov r11,[rsp+08]
    20575DDB84C - call qword ptr [r11]
    20575DDB84F - nop                       // jumps to here

    20575DDB850 - mov rax,[rsp+30]          // again, they're already equal. We haven't changed rax since we saved its value to [rsp+30].
    20575DDB855 - movsd xmm0,[rsp+38]       // already equal
    // I have no idea why these next 3 lines exist...
    20575DDB85B - cvtsd2ss xmm0,xmm0       // xmm0 = (float)xmm0
    20575DDB85F - cvtss2sd xmm0,xmm0       // xmm0 = (double)xmm0
    20575DDB863 - cvtsd2ss xmm5,xmm0       // xmm5 = (float)xmm0
    20575DDB867 - movss [rax+04],xmm5      // component.Value.y = xmm5
    It does a number of strange things. If anyone knows why at the end it pointlessly converts to a double and back, I'd be interested.

    Nonetheless, I don't think Mono is the main issue. It's only about 2x slower than regular .NET, not 20x slower.
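For reference, the widen/narrow pattern in the listing above matches what the C# source asks for when it calls the double-precision Math.Sqrt on a float. Below is a minimal sketch in plain .NET (not Unity code) of the two shapes. The class and method names are made up for illustration, and MathF is an assumption about the runtime: it needs .NET Core 2.0+ or .NET Standard 2.1, so it may not be available in older Unity scripting profiles.

```csharp
using System;

public static class SqrtPaths
{
    // Widen to double, take the root, narrow back to float.
    // This is the shape of `(float)Math.Sqrt(5 + x)` from the benchmark.
    public static float ViaDouble(float x) => (float)Math.Sqrt(5 + x);

    // Stay in single precision the whole way.
    // Assumption: MathF exists on the target runtime.
    public static float ViaSingle(float x) => MathF.Sqrt(5f + x);
}
```

Whether a given JIT turns the second form into a single scalar square root instruction is up to that JIT; the sketch only shows the source-level difference.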
     
  22. Mike37

    Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    @Creepgin I agree too, I'd rather use Burst where possible.

    I'm not asking the Unity team to fix it - they'll balance their priorities and time constraints as they see fit. I'm just investigating, and I guess bringing it to their attention.

    In this case, I wouldn't be surprised if a very small change could give at least a 5x speedup. There's usually some low-hanging fruit with slowdowns of this magnitude, and the profiler seems to indicate the problem lies in these lines of IJobForEach.gen.cs:
    Code (CSharp):
    for (var i = 0; i != count; i++)
    {
        jobData.Data.Execute(ref UnsafeUtilityEx.ArrayElementAsRef<T0>(ptr0, i));
    }
    Which doesn't look like it should be slow. I wonder if something as simple as,
    Code (CSharp):
    for (var i = ptr0; i != endAddress; i += elementSize)
    {
        jobData.Data.Execute(ref Unsafe.AsRef<T0>(i));
    }
    might fix it?
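To make the two loop shapes concrete outside of Unity, here is a rough sketch in plain .NET. It is not the generated IJobForEach code: BenchmarkValue, IElementJob, SqrtJob and LoopShapes are hypothetical names, and it uses System.Runtime.CompilerServices.Unsafe on a Span instead of Unity's UnsafeUtilityEx. Both methods visit the same elements. The first recomputes the element address from an index, the second advances one reference until it reaches the end.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public struct BenchmarkValue { public float Value; }

// Hypothetical stand-in for the job interface.
public interface IElementJob { void Execute(ref BenchmarkValue v); }

public struct SqrtJob : IElementJob
{
    public void Execute(ref BenchmarkValue v) => v.Value = (float)Math.Sqrt(5 + v.Value);
}

public static class LoopShapes
{
    // Index form: the address is derived from (first, i) on every iteration.
    public static void ByIndex<TJob>(ref TJob job, Span<BenchmarkValue> data)
        where TJob : struct, IElementJob
    {
        ref BenchmarkValue first = ref MemoryMarshal.GetReference(data);
        for (int i = 0; i != data.Length; i++)
            job.Execute(ref Unsafe.Add(ref first, i));
    }

    // Stride form: a single reference is advanced one element at a time.
    public static void ByStride<TJob>(ref TJob job, Span<BenchmarkValue> data)
        where TJob : struct, IElementJob
    {
        ref BenchmarkValue current = ref MemoryMarshal.GetReference(data);
        ref BenchmarkValue end = ref Unsafe.Add(ref current, data.Length);
        while (Unsafe.IsAddressLessThan(ref current, ref end))
        {
            job.Execute(ref current);
            current = ref Unsafe.Add(ref current, 1);
        }
    }
}
```

Timing the two with a Stopwatch over a large array would show whether the loop shape matters on a given runtime; the sketch makes no claim about which one is faster.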
     
  23. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    2,370

    I would assume they have good reasons for stuff like this; it's not the type of thing you just miss. In this specific case, Unsafe is implemented with raw IL, so it kind of makes sense that they would replace it with their own native path.
     
  24. Mike37

    Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    UnsafeUtilityEx.ArrayElementAsRef() calls Unsafe.AsRef(). This is just some manual inlining.
     
  25. Mike37

    Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    So it turns out that having a development build with the profiler attached affects Mono much more than Burst or IL2CPP. That combined with @GilCat's observation on math.sqrt(float3) being much slower than System.Math.Sqrt(double) in Mono explains most of the discrepancy.

    These are my timings in a non-development build without the profiler:
    Mono with math.sqrt(float3): 86 ms
    IL2CPP: 16 ms
    Mono with Math.Sqrt(): 10 ms
    Burst: 1.7 ms

    So Mono is down from 103 ms in the original test to a much more reasonable 10 ms. It's interesting that IL2CPP is now slower than Mono. Also, IJobForEach and IJobChunk now perform about the same, even though IJobChunk was faster in the development build.

    So, I guess, don't profile with the profiler if you're comparing Mono to Burst or IL2CPP.
     
  26. alexeyzakharov

    alexeyzakharov

    Unity Technologies

    Joined:
    Jul 2, 2014
    Posts:
    275
    Do you see the profiler impact when it is disabled in a development build?
    The profiler should not impact Mono performance unless deep profiling is enabled (but that is available only in >=2019.3).
    The option that decreases Mono performance is "Script debugging", and it might be enabled when you choose Development Player.
     
  27. Mike37

    Mike37

    Joined:
    Oct 21, 2018
    Posts:
    21
    Ah, okay, you're right, I rebuilt again with "Script debugging" off and got 12 ms.
     
  28. alexeyzakharov

    alexeyzakharov

    Unity Technologies

    Joined:
    Jul 2, 2014
    Posts:
    275
    Great, thank you for checking this out! :)
     
  29. any_user

    any_user

    Joined:
    Oct 19, 2008
    Posts:
    340
    Does that mean cross compiling will be possible soon? Currently this is the main reason why we also care about non-Burst performance for some platforms (otherwise we'd need different build machines for the different standalone builds).
     
  30. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    There is no plan for cross compiling. You will need to have build machines with the relevant SDKs.