Job System not as fast as mine, why not?

Discussion in 'Data Oriented Technology Stack' started by TBbadmofo, Feb 20, 2018.

  1. TBbadmofo

    TBbadmofo

    Joined:
    Apr 1, 2014
    Posts:
    14
    When I heard about the job system, I started rewriting my code for parallel processing to be ready for it. This was my first time writing something like this, and I figured the Job System would do it much better. My code takes 60 ms to execute and the Job System takes 120 ms. I benchmarked this outside the Unity editor with Development Build unchecked. Both versions use 2 arrays. One is an int storage array for results (there's no telling how many results will come back, but in this case around 120k), and the other is a byte array of pixel data for an image of about 4000x1500 at 4 bytes per pixel, so roughly 24 million elements.

    In my code I use a LinkedList for the results, and in the Job System I just set the NativeArray a little bigger than 120k to see this performance. For the other array I just use a byte array in my code, as that's what the pixel data already is, and in the Job System I use another NativeArray, but this one is read-only. I passed 64 to the Job System Schedule call, since that seemed to give about the best performance. The Execute code of the Job System took about 50 ms. The other 70 ms is spent on the 2 arrays, and pretty much all of that is the pixel data array.

    In my code I can't really benchmark without the array creation. I tried to do what you guys said the Job System would do so it would be somewhat similar. I split the job up over however many processors the system has (in my case 6). Each processor gets a linked list to return as a result and a chunk of the 4000 pixels of width: 2 processors get 666 pixels and the others get 667. Then I use System.Threading.Tasks.Parallel.For with the loop set to run 6 iterations, one for each processor. Dividing up the chunks keeps my code safe, and having individual result lists makes sure I have no collisions. None of this dividing up is static. It's just as flexible as the Job System: it can handle any image size and any number of processors, so I have that performance hit as well.
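
    The chunk split described above can be sketched in plain C#. This is only a minimal illustration, not the code from the project: the names ChunkSplit and Boundaries are hypothetical, and it uses integer division and remainder rather than float rounding.

```csharp
using System;

public static class ChunkSplit
{
    // Returns workers + 1 boundaries; worker i handles the half-open
    // range [bounds[i], bounds[i + 1]). The first (length % workers)
    // workers each receive one extra element.
    public static int[] Boundaries(int length, int workers)
    {
        if (workers <= 0) throw new ArgumentOutOfRangeException(nameof(workers));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        int baseSize = length / workers;
        int remainder = length % workers;

        var bounds = new int[workers + 1];
        for (int i = 0; i < workers; i++)
        {
            int size = baseSize + (i < remainder ? 1 : 0);
            bounds[i + 1] = bounds[i] + size;
        }
        return bounds;
    }
}
```

    For a width of 4000 and 6 workers this gives the boundaries 0, 667, 1334, 2001, 2668, 3334, 4000, i.e. four chunks of 667 columns and two of 666.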

    Other than array creation the code is the same. Even the 50ms comes awfully close to mine and if I could subtract out my array creation I'd probably beat it.

    Any thoughts on this?
     
  2. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,012
    1. read @dyox posts
    2. wait for the official release of the jobs system
     
  3. MartinGram

    MartinGram

    Unity Technologies

    Joined:
    Feb 24, 2017
    Posts:
    40
    If you post the sample code in question, we will likely be able to provide you with better answers.
     
  4. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,642
    A couple of general thoughts on what to expect:

    1. In the editor, NativeArray has debugging overhead compared to builtin arrays:
    - We detect race conditions
    - In IJobParallelFor we detect writes to the wrong ranges of indices

    2. The Mono JIT has dedicated instructions for builtin array access, so in Mono we can't get exactly the same speed for NativeArray lookups. With IL2CPP, however, we are on par or better when comparing NativeArray vs builtin arrays. We expect that these days most of our users use IL2CPP for the final deployed game for the best performance. So please measure with IL2CPP in a standalone player. (Also see the note below for the latest build with some optimizations that will make it into 18.1.)

    3. The job scheduler in Unity has significantly less overhead. The best way to measure this is to schedule a bunch of empty jobs. Again, the editor has quite a bit of overhead due to race condition detection, so it's important to measure in a standalone player. There are three important things to measure:
    - GC allocations caused by scheduling a job. Our view is that keeping them at zero is critical to avoid GC collections later on. We do that; ParallelTasks very much does not
    - Cost to actually schedule + execute
    - Cost of actually running in harmony with other engine threads (reducing context switch cost). The Unity job system is the same job system that engine code uses, allowing for greater integration & no context switch cost

    4. Ultimately neither Mono nor IL2CPP performance really matters. The compiler we expect all users to use for C# jobs is Burst. It will NOT be available in 18.1 but likely in 18.2. Burst itself does not know what a builtin array is. Essentially, Burst is a compiler dedicated to the problem of making C# jobs, and a specific subset of C#, get the absolute best performance you could hope for. For this reason we make the assumption that there are no GC types at all in the kind of code that Burst executes. Hence everything is native containers + structs. This is part of what enables the 5x-10x speedups we are usually seeing in Burst vs Mono/IL2CPP. We also generally beat C++ performance by good margins with Burst already.

    You probably want to watch this for a more complete overview of what we are aiming at:


    It would be great if you can share the specific benchmark you made so we can take a look.


    Note on 2). These IL2CPP optimizations are not yet in the just-released beta. Here is a build from a branch that will soon make it into the official beta builds, so you can do the benchmark tests today:
    https://beta.unity3d.com/download/966b48dc5f14/public_download.html
    (The build has not gone through QA, so I don't recommend using it beyond benchmarking.)
     
    Last edited: Feb 20, 2018
  5. TBbadmofo

    TBbadmofo

    Joined:
    Apr 1, 2014
    Posts:
    14
    So I extracted the code out into an example project. Your job system performs pretty well: 27-28ms every time. Mine varies oddly between 30, 60, and 90ms. In the editor I get a pretty steady 45ms with mine. Now to the weirder part. In my game, in a non-development standalone build, yours takes 120ms and mine takes something crazy like 400ms. But in the editor mine gives me the 60ms time. I know an example would help, but I'm just not seeing the issue in the example, except for the part where mine runs slower outside the editor.

    Seems I need the Windows SDK to build IL2CPP, I will try that.

    Will all of my code benefit from Burst as well, or just the code that uses the job system? I thought Burst would optimize my math, so I was thinking all of my code would benefit.

    How are we supposed to use the performance monitor if the job system is going to have huge safety checking overhead while in the editor? Is there a way to skip the safety overhead while using the editor?
     
    Last edited: Feb 24, 2018
  6. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,642
    Burst is a compiler made specifically for the C# job system. It is built to take advantage of all the restrictions we place on C# jobs anyway to get incredible speedups. It cannot be used to run generic main thread C# code or other code that is scheduled via .NET Tasks etc.
     
  7. TBbadmofo

    TBbadmofo

    Joined:
    Apr 1, 2014
    Posts:
    14
    Okay, I'm not sure why, but I'm getting widely varying results, and I've also somehow managed to make the Job System time in the standalone build worse. I have created an example project. Here are my testing results.

    In Editor:
    Job System: ~400 ms
    Mine: ~50 ms

    Non Development Mono Build:
    Job System: 120ms
    Mine: ~400ms

    Non Development IL2CPP 2018.1.0b9:
    Job System: 250ms
    Mine: 40ms

    I have uploaded an entire project containing just the code in question.

    Here is also just the code and the test image I used. If you set up your own project, you'll need to set .NET to version 4 and restart Unity, make the image an uncompressed 4k texture with read/write enabled, and create a folder called Resources and put the image in there.
    Code (CSharp):
    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    using Unity.Collections;
    using Unity.Jobs;

    public class Test : MonoBehaviour {

        System.Threading.Tasks.ParallelOptions options = new System.Threading.Tasks.ParallelOptions();

        void Start () {
            Time.timeScale = 0;

            options.MaxDegreeOfParallelism = System.Environment.ProcessorCount;

            Texture2D textureImage = GameObject.Instantiate(Resources.Load<Texture2D>("testimage"));

            //Raw pixel data, 4 bytes per pixel
            byte[] spriteData = textureImage.GetRawTextureData();
            int width = textureImage.width;
            int height = textureImage.height;

            UnityEngine.UI.Text JobBenchText = GameObject.Find("JobBenchText").GetComponent<UnityEngine.UI.Text>();
            UnityEngine.UI.Text MyBenchText = GameObject.Find("MyBenchText").GetComponent<UnityEngine.UI.Text>();

            var Timer = new System.Diagnostics.Stopwatch();

            //Benchmark the Job System (the timing includes creating and disposing the native arrays)
            Timer.Start();
            var results = new NativeArray<int>(22000, Allocator.Persistent); //Using this as a phony results list
            //TempJob instead of Temp: Temp allocations should not be passed to jobs
            var spriteDataNative = new NativeArray<byte>(spriteData, Allocator.TempJob);

            var job = new DoJobSystemTest()
            {
                spriteData = spriteDataNative,
                results = results,
                width = width,
                height = height
            };

            //One Execute call per column x, scheduled in batches of 200 columns
            JobHandle jobHandle = job.Schedule(width, 200);
            jobHandle.Complete();

            results.Dispose();
            spriteDataNative.Dispose();

            Timer.Stop();
            JobBenchText.text = Timer.Elapsed.ToString();

            //Benchmark My Parallel Processing
            Timer.Reset();
            Timer.Start();

            DoMyParallel(width, height, spriteData);

            Timer.Stop();
            MyBenchText.text = Timer.Elapsed.ToString();


            Time.timeScale = 1;
        }

        struct DoJobSystemTest : IJobParallelFor
        {
            [ReadOnly]
            public NativeArray<byte> spriteData;

            [ReadOnly]
            public int width;

            [ReadOnly]
            public int height;

            public NativeArray<int> results;



            //Scans column x for opaque pixels that have a transparent neighbour
            public void Execute(int x)
            {

                byte colorA;
                byte colorB;
                int index;

                for (int y = 0; y < height;)
                {
                    //Offset of the alpha byte of pixel (x, y)
                    index = (x + y * width) * 4 + 3;
                    colorA = spriteData[index];
                    if (colorA != 0)
                    {
                        if (y + 1 < height)
                        {
                            colorB = spriteData[index + width * 4];
                            if (colorB == 0)
                            {
                                //No NativeList at this time
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y += 2;
                                continue;
                            }
                        }

                        //>= 0 so that row 0 counts as a valid neighbour (was y - 1 > 0)
                        if (y - 1 >= 0)
                        {
                            colorB = spriteData[index - width * 4];
                            if (colorB == 0)
                            {
                                //No NativeList at this time
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y++;
                                continue;
                            }
                        }

                        if (x + 1 < width)
                        {
                            colorB = spriteData[index + 4];
                            if (colorB == 0)
                            {
                                //No NativeList at this time
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y++;
                                continue;
                            }
                        }

                        //>= 0 so that column 0 counts as a valid neighbour (was x - 1 > 0)
                        if (x - 1 >= 0)
                        {
                            colorB = spriteData[index - 4];
                            if (colorB == 0)
                            {
                                //No NativeList at this time
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y++;
                                continue;
                            }
                        }

                        y++;
                        continue;
                    }
                    else
                    {
                        y++;
                    }
                }
            }
        }

        void DoMyParallel(int width, int height, byte[] spriteData)
        {
            LinkedList<int>[] results = new LinkedList<int>[System.Environment.ProcessorCount];

            //Used for splitting up the width of the image between processors
            int[] splitCount = new int[System.Environment.ProcessorCount + 1];

            float count = (float)width / System.Environment.ProcessorCount;

            //for an amount that doesn't divide evenly add the left overs to the other processors batch
            for (int i = 1; i < Mathf.Round((count - (int)count) * System.Environment.ProcessorCount) + 1; i++)
            {
                splitCount[i] = 1;
            }

            //initialize the results linkedlist for each processor and add the batch amount to all processors
            for (int i = 0; i < System.Environment.ProcessorCount; i++)
            {
                results[i] = new LinkedList<int>();
                splitCount[i + 1] += (int)count + splitCount[i];
            }



            //One iteration per processor; each iteration scans its own range of columns
            System.Threading.Tasks.Parallel.For(0, System.Environment.ProcessorCount, options, cpu =>
            {
                byte colorA;
                byte colorB;
                int index;
                for (int x = splitCount[cpu]; x < splitCount[cpu + 1]; x++)
                {

                    for (int y = 0; y < height;)
                    {
                        index = (x + y * width) * 4 + 3;
                        colorA = spriteData[index];
                        if (colorA != 0)
                        {
                            if (y + 1 < height)
                            {
                                colorB = spriteData[index + width * 4];
                                if (colorB == 0)
                                {
                                    //results[cpu].AddLast(x);
                                    //results[cpu].AddLast(y);
                                    y += 2;
                                    continue;
                                }
                            }

                            //>= 0 so that row 0 counts as a valid neighbour (was y - 1 > 0)
                            if (y - 1 >= 0)
                            {
                                colorB = spriteData[index - width * 4];
                                if (colorB == 0)
                                {
                                    //results[cpu].AddLast(x);
                                    //results[cpu].AddLast(y);
                                    y++;
                                    continue;
                                }
                            }

                            if (x + 1 < width)
                            {
                                colorB = spriteData[index + 4];
                                if (colorB == 0)
                                {
                                    //results[cpu].AddLast(x);
                                    //results[cpu].AddLast(y);
                                    y++;
                                    continue;
                                }
                            }

                            //>= 0 so that column 0 counts as a valid neighbour (was x - 1 > 0)
                            if (x - 1 >= 0)
                            {
                                colorB = spriteData[index - 4];
                                if (colorB == 0)
                                {
                                    //results[cpu].AddLast(x);
                                    //results[cpu].AddLast(y);
                                    y++;
                                    continue;
                                }
                            }

                            y++;
                            continue;
                        }
                        else
                        {
                            y++;
                        }
                    }
                }
            });

            //Process the results

            //LinkedListNode<int> node;

            for (int cpu=0; cpu < System.Environment.ProcessorCount; cpu++)
            {
                /*node = results[cpu].First;
                for (int i = 0; i < results[cpu].Count; i+=2)
                {
                    DoStuff(node.Value, node.Next.Value);
                    node = node.Next.Next;
                }*/
                results[cpu].Clear();
            }
        }

        // Update is called once per frame
        void Update () {

        }
    }
     

    Last edited: Feb 24, 2018
  8. MartinGram

    MartinGram

    Unity Technologies

    Joined:
    Feb 24, 2017
    Posts:
    40
    TBbadmofo and Cromfeli like this.
  9. rizu

    rizu

    Joined:
    Oct 8, 2013
    Posts:
    1,080
    How does IL2CPP play with Burst compiled code? what does Burst generate? managed or native binaries?
     
  10. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,642
    Burst produces machine code for the target hardware. Burst transforms a subset of .NET bytecode (The subset defined by C# job system and some more) -> machine code.

    We will have more information on Burst later on. We are not aiming to ship Burst as part of 18.1. We are simply talking about it because I believe it's important to understanding the whole concept of C# jobs and all the restrictions we place on C# jobs. To a large extent, the restrictions on C# jobs are the same restrictions that allow Burst to produce machine code with such incredible performance gains.
     
  11. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    7,074
    Is it worth doing benchmarks to compare the various parallel/multi-threaded options available to developers and how they compare with various game related tasks/processes?

    Then when Burst is brought online it can show off its performance advantage.
     
  12. rizu

    rizu

    Joined:
    Oct 8, 2013
    Posts:
    1,080
    I don't want to derail this topic more with Burst talk. What would be the most appropriate forum section for Burst-related discussions? I couldn't find any obvious place to post. https://forum.unity.com/forums/experimental-scripting-previews.107/ seems the most suitable place for this purpose, but there isn't any preview build for Burst yet, so it's a bit out of place there as well.
     
  13. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    2,314
    Curious, is Burst similar in design to LLVM or is it purely JIT optimizations?
     
    Last edited: Feb 26, 2018
  14. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    7,074
    Burst transforms a subset of .NET bytecode -> machine code.

    From the post above, it sounds like it is more of a compiler/assembler type of technology.
     
  15. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    2,314
    Of course, but knowing whether it's designed more like LLVM/.NET Native or is purely a JIT thing gives more insight into the overall direction Unity is taking in this area. There are different approaches to the problem.
     
  16. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    7,074
    OK, good point: is Burst a native compiler or a JIT compiler? Sounds like it's a native compiler, IMHO.
     
  17. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,642
    Burst uses LLVM as part of the compiler stack, with additional optimizations on top of what LLVM provides.

    The integration into the editor is done as a JIT on a per-job basis, meaning that we never interrupt the workflow waiting for jobs to compile. (As you would expect, compiling a large job and applying all optimization passes can take multiple seconds.) We also cache them between changes to scripts if nothing in the script change affected the compilation.
     
  18. MartinGram

    MartinGram

    Unity Technologies

    Joined:
    Feb 24, 2017
    Posts:
    40
    @TBbadmofo

    I had a look at the sample you provided today. I made all of the buffer creation static, since it is irrelevant to the case you provided and just created noise.

    The numbers you provided are correct, and the reason for the performance difference lies in the way you have implemented your sample. In this particular case Parallel.For ends up executing differently from IJobParallelFor.

    We will soon provide IJobParallelForBatch, which handles this specific case. When I measure the Parallel.For and IJobParallelForBatch code paths against each other, we get roughly equal execution time.
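
    As a rough illustration in plain .NET terms (this is not Unity's implementation and not the updated sample; BatchDemo, CountPerIndex and CountBatched are hypothetical names), the difference between per-index and batched execution can be sketched like this. The per-index version invokes the callback once for every element, which is how the job in the sample is driven (one Execute call per column). The batched version invokes it once per range and runs the inner loop locally, which is how the Parallel.For version in the sample is structured. Both return the same result; only the granularity of the callback differs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BatchDemo
{
    // Counts non-zero bytes with one callback invocation per index.
    public static long CountPerIndex(byte[] data)
    {
        long total = 0;
        Parallel.For(0, data.Length, i =>
        {
            if (data[i] != 0) Interlocked.Increment(ref total);
        });
        return total;
    }

    // Counts non-zero bytes with one callback invocation per batch.
    // Each batch accumulates locally and publishes its subtotal once.
    public static long CountBatched(byte[] data, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        int batches = (data.Length + batchSize - 1) / batchSize;
        long total = 0;
        Parallel.For(0, batches, b =>
        {
            int start = b * batchSize;
            int end = Math.Min(start + batchSize, data.Length);
            long local = 0;
            for (int i = start; i < end; i++)
            {
                if (data[i] != 0) local++;
            }
            Interlocked.Add(ref total, local);
        });
        return total;
    }
}
```

    Whether this accounts for the whole gap measured in the sample is an assumption; it only shows why the two code paths are not doing the same amount of per-call work.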

    For clarity, the tests were run without Burst enabled.

    I will post the updated code in a bit.
     
  19. Ethan_VisualVocal

    Ethan_VisualVocal

    Joined:
    Mar 23, 2016
    Posts:
    128
    I think rough benchmarks, or even just guidelines, would be critical for a number of reasons... right now people mostly use Coroutines for everything task/job related.

    Once Unity C# Jobs and .NET async/await (and therefore Tasks) become widely available in Unity 2018.1, users need to be educated about what tool is appropriate for what situation. (Myself included... e.g. are there scenarios where writing new code using Coroutines is a good practice, once 2018.1 lands?)
     
  20. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    25,352
    I don't think there ever was any good reason to use coroutines, except for https://docs.unity3d.com/Manual/BestPracticeUnderstandingPerformanceInUnity3.html
    But trying to show people why coroutines were bad was a futile gesture on my part. That's probably why so many Unity games run smoothly, then pause to clean up garbage, and so on... leading to a bad rep. It's best not to use them unless you're willing to manage the memory behind them (IMHO).
     
  21. recursive

    recursive

    Joined:
    Jul 12, 2012
    Posts:
    584
    They're still better than the Invoke-family of scheduling calls. And they're still a decent way of making time-sliced procedural animations rather simply.

    The problem with them has always been people treating them like a proper way of doing async stuff instead of the workaround they always were, and the optimization tricks for them being less well known until the last few years.
     
  22. TBbadmofo

    TBbadmofo

    Joined:
    Apr 1, 2014
    Posts:
    14
    @MartinGram

    Thanks for looking into it. That sounds great. I look forward to that and NativeList.

    I still have one concern: why does this code (not the job system) run better in the editor and much worse in a Mono release build? Is the editor using something different?
     
  23. AlkisFortuneFish

    AlkisFortuneFish

    Joined:
    Apr 26, 2013
    Posts:
    585
    OT, but ever since they fixed the original shortcomings of the API I have found them pretty useful for some rather complicated state machines, with Func<IEnumerator>, nested IEnumerators etc. The fact that they are evaluated as a tree by Unity's default runner (i.e. you can yield an IEnumerator from an IEnumerator *without* StartCoroutine() and have it execute in that context) makes them very well suited to certain tasks where GC allocation can either be minimised or won't matter.

    The issue is that people use them for things they are totally overkill for, where the overhead plainly is not worth it and would not be worth it with proper async support either.
     
  24. recursive

    recursive

    Joined:
    Jul 12, 2012
    Posts:
    584
    Oh yeah, I forgot about their handy properties for state machines. Especially since IEnumerators are implemented as a kind of state machine under the hood.
     
  25. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    7,074
    I'm not sure what the overhead is, but could Burst create batched assembly jobs that run on the GPU? In theory the average GPU has way more bandwidth than a CPU.

    e.g. Bitcoin POW running on GPU.
     
  26. twobob

    twobob

    Joined:
    Jun 28, 2014
    Posts:
    1,763
    Amen, brother. "Does Homily Hands", What he said.
     
  27. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    902

    I would be interested in seeing the example for IJobParallelForBatch --- could you post the code, please?