Unity DOTS vs Compute Shader

Discussion in 'Entity Component System' started by joshrs926, Apr 14, 2021.

  1. joshrs926

    joshrs926

    Joined:
    Jan 31, 2021
    Posts:
    111
    I was curious which is faster for doing large amounts of calculations: DOTS (jobs + Burst + Collections) or compute shaders. So I put together this speed test. Basically I just create a huge array and tell an IJobFor and a compute shader to fill it up with the result of a difficult math function. I schedule each one and wait for it to complete in the background by checking each frame from a coroutine whether it has completed. Then I log the time. My own results show that the DOTS job usually wins. Transferring the array back from the GPU takes a while. Turning off Safety Checks in the Burst menu sometimes makes the jobs run faster and sometimes not.
    I have an Intel i7-8700 6-core CPU and an RTX 2070 GPU.
    My conclusion is that DOTS is a bit faster. Which you should use depends on which one is easier for you to use and whether you need the data on the GPU for rendering and don't need it transferred back to CPU land.
    I'm curious what other people's results are and whether you think my test is even valid. I'd love to hear other people's thoughts/knowledge on this topic.

    Here are the scripts:

    Code (CSharp):
    using System.Collections;
    using UnityEngine;
    using Unity.Jobs;
    using Unity.Collections;
    using Unity.Burst;
    using Unity.Mathematics;


    public class Dispatcher : MonoBehaviour
    {
        [SerializeField] int arrayLength = 4_000_000;
        [SerializeField] int jobBatchSize = 32;
        [SerializeField] ComputeShader shader = null;
        [SerializeField] bool fetchWholeArrayFromShader = false;
        int randomIndex;

        public void RunTest()
        {
            Debug.Log("--------");
            randomIndex = UnityEngine.Random.Range(0, arrayLength);
            Debug.Log($"index {randomIndex}");
            RunDummyJobToForceEarlyCompilation();
            StartCoroutine(StartJob());
        }

        void RunDummyJobToForceEarlyCompilation()
        {
            float startT = Time.realtimeSinceStartup;
            var dummyArray = new NativeArray<float>(1, Allocator.TempJob);
            var dummyJob = new Job() { results = dummyArray };
            var dummyHandle = dummyJob.Schedule(1, new JobHandle());
            dummyHandle.Complete();
            Log(startT, "dummyJob", "");
            dummyArray.Dispose();
        }

        IEnumerator StartJob()
        {
            float startTime = Time.realtimeSinceStartup;
            var results = new NativeArray<float>(arrayLength, Allocator.TempJob);
            var job = new Job() { results = results };
            var handle = job.ScheduleParallel(arrayLength, jobBatchSize, new JobHandle());
            while (!handle.IsCompleted)
            {
                yield return null;
            }
            handle.Complete();
            float sampleValue = results[randomIndex];
            results.Dispose();
            Log(startTime, "Job", sampleValue.ToString());

            StartCoroutine(StartComputeShader());
        }

        IEnumerator StartComputeShader()
        {
            float startTime = Time.realtimeSinceStartup;
            int kernel = shader.FindKernel("CSMain");
            ComputeBuffer buffer = new ComputeBuffer(arrayLength, sizeof(float));
            shader.SetBuffer(kernel, "results", buffer);
            uint x, y, z;
            shader.GetKernelThreadGroupSizes(kernel, out x, out y, out z);
            int groupSize = (int)(x * y * z);
            shader.Dispatch(kernel, arrayLength / groupSize, 1, 1);
            var request = UnityEngine.Rendering.AsyncGPUReadback.Request(buffer);
            while (!request.done)
            {
                yield return null;
            }
            float sampleValue;
            if (fetchWholeArrayFromShader)
            {
                float[] results = new float[arrayLength];
                buffer.GetData(results);
                sampleValue = results[randomIndex];
            }
            else
            {
                float[] results = new float[1];
                buffer.GetData(results, 0, randomIndex, 1);
                sampleValue = results[0];
            }
            buffer.Release();
            Log(startTime, "Compute Shader", sampleValue.ToString());
        }

        void Log(float startTime, string workName, string sampleValue)
        {
            Debug.Log($"{(Time.realtimeSinceStartup - startTime) * 1000} ms {workName}, sample value {sampleValue}");
        }
    }

    [BurstCompile(CompileSynchronously = true)]
    struct Job : IJobFor
    {
        public NativeArray<float> results;

        public void Execute(int index)
        {
            results[index] = math.sin(math.sin(3));
        }
    }

    #if UNITY_EDITOR
    [UnityEditor.CustomEditor(typeof(Dispatcher))]
    public class Dispatcher_Editor : UnityEditor.Editor
    {
        public override void OnInspectorGUI()
        {
            if (GUILayout.Button("Run Test"))
            {
                (target as Dispatcher).RunTest();
            }
            base.OnInspectorGUI();
        }
    }
    #endif
    Code (CSharp):
    #pragma kernel CSMain

    RWStructuredBuffer<float> results;

    [numthreads(1024, 1, 1)]
    void CSMain(uint3 id : SV_DispatchThreadID)
    {
        results[id.x] = sin(sin(3));
    }
     
  2. JesOb

    JesOb

    Joined:
    Sep 3, 2012
    Posts:
    1,106
    I think your test is wrong: sin(sin(3)) is constant.
    You need at least an input array of floats that gets passed to sin, so it becomes results[index] = sin(sin(input[index])).
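    For illustration, a minimal sketch of what that corrected job could look like, assuming an extra input NativeArray is filled with per-element values before scheduling (the input field and [ReadOnly] attribute are additions for this sketch, not part of the original test; the usings are the same as in the posted script):

    Code (CSharp):
    // Sketch only: same Job struct as in the original test, but reading a
    // per-index input so the compiler cannot fold the math into a constant.
    [BurstCompile(CompileSynchronously = true)]
    struct Job : IJobFor
    {
        [ReadOnly] public NativeArray<float> input; // filled with arbitrary values before scheduling
        public NativeArray<float> results;

        public void Execute(int index)
        {
            // Real per-element work: depends on input[index], so it cannot be precomputed.
            results[index] = math.sin(math.sin(input[index]));
        }
    }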
     
  3. HellGate94

    HellGate94

    Joined:
    Sep 21, 2017
    Posts:
    132
    Also, you are doing very little work, so most of the time is spent outside the actual calculations; reading and writing data to the GPU is the expensive part.
     
  4. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,754
    My view on the subject is the following:
    GPU vs CPU.
    Using DOTS you can easily extend functionality. But the main question is: do I want to use the GPU for calculations while the CPU sits idle and the GPU does heavy rendering, or do I use the CPU to offload the GPU, so I can have more fancy effects on the GPU?
    Many games under-utilise the CPU. So why not take advantage of it with DOTS?
     
    Last edited: Apr 15, 2021
  5. varnon

    varnon

    Joined:
    Jan 14, 2017
    Posts:
    52
    I think it is going to vary a lot depending on what your task is, and what other tasks you are doing. And also hardware, of course.

    I did a boids comparison last year. The code did not use space partitioning but was otherwise optimized. I was able to get 7000 boids at 60fps with just burst jobs and Graphics.DrawMeshInstanced - no ECS. That was my best version. The compute shader approach (with asynchronous retrieval of data and Graphics.DrawMeshInstanced) got me about 4500 boids, and the ECS version (with hybrid renderer) got me about 3500 boids at 60 fps. ECS went up to 5000 and 6000 if I updated a half or third of the boids each frame, in a way that fit the asynchronous compute shader's overall rate a little closer.
    Burst jobs alone are very, very good. Everyone should be using them all the time. Compute shaders will likely be better when you can send the data to the GPU and keep it there. Reading from the GPU is the slowest part. I think ECS is going to scale up well as you have more and more tasks in your game / program. I don't think there is going to be one right answer. You will have to test for your specific situation.
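    For readers unfamiliar with the no-ECS approach described above, here is a hypothetical sketch of the drawing side only: a Burst job fills an array of instance matrices which is then submitted with Graphics.DrawMeshInstanced (at most 1023 instances per call, hence the batching loop). The class, field names, and the placeholder motion are illustrative, not varnon's actual code, and the boid simulation itself is omitted.

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;
    using UnityEngine;

    // Illustrative sketch: Burst job + Graphics.DrawMeshInstanced, no ECS.
    public class InstancedBoidRenderer : MonoBehaviour
    {
        [SerializeField] Mesh mesh = null;
        [SerializeField] Material material = null;  // needs "Enable GPU Instancing"
        [SerializeField] int count = 7000;

        NativeArray<float4x4> matrices;
        Matrix4x4[] batch = new Matrix4x4[1023];    // DrawMeshInstanced limit per call

        void OnEnable() => matrices = new NativeArray<float4x4>(count, Allocator.Persistent);
        void OnDisable() => matrices.Dispose();

        [BurstCompile]
        struct BuildMatricesJob : IJobFor
        {
            public float time;
            public NativeArray<float4x4> matrices;

            public void Execute(int i)
            {
                // Placeholder "simulation": position each instance on a moving ring.
                float a = i * 0.37f + time;
                float3 pos = new float3(math.cos(a) * 20f, math.sin(a * 0.5f), math.sin(a) * 20f);
                matrices[i] = float4x4.TRS(pos, quaternion.identity, new float3(1f));
            }
        }

        void Update()
        {
            new BuildMatricesJob { time = Time.time, matrices = matrices }
                .ScheduleParallel(count, 64, default).Complete();

            // Submit in chunks of up to 1023 instances per DrawMeshInstanced call.
            var asMatrix4x4 = matrices.Reinterpret<Matrix4x4>();
            for (int start = 0; start < count; start += batch.Length)
            {
                int n = math.min(batch.Length, count - start);
                NativeArray<Matrix4x4>.Copy(asMatrix4x4, start, batch, 0, n);
                Graphics.DrawMeshInstanced(mesh, 0, material, batch, n);
            }
        }
    }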
     
    Nyanpas likes this.
  6. MicCode

    MicCode

    Joined:
    Nov 19, 2018
    Posts:
    59
    FYI, Dyson Sphere Program uses the GPU for logic extensively; I've read that the solar sail logic is implemented on the GPU.
    Maybe even the conveyor logic is on the GPU as well: when you change a conveyor, you can see all the other conveyors stutter a bit, so I'm guessing it's updating the compute buffer.
    Not to mention all the animation is GPU vertex animation.
    This is one game that takes performance to the extreme.

    https://www.zhihu.com/question/442555442
     
  7. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,754
    I suggest you check out the Unity DOTS samples and boids. There is a school of fish using boids. I can render 50k fish using this approach on my 5-year-old rig in the editor. The sample would probably run even better now.

     
    Last edited: Apr 15, 2021
    varnon likes this.
  8. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    That's nothing.

    You can download the code to try yourself as well. I can get 400,000 at 30fps on my computer, and probably more if other stuff were optimised.
     
    Vacummus, Nyanpas and varnon like this.
  9. joshrs926

    joshrs926

    Joined:
    Jan 31, 2021
    Posts:
    111
    Thanks everyone for your replies! @Jes28 and @HellGate94 you guys are probably right. Perhaps my test would show different results if I follow your advice. But I think for now I have enough info to go off of. What I was trying to figure out was if one or the other wins in a landslide. Before I ran this test I was expecting the GPU to massively outperform the CPU because the GPU has way more cores. But now I sort of view DOTS and compute shaders on roughly the same level. Now I agree with @Antypodish that it really depends on your use case. I was about ready to totally rewrite my project to use compute shaders instead of DOTS and maybe I would have done that if this test showed compute shaders to be far far superior, but now I will just stick with DOTS since the performance is in the same ballpark.
     
  10. varnon

    varnon

    Joined:
    Jan 14, 2017
    Posts:
    52
    Yeah, I've seen the Unity boids sample. I think the space partitioning gives a big boost to performance, though it's hard to speculate exactly since Unity's boids are very different from mine. They also use a few separate schools, which is an easy way to double the overall number, but that to me isn't the same. Still, excellent performance with a single Unity school due to the space partitioning. I never got around to implementing the space partitioning because my goal at the time was really just to compare different approaches (compute shader, MonoBehaviour + Burst jobs, and ECS), and by the time I had done that, I was pretty satisfied and ready to move on. I might pick it back up one day and try to push it a little farther, but for now it served its purpose of comparison.
     
    Last edited: Apr 15, 2021
  11. berniegp

    berniegp

    Unity Technologies

    Joined:
    Sep 9, 2020
    Posts:
    42
    Hi @joshrs926, good on you to test your assumptions before blindly rewriting your algorithms! I would add that it's equally important to measure what you're looking to optimize to make sure that's where you need to direct your optimization efforts.

    Regarding your benchmark however, it's a bit flawed unfortunately:
    • It's measuring a roundtrip from a coroutine, creating a buffer, (asynchronously) executing a command on the GPU, asynchronously downloading the results from the GPU, and finally waiting for the next coroutine execution to stop the stopwatch. Therefore there are many more orders of magnitude of "overhead" stuff happening here on top of the GPU execution.
    • sin(sin(3)) is a constant operation (the compiler computes it). Doing that 1024 times is probably in total less than 10 cycles per "thread". This is a trivial amount of work and just setting up the shader execution takes longer than the actual shader.
    • In the context of Unity, compute shaders are more useful for stuff that doesn't need to be downloaded back to the CPU (e.g. animating particles, processing a texture).
    • GPU execution time should be measured with some kind of GPU profiler. Since you're timing coroutines, both tests probably give around the same results simply because the coroutines are resumed once per frame so at ~16ms intervals for 60fps.
    • + some other more minor things ;)
    Don't feel bad though because benchmarks are hard to get right. Even when they are technically correct, it's really easy to unknowingly measure a use-case that differs from what we actually wanted to measure.

    As fast as DOTS can be, compute shaders can absolutely blow it out of the water for certain classes of workloads.

    Despite all this, I agree with your conclusion based on the test you did. You don't seem to have enough of a workload to warrant using compute shaders and the added complexity will just slow you down at this point.
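    As one concrete illustration of the readback point: the posted StartComputeShader requests an asynchronous readback but then calls buffer.GetData(), which is a second, synchronous download. A sketch of a variant coroutine that reads the sample from the completed AsyncGPUReadbackRequest instead (field names match the original Dispatcher script; this is only one of the issues listed above, not a fix for the whole benchmark):

    Code (CSharp):
    // Sketch: read the sample value from the async readback result instead of
    // a second, synchronous buffer.GetData() call.
    IEnumerator StartComputeShaderAsyncReadback()
    {
        float startTime = Time.realtimeSinceStartup;
        int kernel = shader.FindKernel("CSMain");
        var buffer = new ComputeBuffer(arrayLength, sizeof(float));
        shader.SetBuffer(kernel, "results", buffer);
        shader.GetKernelThreadGroupSizes(kernel, out uint x, out _, out _);
        shader.Dispatch(kernel, arrayLength / (int)x, 1, 1);

        var request = UnityEngine.Rendering.AsyncGPUReadback.Request(buffer);
        while (!request.done)
        {
            yield return null;
        }

        if (!request.hasError)
        {
            // The data is already in CPU memory; no extra GPU round trip is needed.
            NativeArray<float> results = request.GetData<float>();
            Log(startTime, "Compute Shader (async readback)", results[randomIndex].ToString());
        }
        buffer.Release();
    }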
     
  12. bb8_1

    bb8_1

    Joined:
    Jan 20, 2019
    Posts:
    100
    I would really love to see the results of tests like: AMD (32/64 cores, provided the OS/Unity can utilize all cores) with DOTS vs an RTX 3090 with compute shaders. I'm sure CS would win (in most tests, I guess, but of course not all), but I'm not sure it would own DOTS.
     
    Nyanpas likes this.
  13. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    3,356
    There are a lot of people doing cool stuff with the GPU. But I think most are rather clueless about where to start. I was until I did more of a deep dive into this area.

    Focus on rendering first. Engines, especially Unity due to some limited APIs, barely scratch the surface in this area. This is highly likely where your biggest bang-for-buck items are. It requires a deeper knowledge of rendering than of compute per se to leverage well.

    GPU concurrency models often require some fairly complex approaches. A naïve implementation can hurt more than help, say by negatively impacting rendering.

    That said, imperfect can be ok. The gains can be good enough for a naïve implementation to still work. Nvidia, for example, has a whole suite of software that often performs orders of magnitude better than what engines do; in comparison, the engine version is naïve. But of course Nvidia developers are uniquely qualified.

    Problems that fit well are generally well known. You don't need to go looking for where to use the GPU. If you have some specific problem you are solving and compute is a good fit, then just basic Google research is going to tell you that.
     
  14. joshrs926

    joshrs926

    Joined:
    Jan 31, 2021
    Posts:
    111
    Thanks for the reply! It's good to hear that compute shaders can blow DOTS out of the water in certain cases, as that's what I would expect since there are generally way more cores than on a CPU. I'm getting the impression that compute shaders are generally good when the results are needed on the GPU each frame, or when the job is performed once in a while and it's ok for the CPU to wait a bit for results, and when the computations can be split into roughly similar-sized, simple chunks. My current project requires the job to be performed just once in a great while and it's ok for the CPU to wait a while for the results, but it would be really hard to split up the job into simple equal chunks. Each chunk can vary in complexity/time to complete. So it seems best for now to do this on the CPU.
     
  15. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    I'd say avoid using compute shaders unless you've got no other choice. I've spent a great deal of time messing with them for my grass, and converting it all to Burst + Jobs was the best thing I've done. Sure, there was no Burst + Jobs when I started developing the grass, so it was literally the only way to get all that grass generated in a player's attention span, but it's much easier to interact with it when I don't need to ask the GPU to do things and wait for the answer multiple frames in the future.
    There is still one compute shader I use that serializes a two-dimensional array into a single-dimensional array, so I only need to send which 'groups of up to 1024' blades of grass I want rendered, and the compute shader serializes the blocks into a single array for instanced rendering. That way I both leave some heavy work to the GPU and drastically reduce the amount of data I send to it every frame. There is a very specific time and place to use compute shaders.

    I am really hyped for the day we can send burstable jobs to the GPU without having to write a ton of code, since most of the Burst restrictions are basically the same restrictions you already have on GPU shaders anyway.
     
  16. Nyanpas

    Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    I find this a fascinating thread and would like to see more use cases and/or examples that define a GPU vs CPU approach.
     
  17. berniegp

    berniegp

    Unity Technologies

    Joined:
    Sep 9, 2020
    Posts:
    42
    Some examples of compute shader usage in Unity are:
    Manipulating data already on the GPU (textures, meshes, etc.) in some way and using the result on the GPU is usually a clear win for compute shaders. This avoids costly CPU-GPU copies.

    When the result of an algorithm needs to be accessible on the CPU in some way, there is a good chance that burst compiled jobs will win over compute shaders. It all depends :)

    Compute shaders are harder to work with though since you can't inspect what they do as easily.
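    A minimal sketch of that first, GPU-only pattern: a compute shader writes into a RenderTexture that a material then samples directly, so nothing is ever read back to the CPU. The kernel name "FillTexture", the "Result" and "Time" property names, and the shader asset are assumptions for this sketch, not something from this thread.

    Code (CSharp):
    using UnityEngine;

    // Illustrative example of a GPU-only data flow: a compute shader fills a
    // RenderTexture every frame and a material samples it directly; no readback.
    public class GpuOnlyTexture : MonoBehaviour
    {
        [SerializeField] ComputeShader shader = null;     // kernel "FillTexture" assumed
        [SerializeField] Renderer targetRenderer = null;  // its material samples the texture
        RenderTexture texture;
        int kernel;

        void Start()
        {
            texture = new RenderTexture(512, 512, 0) { enableRandomWrite = true };
            texture.Create();

            kernel = shader.FindKernel("FillTexture");
            shader.SetTexture(kernel, "Result", texture);
            targetRenderer.material.mainTexture = texture;
        }

        void Update()
        {
            shader.SetFloat("Time", Time.time);
            // 512 / 8 = 64 thread groups per axis, assuming [numthreads(8, 8, 1)].
            shader.Dispatch(kernel, texture.width / 8, texture.height / 8, 1);
        }

        void OnDestroy()
        {
            if (texture != null) texture.Release();
        }
    }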
     
  18. PublicEnumE

    PublicEnumE

    Joined:
    Feb 3, 2019
    Posts:
    729
    Thank you for this breakdown. Super informative to read about how you guys use/think of Compute Shaders.
     
  19. berniegp

    berniegp

    Unity Technologies

    Joined:
    Sep 9, 2020
    Posts:
    42
    I've added something like "When to use compute shaders vs burst vs jobs" to my list of subjects I'd like to write a blog post about. I can see how it would be interesting for Unity devs. I have no idea when I'll have time to work on that, but it's noted :)
     
  20. atr0phy

    atr0phy

    Joined:
    Nov 5, 2014
    Posts:
    43
    Curious if this ever came to fruition.
     
  21. benthroop

    benthroop

    Joined:
    Jan 5, 2007
    Posts:
    262
    I would also like to read that, if you ever get around to it @berniegp
     
    Unifikation likes this.
  22. lclemens

    lclemens

    Joined:
    Feb 15, 2020
    Posts:
    760
    Here's an interesting fact.... you might have seen Ultimate Epic Battle Simulator 2 on Steam... and it handles over 1 MILLION 3D battling animated characters on the screen at the same time. I was curious so I researched how they did it. They basically built the ENTIRE GAME in the GPU! Animation, pathfinding, AI, damage and game logic, etc. Only a few things like audio remain outside of the GPU. It's pretty awesome to know that it can be done... but you have to be really really good at shader and GPU coding to pull it off!
     
    bb8_1 likes this.