ECS Culling

Discussion in 'Data Oriented Technology Stack' started by cage_au, Sep 10, 2019.

  1. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    I've written my own 2D sprite rendering system with a Compute Buffer and Geometry Shader, but what I want to achieve is reducing the amount of data sent to the Compute Buffer as efficiently as possible. We've already packed the data down considerably, to 24 bytes.

    I know how to tell if it is visible to the camera or not, that's the easy bit. The question is how to store this in an ECS component and then extract to only send what is required to the compute buffer.

    Currently my brother has coded a system that exports the data from ECS by collating it in a parallel job. We just send everything using SetData on the Compute Buffer.

    Obviously we could loop over the data and push it into a new array but I want to know if there is anything out there that could help solve this problem efficiently.

    Say, for example, I have 300k sprite components in ECS but only 30k are visible; when extracting data from ECS I only want to get the 30k, not the 300k.

    Some things I've thought about are:

    • SharedComponentData: Somehow update a value to say whether it is visible or not, then use the entity query filter. If possible, I'd imagine each time the SCD value is changed it costs to move the entity around.
    • Component Tag: Having a tag to say whether it is visible or not (added and removed on the entity by a system with a command buffer), then filter with it on a query. Works, but obviously not super fast when adding / removing.
    There are pros and cons to everything, but I'm just wondering what others may have done, and whether there's something I haven't thought of that we could try!
     
  2. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Tagging individual entities is going to be slow as it forces a sync point and does a bunch of memory restructuring. Since you are using ComputeBuffers, I would write the culling results to a NativeStream.ParallelWriter and then use ToArray to compact it and send to a ComputeBuffer.
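    A minimal sketch of what that could look like, assuming a Sprite2D component, a Sprite2DBuffer GPU struct with the constructor used later in this thread, and one stream bucket per entity (all names here are illustrative, not the thread's actual code):

    ```csharp
    [BurstCompile]
    private struct CullToStreamJob : IJobForEachWithEntity<Sprite2D, LocalToWorld>
    {
        [ReadOnly] public float2 Min;
        [ReadOnly] public float2 Max;
        public NativeStream.Writer StreamWriter;

        public void Execute(Entity entity, int index, [ReadOnly] ref Sprite2D sprite2D, [ReadOnly] ref LocalToWorld localToWorld)
        {
            // One bucket per entity: the stream must have been allocated with the
            // entity count as its foreachCount for these indices to be valid.
            StreamWriter.BeginForEachIndex(index);

            float x = localToWorld.Position.x;
            float y = localToWorld.Position.z;
            if (x >= Min.x && y >= Min.y && x <= Max.x && y <= Max.y)
                StreamWriter.Write(new Sprite2DBuffer(sprite2D.Index, sprite2D.State, ref localToWorld, sprite2D.Colour));

            StreamWriter.EndForEachIndex();
        }
    }

    // After the job completes, compact only the visible sprites:
    // var visible = stream.ToNativeArray<Sprite2DBuffer>(Allocator.TempJob);
    // _spriteBuffer.SetData(visible, 0, 0, visible.Length);
    ```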
     
  3. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Thanks for that DreamingImLatios. We tried 100K sprites in ECS with the NativeStream; didn't know it existed :)

    However, I also tried a NativeQueue with an IJob Dequeue on the render system into a persistent fixed-size array, and it seems to be much, much faster than converting to a native array every frame. I might have 300k sprites in my world, but realistically only a few thousand are ever visible on screen when zoomed in at the ranges you'd expect a player to be viewing at.

    We will probably revisit the NativeStream later I think but for now this is more than enough! Thanks so much for your suggestion though.

    Code (CSharp):
    [BurstCompile]
    private struct UpdateSprite2DBufferJob : IJob
    {
        public int                         SpriteRenderCount;
        public NativeQueue<Sprite2DBuffer> Sprite2DQueue;
        public NativeArray<Sprite2DBuffer> Sprite2DArray;

        public void Execute()
        {
            for (int i = 0; i < SpriteRenderCount; i++)
                Sprite2DArray[i] = Sprite2DQueue.Dequeue();
        }
    }
     
  4. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Interesting. You are manually converting a NativeQueue into a NativeArray. When you tested converting the NativeStream to a NativeArray, did you do that conversion inside a Burst job as well?
     
  5. Tudor-Nita

    Joined:
    Dec 10, 2009
    Posts:
    336
    He skips the allocation part though (the array is fixed size), which is by far the worst part when moving so much data around. Something similar could be achieved with NativeStream, I guess, but it certainly would not perform significantly better.

    Not a queue, but have you looked into AsArray and its ilk (AsDeferredJobArray, etc.) on NativeList? I didn't really check the internals, but it's probably just a pointer cast. It's significantly faster for me and my particular use-case (RaycastCommand abuse). A free-list approach with two NativeLists would essentially give you a queue anyway.

    I can picture the same cast being viable for NativeQueue with a bit of elbow grease.
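    As a rough sketch of that idea (assumed names; a single-threaded producer, so the concurrent-Enqueue caveat doesn't apply), the list view replaces the per-frame copy:

    ```csharp
    // Allocated once, reused every frame.
    var visibleList = new NativeList<Sprite2DBuffer>(100000, Allocator.Persistent);

    // A culling job appends visible sprites: visibleList.Add(sprite);
    // After that job completes:
    NativeArray<Sprite2DBuffer> view = visibleList.AsArray(); // a view over the list's memory, not a copy
    _spriteBuffer.SetData(view, 0, 0, visibleList.Length);
    ```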
     
  6. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Haven't tried AsDeferredJobArray; still a n00b in so many ways! For me, once I've dequeued, the data goes straight into the compute buffer like below.

    Code (CSharp):
    _args[0] = spriteRenderCount;
    _argsBuffer.SetData(_args);
    _spriteBuffer.SetData(_sprite2DBufferArray, 0, 0, spriteRenderCount);

    Graphics.DrawProceduralIndirect(_material, bounds, MeshTopology.Points, _argsBuffer, 0, null, null, ShadowCastingMode.Off, false, _layerMaskSprite);
    I don't need to slice it either, because I'm telling DrawProceduralIndirect how many to draw and using SetData to only copy what I need (hopefully that's how it works); both the array and the compute buffer are initialised to 100k.

    I'm sure there are faster ways of doing it, but I really don't know how to achieve that at the moment. I did toy around with the idea of getting rid of the queue and going straight to the NativeArray from the ParallelFor job, using an unsafe class with Interlocked.Increment(ref *m_Counter); to allocate into the array, but I didn't get it to work properly. If I could, it would maybe be even faster!
     
  7. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Actually no, this is what my brother coded after your reply, reading the very limited information out there about it. Maybe it can be done better? But we need this to be available for the render call.

    A ParallelFor job is populating the stream, though.

    Code (CSharp):
    public NativeArray<Sprite2DBuffer> GetSprite2DBuffer(Allocator allocator)
    {
        _updateSprite2DStreamJobHandle.Complete();
        var sprite2DBuffer = _sprite2DStream.ToNativeArray<Sprite2DBuffer>(allocator);
        _sprite2DStream.Dispose();
        return sprite2DBuffer;
    }
     
  8. Tudor-Nita

    Joined:
    Dec 10, 2009
    Posts:
    336
    Massaging your queue into an ordered NativeList (or two) and using AsArray (and co.) would remove the need to copy anything. It would just be a cast when feeding list.AsArray() to the buffer. The rest of your code would stay the same.

    If you don't use NativeQueue for the concurrent .Enqueue, this will be a piece of cake. If you do, better skip this.
     
  9. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    So I was originally thinking that the parallel writing to the NativeQueue would be slower than the parallel writing to the NativeStream because the NativeQueue would be doing a bunch of atomics. But it actually only does a few of them in blocks. That means that performance should be on par if not perhaps even a little better than NativeStream for enqueueing. Guess I learned something new today. However, you could convert the NativeStream to a NativeArray in a parallel job which is probably as optimal as you can possibly get for brute force culling.
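    One possible shape for that parallel compaction, assuming the stream was written with one bucket per foreach index and that per-bucket offsets (exclusive prefix sums of the bucket counts) have already been computed in a preceding single-threaded job (names are illustrative):

    ```csharp
    [BurstCompile]
    private struct CopyStreamBucketsJob : IJobParallelFor
    {
        public NativeStream.Reader Reader;

        [ReadOnly] public NativeArray<int> Offsets; // exclusive prefix sums of bucket sizes

        [NativeDisableParallelForRestriction]
        [WriteOnly] public NativeArray<Sprite2DBuffer> Output;

        public void Execute(int bucket)
        {
            // Each bucket writes to its own disjoint slice, so no atomics are needed.
            int count = Reader.BeginForEachIndex(bucket);
            int offset = Offsets[bucket];
            for (int i = 0; i < count; i++)
                Output[offset + i] = Reader.Read<Sprite2DBuffer>();
            Reader.EndForEachIndex();
        }
    }
    ```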
     
  10. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    This beats all the different implementations I've done so far, and I think I'll be sticking with it!

    Three systems now that run one after another, each internally parallel:
    1. IJobForEachWithEntity: Overwrite data in the NativeArray when visible, and also use another NativeArray of bool type to mark whether the specific array index is visible or not

    2. IJobParallelForFilter: Filter on the NativeArray bool type to return a list of indices that should be merged into the final output NativeArray

    3. IJobParallelFor: Take the output of step 2, then push it into the output NativeArray, which will then be placed directly into the Compute Buffer

    All I can say is this appears to be the fastest of all the methods; note all of the buffers are persistent. A good learning experience, and many thanks for everyone's input! :)

    So that's 3 NativeArrays to solve pushing data into one NativeArray. And I don't think NativeLists are supported for parallel writes, but I could be wrong, Tudor-Nita.
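    For reference, steps 2 and 3 could look roughly like this (illustrative names; step 1 is essentially the UpdateSprite2DArrayJob shown later in the thread):

    ```csharp
    [BurstCompile]
    private struct FilterVisibleJob : IJobParallelForFilter
    {
        [ReadOnly] public NativeArray<bool> Visible;

        // Scheduled with ScheduleAppend on a NativeList<int>; returning true keeps the index.
        public bool Execute(int index) => Visible[index];
    }

    [BurstCompile]
    private struct GatherVisibleJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<int> VisibleIndices; // e.g. indicesList.AsDeferredJobArray()
        [ReadOnly] public NativeArray<Sprite2DBuffer> AllSprites;
        [WriteOnly] public NativeArray<Sprite2DBuffer> Output;

        public void Execute(int i) => Output[i] = AllSprites[VisibleIndices[i]];
    }
    ```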
     
    Last edited: Sep 13, 2019
  11. Tudor-Nita

    Joined:
    Dec 10, 2009
    Posts:
    336
    They are not. I mean, maybe you could, if you handle write safety yourself and so on. What you currently have is hard to beat.
     
  12. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    You can use NativeList.AsArray() or NativeList.AsDeferredJobArray(), giving you safe access to write to the elements while guaranteeing that it won't be resized in parallel.
     
    pal_trefall likes this.
  13. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Many thanks Joachim_Ante! I read up about AsDeferredJobArray. I initially had 3 job handle completes; now there's only 1 on the main thread before rendering. I don't think it has made a huge performance difference, but I'd rather use this than have two extra job handle completes.

    For anyone that didn't know like me..
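    The pattern, roughly (assumed job names): the deferred array is resolved to the list's final contents only when the consuming job actually runs, so the consumer can be scheduled before the producer has filled the list, with no Complete() in between:

    ```csharp
    var visibleList = new NativeList<Sprite2DBuffer>(Allocator.TempJob);

    // Producer fills the list; consumer sees its final contents via the deferred array.
    var cullHandle   = new CullJob { Output = visibleList }.Schedule(inputDeps);
    var renderHandle = new BuildBufferJob { Sprites = visibleList.AsDeferredJobArray() }.Schedule(cullHandle);

    // Only one Complete() is needed, right before uploading to the ComputeBuffer.
    renderHandle.Complete();
    ```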

     
  14. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    So I ended up trying out an unsafe struct with a pointer to increase the index each time it decided that a sprite was visible. I thought it would be faster than the three steps I described earlier, but it seems it might actually be a tad slower. Any idea why that would be? My uneducated guess is that it has something to do with the interlocking, because I've gone down to just one array and one int now.

    Code (CSharp):
    [BurstCompile]
    private unsafe struct UpdateSprite2DArrayJob : IJobForEach<Sprite2D, LocalToWorld>
    {
        [ReadOnly]
        public int Sprite2DBufferMaxSize;

        [ReadOnly]
        public float2 Min;

        [ReadOnly]
        public float2 Max;

        [NativeDisableParallelForRestriction]
        [WriteOnly]
        public NativeArray<Sprite2DBuffer> Sprite2DArrayWriter;

        [NativeDisableUnsafePtrRestriction]
        public int* Sprite2DCounter;

        public void Execute([ReadOnly] ref Sprite2D sprite2D, [ReadOnly] ref LocalToWorld localToWorld)
        {
            if (*Sprite2DCounter >= Sprite2DBufferMaxSize)
                return;

            float x = localToWorld.Position.x;
            float y = localToWorld.Position.z;
            if (x < Min.x || y < Min.y || x > Max.x || y > Max.y)
                return;

            int count = System.Threading.Interlocked.Increment(ref *Sprite2DCounter) - 1;
            Sprite2DArrayWriter[count] = new Sprite2DBuffer(sprite2D.Index, sprite2D.State, ref localToWorld, sprite2D.Colour);
        }
    }
    What would be best to do if the results are fairly similar? Perhaps from Unity's perspective they would rather I stick with the 3 safe systems I made: Build, Filter, then Combine?
     
  15. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Interlocks (atomics) can be pretty slow due to the synchronization and potential cache thrashing they can create. If it were me, I would look instead to optimize your inner loop.

    First, pack those Min and Max into a float4 minmax.
    Second, bit twiddle your max values to be a precision unit lower.
    Third, make a float2 xy = localToWorld.Position.xz;
    Fourth, your if condition can look like this: if (math.bitmask(xy.xyxy >= minmax) != 3)

    This A) vectorizes the comparison operation, B) avoids generating 4 branch checks that thrash the branch predictor, and C) extracts the bool results and compares them against a mask literal to create a single test.

    You might not totally care about the bit twiddling part since it looks like you are culling against the center point of the sprite and not the bounds of the sprite.

    I am very curious how much of a speedup this gives you.

    Edit: Wrong mask constant I think. It's either 12 or 3, but I'm too tired to get it straight.
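    For what it's worth, working through the mask under the assumption that math.bitmask packs the bool4's x component into bit 0 (which Unity.Mathematics does), the constant comes out as 3:

    ```csharp
    float4 minmax = new float4(min.x, min.y, max.x, max.y);
    float2 xy = localToWorld.Position.xz;

    // For a point inside the rect: xy >= min gives (true, true) and xy >= max gives
    // (false, false), so the bool4 is (1, 1, 0, 0) and the bitmask is 0b0011 = 3.
    bool visible = math.bitmask(xy.xyxy >= minmax) == 3;
    ```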
     
    Last edited: Sep 15, 2019 at 5:02 AM
    cage_au likes this.
  16. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    Agree. Using atomics is almost always a horrible idea, both in terms of code complexity / safety reviewability and in terms of performance.

    Using atomics with multiple writers on the same field can easily be 100-1000x slower than just writing a normal variable when contention is common. This is no joke.

    Seriously, the number of times programmers shoot themselves in the foot and actually make multithreaded code slower than single-threaded code is intense.

    Here is a good read on the topic:
    https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/


    This is why in DOTS we make you jump through hoops to shoot yourself in the foot using those attributes.
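    A toy illustration of the contention problem in plain C# (not DOTS code; threadCount and itemsPerThread are placeholders): every Interlocked.Increment on the shared counter bounces one cache line between cores, while per-thread counters touch independent memory and are combined once at the end:

    ```csharp
    int sharedCounter = 0;
    long[] perThread = new long[threadCount];

    // Contended: all threads serialize on one cache line.
    Parallel.For(0, threadCount, t =>
    {
        for (int i = 0; i < itemsPerThread; i++)
            System.Threading.Interlocked.Increment(ref sharedCounter);
    });

    // Uncontended: each thread counts locally, one write per thread at the end.
    Parallel.For(0, threadCount, t =>
    {
        long local = 0;
        for (int i = 0; i < itemsPerThread; i++)
            local++;
        perThread[t] = local;
    });
    long total = perThread.Sum(); // same result, a fraction of the cost (System.Linq)
    ```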
     
  17. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    I think 3 works well; when I zoom in, it speeds up. Anyway, the results are in, please see below :) 100k sprites using the interlocking method, random samples over 20 seconds.

    OLD SYSTEM
    56.57ms
    57.25ms
    57.10ms

    UPDATED SYSTEM (with your suggestions)
    65.45ms
    63.45ms
    62.91ms

    I'm going to switch to the previous jobs now I think and compare results.
     
  18. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Thanks, yes, it appears atomics are very slow. I'm learning lots of stuff; I'm a hobby game programmer, so it's good to know.

    I reverted to my previous systems as described in the middle of this thread, with the Update, Filter and Set jobs. The results are super quick at 100k sprites vs the interlocking method!

    UPDATE
    4.23ms
    4.26ms
    4.85ms

    FILTER
    1.45ms
    1.46ms
    1.44ms

    SET
    1.32ms
    1.32ms
    1.35ms

    I guess you could say that, in total, that is at worst 7.66ms, compared to the 57ms it was taking with interlocking. Huge improvement!

    Note I was at full zoom out for all tests meaning all 100k sprites were visible.
     
  19. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Those four steps weren't meant to be individual optimizations but rather a single optimization. They are also completely independent of the interlocking method (other than that I was going off your interlocking code to identify them). So put them in the Update step, and I bet you could shave off another millisecond or two.
     
  20. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Sorry, when I said 3 I was referring to this.

    Here's the current job with your suggestions. It seems to improve things a bit; any other ideas are appreciated, and thanks for all your help!

    Samples with 100k sprites full zoom out.

    UPDATE
    4.26ms
    4.09ms
    4.04ms
    3.93ms
    4.41ms

    Code (CSharp):
    [BurstCompile]
    private struct UpdateSprite2DArrayJob : IJobForEachWithEntity<Sprite2D, LocalToWorld>
    {
        [ReadOnly]
        public float4 MinMax;

        [WriteOnly]
        public NativeArray<Sprite2DBuffer> Sprite2DArrayWriter;

        [WriteOnly]
        public NativeArray<bool> Sprite2DVisibleArrayWriter;

        public void Execute(Entity entity, int index, [ReadOnly] ref Sprite2D sprite2D, [ReadOnly] ref LocalToWorld localToWorld)
        {
            if (index >= SpriteRenderingSystem.Sprite2DBufferMaxSize)
                return;

            var xy = localToWorld.Position.xz;
            if (math.bitmask(xy.xyxy >= MinMax) != 3)
            {
                Sprite2DVisibleArrayWriter[index] = false;
                return;
            }

            Sprite2DArrayWriter[index]        = new Sprite2DBuffer(ref sprite2D, ref localToWorld);
            Sprite2DVisibleArrayWriter[index] = true;
        }
    }
     
  21. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Huh. That didn't make as much of a difference as I would have expected. I wonder what is so expensive in that loop that makes that impact so small. I usually see a 2x increase or more with that style of optimization. I guess I would have to look at the compiled assembly to figure out what's going wrong. 4 ms seems way too slow for 100,000 entities in parallel.

    Oh, you know what? Are you profiling with safety checks and friends disabled?
     
  22. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Nope! Hehe, I can't repeat the test now as we had moved on, but I have left your changes in place. Thanks for your suggestions! I'm sure it will be faster with safety checks disabled, as you say.