ECS Culling

Discussion in 'Data Oriented Technology Stack' started by cage_au, Sep 10, 2019.

  1. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    I've written my own 2D sprite rendering system with a Compute Buffer and Geometry Shader, but what I want to achieve is reducing the amount of data sent to the Compute Buffer as efficiently as possible. We've already packed the data down considerably, to 24 bytes.

    I know how to tell if it is visible to the camera or not, that's the easy bit. The question is how to store this in an ECS component and then extract to only send what is required to the compute buffer.

    Currently my brother has coded a system that exports the data from ECS by collating it in a parallel job. We just send everything using SetData on the Compute Buffer.

    Obviously we could loop over the data and push it into a new array but I want to know if there is anything out there that could help solve this problem efficiently.

    Say, for example, I have 300k sprite components in ECS but only 30k are visible; when extracting data from ECS I only want to get the 30k, not the 300k.

    Some things I've thought about are:

    • SharedComponentData: Somehow update a value to say whether it is visible or not, then use the entity query filter. If possible, I'd imagine each time the SCD value is changed it costs to move the entity around.
    • Component Tag: Having a tag to say whether it is visible or not (added and removed on the entity by a system with a command buffer), then filter with it on a query. Works, but obviously not super fast when adding / removing.
    There are pros and cons to everything, but I'm just wondering what others may have done, and whether there's something I haven't thought of that we could try!
     
  2. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Tagging individual entities is going to be slow as it forces a sync point and does a bunch of memory restructuring. Since you are using ComputeBuffers, I would write the culling results to a NativeStream.ParallelWriter and then use ToArray to compact it and send to a ComputeBuffer.
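    A minimal sketch of what that could look like, assuming a Sprite2D component, a Sprite2DBuffer GPU struct with the constructor used later in this thread, and one stream bucket per entity (all names here are illustrative, not the thread's actual code):

    ```csharp
    [BurstCompile]
    private struct CullToStreamJob : IJobForEachWithEntity<Sprite2D, LocalToWorld>
    {
        [ReadOnly] public float2 Min;
        [ReadOnly] public float2 Max;
        public NativeStream.Writer StreamWriter;

        public void Execute(Entity entity, int index, [ReadOnly] ref Sprite2D sprite2D, [ReadOnly] ref LocalToWorld localToWorld)
        {
            // One bucket per entity: the stream must have been allocated with the
            // entity count as its foreachCount for these indices to be valid.
            StreamWriter.BeginForEachIndex(index);

            float x = localToWorld.Position.x;
            float y = localToWorld.Position.z;
            if (x >= Min.x && y >= Min.y && x <= Max.x && y <= Max.y)
                StreamWriter.Write(new Sprite2DBuffer(sprite2D.Index, sprite2D.State, ref localToWorld, sprite2D.Colour));

            StreamWriter.EndForEachIndex();
        }
    }

    // After the job completes, compact only the visible sprites:
    // var visible = stream.ToNativeArray<Sprite2DBuffer>(Allocator.TempJob);
    // _spriteBuffer.SetData(visible, 0, 0, visible.Length);
    ```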
     
  3. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Thanks for that DreamingImLatios. We tried 100K sprites in ECS with the NativeStream; didn't know it existed :)

    However, I also tried a NativeQueue with an IJob Dequeue on the render system into a persistent fixed-size array, and it seems to be much, much faster than converting to a native array every frame. I might have 300k sprites in my world, but realistically only a few thousand are ever visible on screen when zoomed in at the ranges you'd expect a player to be viewing at.

    We will probably revisit the NativeStream later I think but for now this is more than enough! Thanks so much for your suggestion though.

    Code (CSharp):
    [BurstCompile]
    private struct UpdateSprite2DBufferJob : IJob
    {
        public int                         SpriteRenderCount;
        public NativeQueue<Sprite2DBuffer> Sprite2DQueue;
        public NativeArray<Sprite2DBuffer> Sprite2DArray;

        public void Execute()
        {
            for (int i = 0; i < SpriteRenderCount; i++)
                Sprite2DArray[i] = Sprite2DQueue.Dequeue();
        }
    }
     
  4. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Interesting. You are manually converting a NativeQueue into a NativeArray. When you tested converting the NativeStream to a NativeArray, did you do that conversion inside a Burst job as well?
     
  5. Tudor-Nita

    Joined:
    Dec 10, 2009
    Posts:
    336
    He skips the allocation part though (the array is fixed size), which is by far the worst part when moving so much data around. Something similar could be achieved with NativeStream, I guess, but it certainly would not perform significantly better.

    Not a queue, but have you looked into AsArray and its ilk (AsDeferredJobArray, etc.) on NativeList? I didn't really check the internals, but it's probably just a pointer cast. It's significantly faster for me and my particular use-case (RaycastCommand abuse). A free-list approach with two NativeLists would essentially give you a queue anyway.

    I can picture the same cast being viable for NativeQueue with a bit of elbow grease.
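    As a rough sketch of that idea (assumed names; a single-threaded producer, so the concurrent-Enqueue caveat doesn't apply), the list view replaces the per-frame copy:

    ```csharp
    // Allocated once, reused every frame.
    var visibleList = new NativeList<Sprite2DBuffer>(100000, Allocator.Persistent);

    // A culling job appends visible sprites: visibleList.Add(sprite);
    // After that job completes:
    NativeArray<Sprite2DBuffer> view = visibleList.AsArray(); // a view over the list's memory, not a copy
    _spriteBuffer.SetData(view, 0, 0, visibleList.Length);
    ```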
     
  6. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Haven't tried AsDeferredJobArray; still a n00b in so many ways! For me, once I've dequeued, the data goes straight into the compute buffer like below.

    Code (CSharp):
    _args[0] = spriteRenderCount;
    _argsBuffer.SetData(_args);
    _spriteBuffer.SetData(_sprite2DBufferArray, 0, 0, spriteRenderCount);

    Graphics.DrawProceduralIndirect(_material, bounds, MeshTopology.Points, _argsBuffer, 0, null, null, ShadowCastingMode.Off, false, _layerMaskSprite);
    I don't need to slice it either, because I'm telling DrawProceduralIndirect how many to draw and using SetData to only copy what I need (hopefully that's how it works); both the array and the compute buffer are initialised to 100k.

    I'm sure there are faster ways of doing it, but I really don't know how to achieve that at the moment. I did toy around with the idea of getting rid of the queue and going straight to the NativeArray from the ParallelFor job, using an unsafe class with Interlocked.Increment(ref *m_Counter); to allocate into the array, but I didn't get it to work properly. If I could, it would maybe be even faster!
     
  7. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Actually no, this is what my brother coded after your reply, reading the very limited information out there about it. Maybe it can be done better? But we need this to be available for the render call.

    A ParallelFor job is populating the stream, though.

    Code (CSharp):
    public NativeArray<Sprite2DBuffer> GetSprite2DBuffer(Allocator allocator)
    {
        _updateSprite2DStreamJobHandle.Complete();
        var sprite2DBuffer = _sprite2DStream.ToNativeArray<Sprite2DBuffer>(allocator);
        _sprite2DStream.Dispose();
        return sprite2DBuffer;
    }
     
  8. Tudor-Nita

    Joined:
    Dec 10, 2009
    Posts:
    336
    Massaging your queue into an ordered NativeList (or two) and using AsArray (and co.) would remove the need to copy anything. It would just be a cast when feeding list.AsArray() to the buffer. The rest of your code would stay the same.

    If you don't use NativeQueue for the concurrent .Enqueue, this will be a piece of cake. If you do, better skip this.
     
  9. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    So I was originally thinking that the parallel writing to the NativeQueue would be slower than the parallel writing to the NativeStream because the NativeQueue would be doing a bunch of atomics. But it actually only does a few of them in blocks. That means that performance should be on par if not perhaps even a little better than NativeStream for enqueueing. Guess I learned something new today. However, you could convert the NativeStream to a NativeArray in a parallel job which is probably as optimal as you can possibly get for brute force culling.
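    One possible shape for that parallel compaction, assuming the stream was written with one bucket per foreach index and that per-bucket offsets (exclusive prefix sums of the bucket counts) have already been computed in a preceding single-threaded job (names are illustrative):

    ```csharp
    [BurstCompile]
    private struct CopyStreamBucketsJob : IJobParallelFor
    {
        public NativeStream.Reader Reader;

        [ReadOnly] public NativeArray<int> Offsets; // exclusive prefix sums of bucket sizes

        [NativeDisableParallelForRestriction]
        [WriteOnly] public NativeArray<Sprite2DBuffer> Output;

        public void Execute(int bucket)
        {
            // Each bucket writes to its own disjoint slice, so no atomics are needed.
            int count = Reader.BeginForEachIndex(bucket);
            int offset = Offsets[bucket];
            for (int i = 0; i < count; i++)
                Output[offset + i] = Reader.Read<Sprite2DBuffer>();
            Reader.EndForEachIndex();
        }
    }
    ```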
     
  10. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    This beats all the different implementations I've done so far, and I think I'll be sticking with it!

    Three systems now that run one after another, each internally parallel:
    1. IJobForEachWithEntity: Overwrite data in the NativeArray when visible, and also use another NativeArray of bool type to mark whether the specific array index is visible or not

    2. IJobParallelForFilter: Filter on the NativeArray bool type to return a list of indices that should be merged into the final output NativeArray

    3. IJobParallelFor: Take the output of step 2, then push it into the output NativeArray, which will then be placed directly into the Compute Buffer

    All I can say is this appears to be the fastest of all the methods; note all of the buffers are persistent. A good learning experience, and many thanks for everyone's input! :)

    So that's 3 NativeArrays to solve pushing data into one NativeArray. And I don't think NativeLists are supported for parallel writes, but I could be wrong, Tudor-Nita.
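    For reference, steps 2 and 3 could look roughly like this (illustrative names; step 1 is essentially the UpdateSprite2DArrayJob shown later in the thread):

    ```csharp
    [BurstCompile]
    private struct FilterVisibleJob : IJobParallelForFilter
    {
        [ReadOnly] public NativeArray<bool> Visible;

        // Scheduled with ScheduleAppend on a NativeList<int>; returning true keeps the index.
        public bool Execute(int index) => Visible[index];
    }

    [BurstCompile]
    private struct GatherVisibleJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<int> VisibleIndices; // e.g. indicesList.AsDeferredJobArray()
        [ReadOnly] public NativeArray<Sprite2DBuffer> AllSprites;
        [WriteOnly] public NativeArray<Sprite2DBuffer> Output;

        public void Execute(int i) => Output[i] = AllSprites[VisibleIndices[i]];
    }
    ```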
     
    Last edited: Sep 13, 2019
  11. Tudor-Nita

    Joined:
    Dec 10, 2009
    Posts:
    336
    They are not. I mean, maybe you could, if you handle write safety yourself and so on. What you currently have is hard to beat.
     
  12. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    You can use NativeList.AsArray() or NativeList.AsDeferredJobArray(), giving you safe access to write to the elements while guaranteeing that it won't be resized in parallel.
     
    pal_trefall likes this.
  13. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Many thanks Joachim_Ante! I read up about AsDeferredJobArray. I initially had 3 job handle completes; now there's only 1 on the main thread before rendering. I don't think it has made a huge performance difference, but I'd rather use this than have two extra job handle completes.

    For anyone that didn't know like me..
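    The pattern, roughly (assumed job names): the deferred array is resolved to the list's final contents only when the consuming job actually runs, so the consumer can be scheduled before the producer has filled the list, with no Complete() in between:

    ```csharp
    var visibleList = new NativeList<Sprite2DBuffer>(Allocator.TempJob);

    // Producer fills the list; consumer sees its final contents via the deferred array.
    var cullHandle   = new CullJob { Output = visibleList }.Schedule(inputDeps);
    var renderHandle = new BuildBufferJob { Sprites = visibleList.AsDeferredJobArray() }.Schedule(cullHandle);

    // Only one Complete() is needed, right before uploading to the ComputeBuffer.
    renderHandle.Complete();
    ```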

     
  14. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    So I ended up trying out an unsafe struct with a pointer to increase the index each time it decided that a sprite was visible. I thought it would be faster than the three steps I described earlier, but it seems it might actually be a tad slower. Any idea why that would be? My uneducated guess is that it has something to do with the interlocking, because I've gone down to just one array and one int now.

    Code (CSharp):
    [BurstCompile]
    private unsafe struct UpdateSprite2DArrayJob : IJobForEach<Sprite2D, LocalToWorld>
    {
        [ReadOnly]
        public int Sprite2DBufferMaxSize;

        [ReadOnly]
        public float2 Min;

        [ReadOnly]
        public float2 Max;

        [NativeDisableParallelForRestriction]
        [WriteOnly]
        public NativeArray<Sprite2DBuffer> Sprite2DArrayWriter;

        [NativeDisableUnsafePtrRestriction]
        public int* Sprite2DCounter;

        public void Execute([ReadOnly] ref Sprite2D sprite2D, [ReadOnly] ref LocalToWorld localToWorld)
        {
            if (*Sprite2DCounter >= Sprite2DBufferMaxSize)
                return;

            float x = localToWorld.Position.x;
            float y = localToWorld.Position.z;
            if (x < Min.x || y < Min.y || x > Max.x || y > Max.y)
                return;

            int count = System.Threading.Interlocked.Increment(ref *Sprite2DCounter) - 1;
            Sprite2DArrayWriter[count] = new Sprite2DBuffer(sprite2D.Index, sprite2D.State, ref localToWorld, sprite2D.Colour);
        }
    }
    What would be best to do if the results are fairly similar? Perhaps from Unity's perspective they would rather I stick with the 3 safe systems I made: Build, Filter, then Combine?
     
  15. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Interlocks (atomics) can be pretty slow due to the synchronization and potential cache thrashing they can create. If it were me, I would look instead to optimize your inner loop.

    First, pack those Min and Max into a float4 minmax.
    Second, bit twiddle your max values to be a precision unit lower.
    Third, make a float2 xy = localToWorld.Position.xz;
    Fourth, your if condition can look like this: if (math.bitmask(xy.xyxy >= minmax) != 3)

    This A) vectorizes the comparison operation, B) avoids generating 4 branch checks that thrash the branch predictor, and C) extracts the bool results and compares them against a mask literal to create a single test.

    You might not totally care about the bit twiddling part since it looks like you are culling against the center point of the sprite and not the bounds of the sprite.

    I am very curious how much of a speedup this gives you.

    Edit: Wrong mask constant I think. It's either 12 or 3, but I'm too tired to get it straight.
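    For what it's worth, working through the mask under the assumption that math.bitmask packs the bool4's x component into bit 0 (which Unity.Mathematics does), the constant comes out as 3:

    ```csharp
    float4 minmax = new float4(min.x, min.y, max.x, max.y);
    float2 xy = localToWorld.Position.xz;

    // For a point inside the rect: xy >= min gives (true, true) and xy >= max gives
    // (false, false), so the bool4 is (1, 1, 0, 0) and the bitmask is 0b0011 = 3.
    bool visible = math.bitmask(xy.xyxy >= minmax) == 3;
    ```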
     
    Last edited: Sep 15, 2019 at 5:02 AM
    cage_au likes this.
  16. Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,671
    Agree. Using atomics is almost always a horrible idea, both in terms of code complexity / safety reviewability and in terms of performance.

    Using atomics with multiple writers on the same field can easily be 100-1000x slower than just writing a normal variable when contention is common. This is no joke.

    Seriously, the number of times programmers shoot themselves in the foot and actually make multithreaded code slower than single-threaded code is intense.

    Here is a good read on the topic:
    https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/


    This is why in DOTS we make you jump through hoops to shoot yourself in the foot using those attributes.
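    A toy illustration of the contention problem in plain C# (not DOTS code; threadCount and itemsPerThread are placeholders): every Interlocked.Increment on the shared counter bounces one cache line between cores, while per-thread counters touch independent memory and are combined once at the end:

    ```csharp
    int sharedCounter = 0;
    long[] perThread = new long[threadCount];

    // Contended: all threads serialize on one cache line.
    Parallel.For(0, threadCount, t =>
    {
        for (int i = 0; i < itemsPerThread; i++)
            System.Threading.Interlocked.Increment(ref sharedCounter);
    });

    // Uncontended: each thread counts locally, one write per thread at the end.
    Parallel.For(0, threadCount, t =>
    {
        long local = 0;
        for (int i = 0; i < itemsPerThread; i++)
            local++;
        perThread[t] = local;
    });
    long total = perThread.Sum(); // same result, a fraction of the cost (System.Linq)
    ```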
     
  17. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    I think 3 works well; when I zoom in, it speeds up. Anyway, the results are in, please see below :) 100k sprites using the interlocking method, random samples over 20 seconds.

    OLD SYSTEM
    56.57ms
    57.25ms
    57.10ms

    UPDATED SYSTEM (with your suggestions)
    65.45ms
    63.45ms
    62.91ms

    I'm going to switch to the previous jobs now I think and compare results.
     
  18. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Thanks, yes, it appears atomics are very slow. I'm learning lots of stuff; I'm a hobby game programmer, so it's good to know.

    I reverted to my previous systems as described in the middle of this thread, with the Update, Filter and Set jobs. The results are super quick at 100k sprites vs the interlocking method!

    UPDATE
    4.23ms
    4.26ms
    4.85ms

    FILTER
    1.45ms
    1.46ms
    1.44ms

    SET
    1.32ms
    1.32ms
    1.35ms

    I guess you could say that, in total, that is at worst 7.66ms, compared to the 57ms it was taking with interlocking. Huge improvement!

    Note I was at full zoom out for all tests meaning all 100k sprites were visible.
     
  19. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Those four steps weren't meant to be individual optimizations but rather a single optimization. They are also completely independent of the interlocking method (other than that I was going off your interlocking code to identify them). So put them in the Update step, and I bet you could shave off another millisecond or two.
     
  20. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Sorry, when I said 3 I was referring to this.

    Here's the current job with your suggestions. It seems to improve things a bit; any other ideas are appreciated, and thanks for all your help!

    Samples with 100k sprites full zoom out.

    UPDATE
    4.26ms
    4.09ms
    4.04ms
    3.93ms
    4.41ms

    Code (CSharp):
    [BurstCompile]
    private struct UpdateSprite2DArrayJob : IJobForEachWithEntity<Sprite2D, LocalToWorld>
    {
        [ReadOnly]
        public float4 MinMax;

        [WriteOnly]
        public NativeArray<Sprite2DBuffer> Sprite2DArrayWriter;

        [WriteOnly]
        public NativeArray<bool> Sprite2DVisibleArrayWriter;

        public void Execute(Entity entity, int index, [ReadOnly] ref Sprite2D sprite2D, [ReadOnly] ref LocalToWorld localToWorld)
        {
            if (index >= SpriteRenderingSystem.Sprite2DBufferMaxSize)
                return;

            var xy = localToWorld.Position.xz;
            if (math.bitmask(xy.xyxy >= MinMax) != 3)
            {
                Sprite2DVisibleArrayWriter[index] = false;
                return;
            }

            Sprite2DArrayWriter[index]        = new Sprite2DBuffer(ref sprite2D, ref localToWorld);
            Sprite2DVisibleArrayWriter[index] = true;
        }
    }
     
  21. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    436
    Huh. That didn't make as much of a difference as I would have expected. I wonder what is so expensive in that loop that makes that impact so small. I usually see a 2x increase or more with that style of optimization. I guess I would have to look at the compiled assembly to figure out what's going wrong. 4 ms seems way too slow for 100,000 entities in parallel.

    Oh, you know what? Are you profiling with safety checks and friends disabled?
     
  22. cage_au

    Joined:
    Jul 16, 2017
    Posts:
    11
    Nope! Hehe, I can't repeat the test now as we had moved on, but I have left your changes in place. Thanks for your suggestions! I'm sure it will be faster with safety checks disabled, as you say.