In general, is it worth splitting an entity with a large number of components into smaller entities?

Discussion in 'Entity Component System' started by Abbrew, Nov 22, 2021.

  1. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    For example, the difference between these two archetypes

    Soldier = ([Flanking Components],[Tactics Components],[Gun Components],[Morale Components],[Armor Components])

    Soldier = () // This one is just a "folder"
    Smart Flank = ([Subset of Flanking Components])
    Panic Flank = ([Subset of Flanking Components])
    Rookie Flank = ([Subset of Flanking Components])
    Reload = ([Subset of Gun Components])
    Aim = ([Subset of Gun Components])
    Shoot = ([Subset of Gun Components])
    etc...

    Most systems in my codebase operate only on a subset of a Soldier's components, or even a subset of a subset. Would splitting a huge entity like Soldier into more granular entities improve performance? There will be at most around 100 soldiers per scene, and each one is approaching 1 kB in size.
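
    Roughly what I mean in code (the component names here are placeholders, not my actual components):

    Code (CSharp):
    using Unity.Entities;

    // Placeholder components just for illustration.
    public struct SoldierTag : IComponentData { }
    public struct SoldierRef : IComponentData { public Entity Soldier; }
    public struct FlankState : IComponentData { /* ... */ }
    public struct ReloadState : IComponentData { /* ... */ }

    public static class ArchetypeSetup
    {
        public static void Create(EntityManager em)
        {
            // Option A: one fat Soldier archetype holding everything.
            var fatSoldier = em.CreateArchetype(typeof(FlankState), typeof(ReloadState) /*, ... */);

            // Option B: a thin Soldier "folder" plus granular entities that each
            // carry only what one system reads, linked back via SoldierRef.
            var folder = em.CreateArchetype(typeof(SoldierTag));
            var flank  = em.CreateArchetype(typeof(FlankState),  typeof(SoldierRef));
            var reload = em.CreateArchetype(typeof(ReloadState), typeof(SoldierRef));
        }
    }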
     
  2. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    966
    When systems operate only on a single archetype, it's best to make the archetype as small as possible. When systems need to get data from other archetypes, i.e. random lookups for what is essentially the same entity, it's best to make bigger ones.

    That said, from my tests it doesn't make much difference how big the archetype is, for the simple reason that most code is not vectorized. Only with vectorized code does the amount of data actually read per cache line make a huge difference; if the code isn't vectorized, it hardly matters.
    I hope I can explain this: if you read one item in the array of all reload components, the CPU actually fetches 64 bytes of that array, not just the one item, which is, let's say, 4 bytes. If you then read so many other components that the previously fetched cache line for the reload component is evicted from L1 or even L2/L3 before its neighbours are touched, fetching those additional 60 bytes was pointless and did nothing for performance. I'd say most jobs are guilty of this behaviour, which also means it doesn't matter how big or small the archetype is because you're not taking advantage of it anyway. But the smaller the archetype, the better the chances!
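
    As a rough illustration (ReloadState here is a made-up 4-byte component, not anyone's actual code), a job that touches only the one array it needs gets full value out of every 64-byte cache line it pulls in:

    Code (CSharp):
    using Unity.Entities;

    // Made-up component for illustration.
    public struct ReloadState : IComponentData { public float Progress; }

    public partial class ReloadSystem : SystemBase
    {
        protected override void OnUpdate()
        {
            float dt = Time.DeltaTime;
            // Only ReloadState is touched here, so every 64-byte cache line
            // fetched from the chunk contains 16 ReloadState values that all get used.
            Entities.ForEach((ref ReloadState reload) =>
            {
                reload.Progress += dt;
            }).ScheduleParallel();
        }
    }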

    (If I get something wrong, feel free to correct me)

    Personally, I have a really tough time getting anything vectorized. My game code hardly allows it, and I'm honestly not experienced enough to design algorithms for vectorized code. The mental shift from normal code to vectorized code is harder than anything I've ever done.
    I don't know any experts on vectorization either, so maybe it's more a case of it simply not being possible here anyway. I just think anything that is a plain for loop over a closed set of data should be vectorizable somehow. I've only gotten the most basic jobs vectorized (automatically). Anything more complex than summing up an array fails auto-vectorization and would need to be rewritten with custom intrinsics.
     
    Krajca likes this.
  3. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    You don't need vectorized code to get cache efficiency. Vectorized code just means you process more data at once, so you need to be fed data faster and are more likely to burn cycles waiting on memory. However, even scalar code can bottleneck the system with cache misses. Iterating sequential memory reduces this by:
    1) Loading adjacent elements in a single cache line
    2) Activating hardware prefetching, which causes adjacent cache lines to be preloaded in succession up to the 4 kB page boundary, where you are forced to take a cache miss.

    An entity approaching 1 kB implies a chunk capacity of 15 or 16 (chunks are 16 kB), which is low if your components are small (single ints or floats), but fine for vectors and matrices. You don't have many entities to begin with, so don't prematurely optimize.
     
    Krajca likes this.
  4. Krajca

    Krajca

    Joined:
    May 6, 2014
    Posts:
    347
    Besides the points listed in previous answers, I think splitting off is worth it mainly when you add/remove components at high frequency. I split my units' AI off from the more static data, where static in this context means no structural changes. That allows me to optimize somewhat.
     
    Antypodish and DreamingImLatios like this.
  5. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Interesting. So it's adding/removing components, not updating them, that may necessitate splitting entities into smaller ones? My Soldier entities will be very large and will constantly read/write data, but their archetypes will stay the same. I guess this means that having a ton of very large entities, even if not many fit in each chunk, is okay?

    Got it. I can share a more granular estimate of the Soldier entities' layout: many small components (think 2 or 3 int fields), a few small DynamicBuffers, and a few very large DynamicBuffers (think 50-100 structs of around 16 bytes each).

    Thanks. I don't think my game code lends itself well to being vectorized.
     
  6. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,124
    In general, asking questions about the general case will not lead to useful answers. DOD is about non-general solutions almost all the time.
    You will need heuristics like:
    - Am I processing some components together and some others together, without much overlap? Yes moves you toward splitting.
    - Am I processing almost all of these components together all the time? Yes moves you toward keeping it as is.

    Separate entities would mean more entities per archetype to process and more entities per cache line (if the component size is small), so bigger components and bigger entities are not good in this regard. But if you need to jump across all entities, components, and fields anyway, splitting them doesn't help.
     
  7. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    I think the preferred answer would give some hard numbers about what Burst does in the background, and some general knowledge about CPU caches and what they mean for Burst code.
    I know very, very little about the subject beyond the outlines of what is probably happening.

    As far as I gather, the Burst compiler will try to:
    Use SIMD to process multiple entities in the same clock cycle
    Prefetch the next block of memory, where the next entities are, ahead of time so it does not need to wait on CPU <-> memory transfers.

    But what I have no idea about is:
    How many entities SIMD processes at once
    How many entities fit in a block of memory being prefetched
    Does that mean it is ideal to keep the entity size such that (number of entities in a memory block) is divisible by (number of entities processed per SIMD operation)?

    I bet there are hard numbers and under-the-hood considerations that I don't even know I don't know, so hopefully this post describes the kind of information that could lead us to informed decisions on the ideal entity size.
     
  8. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,776
    @Abbrew It depends on your use case. If you have many values which are shared across entities, you may consider using blob references.

    For 100 units you will not notice a difference anyway.
    Neither for 1000, as long as it's Bursted and jobified.

    It's also worth looking into whether some values need to be stored on entities at all, i.e. constants (max health, if it never changes).
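
    Something like this, for example (SoldierConfig is just an illustrative name), so every soldier points at one shared, immutable blob instead of carrying its own copy of the constants:

    Code (CSharp):
    using Unity.Collections;
    using Unity.Entities;

    // Example read-only config shared by all soldiers.
    public struct SoldierConfig
    {
        public float MaxHealth;
        public float MaxMorale;
    }

    public struct SoldierConfigRef : IComponentData
    {
        public BlobAssetReference<SoldierConfig> Config;
    }

    public static class SoldierConfigBuilder
    {
        public static BlobAssetReference<SoldierConfig> Build()
        {
            using (var builder = new BlobBuilder(Allocator.Temp))
            {
                ref var root = ref builder.ConstructRoot<SoldierConfig>();
                root.MaxHealth = 100f;  // the constants live once in the blob,
                root.MaxMorale = 50f;   // not per entity in every chunk
                return builder.CreateBlobAssetReference<SoldierConfig>(Allocator.Persistent);
            }
        }
    }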

    If you split data components over multiple entities, then you will most likely need child entities, or entity-reference components pointing to these entities.
    In that case, chances are you will need to use GetComponentDataFromEntity lookups.

    So on one side you make your units and components more granular and can benefit greatly in jobs. But on the other hand, if you need to process a lot of data accessed through referenced entities, this may hurt performance.

    You need to test for your use case. But definitely, if you have structural changes, i.e. you are adding/removing tag components, you will be better off with smaller entities.

    Again, for 1k units it won't really make any difference. You would need to stress-test your use case scenario.
     
    Krajca likes this.
  9. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    Actually, the Burst compiler very rarely vectorizes chunk iteration; the iteration itself is almost exclusively overhead. All the Burst compiler is doing is translating math operations into native intrinsics (which uses SIMD instructions for the vector types in the Mathematics package) and being smart about register and stack usage in ways the Mono runtime wouldn't normally allow. Also, the Burst compiler doesn't do the memory prefetching; the hardware does that automatically. You can still observe speedups from linear memory access without Burst.

    A typical cache line is 64 bytes, meaning it can hold 5 float3 instances or 8 float2 instances. Honestly, don't worry about your data layout in chunks too much unless you have really low chunk occupancy for lots of entities (usually caused by shared components or lots of dynamic buffers with default capacities in chunk) or have so many entities that performance starts to be a concern. And in the latter case, be ready to break out IJobEntityBatch because you'll need it to push performance further.
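
    For reference, a bare-bones IJobEntityBatch looks roughly like this (Translation is used just as a stand-in component); you get the raw component arrays per batch and loop over them yourself:

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Mathematics;
    using Unity.Transforms;

    [BurstCompile]
    public struct MoveUpJob : IJobEntityBatch
    {
        public float DeltaTime;
        public ComponentTypeHandle<Translation> TranslationHandle;

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            // Raw, linear access to this batch's Translation array.
            NativeArray<Translation> translations = batchInChunk.GetNativeArray(TranslationHandle);
            for (int i = 0; i < batchInChunk.Count; i++)
            {
                var t = translations[i];
                t.Value += new float3(0f, DeltaTime, 0f);
                translations[i] = t;
            }
        }
    }

    // Scheduled from a SystemBase, assuming a matching EntityQuery (m_Query) already exists:
    // Dependency = new MoveUpJob
    // {
    //     DeltaTime = Time.DeltaTime,
    //     TranslationHandle = GetComponentTypeHandle<Translation>()
    // }.ScheduleParallel(m_Query, Dependency);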
     
  10. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    So basically it does not matter at all?
     
  11. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    When primarily using Entities.ForEach, it is more like there are a couple of sour spots to watch out for rather than a sweet spot to target. You can push beyond Entities.ForEach, but that requires a lot more time and effort. It is worth it for extremely hot codepaths, but in general isn't worth worrying about until you need more performance.
     
    Krajca and Antypodish like this.
  12. Arnold_2013

    Arnold_2013

    Joined:
    Nov 24, 2013
    Posts:
    286
    So splitting the data of a "Unit" might not be low-hanging fruit for more performance. But I assume it makes sense to split "UnitData"/RenderingData/PhysicsData?

    In my game loop a big chunk of time is spent in Havok physics, which I assume is highly optimized by smart people. So keeping extra data off this entity should improve the performance of the physics step. So my game unit's Collision/Trigger and Transform components are on this entity, and the "UnitData" has a link to this "unit PhysicsData Entity".

    The rendering data takes up a lot of space in the chunk, and in my game a lot of units are invisible (until the player sees them). So moving the rendering data to a separate entity makes sense to me.
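
    The link itself is just an Entity field on a component, and copying the transform over each frame then looks roughly like this (PhysicsEntityRef and UnitRenderPosition are made-up names for the sketch), which is indeed a random lookup:

    Code (CSharp):
    using Unity.Entities;
    using Unity.Mathematics;
    using Unity.Transforms;

    // Made-up link component living on the "UnitData" entity.
    public struct PhysicsEntityRef : IComponentData { public Entity Value; }
    // Made-up target the unit copies the physics transform into.
    public struct UnitRenderPosition : IComponentData { public float3 Value; }

    public partial class SyncFromPhysicsSystem : SystemBase
    {
        protected override void OnUpdate()
        {
            var ltwLookup = GetComponentDataFromEntity<LocalToWorld>(true);

            Entities.ForEach((ref UnitRenderPosition pos, in PhysicsEntityRef physicsRef) =>
            {
                // Random access into whatever chunk the physics entity lives in.
                pos.Value = ltwLookup[physicsRef.Value].Position;
            }).WithReadOnly(ltwLookup).ScheduleParallel();
        }
    }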
     
  13. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    Consider then that you would need to sync the rendering data and physics data every frame, and that will unquestionably be random access, so that alone might eat up all the performance gains from splitting. But I have no hard data to know for sure.
     
  14. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    966
    I guess the more I think about it, splitting in ECS doesn't really make much sense unless random access doesn't happen per-frame.
    And I also wonder if anything can really be compartmentalized so well that you don't have random access. Even for something very simple like a rendering entity, the LocalToWorld is random access by nature.
    I'm really interested in how some of you have solved this, or whether random access is simply natural and something we shouldn't worry about (which doesn't mean we should overdo it).
    Which brings me to something I was pondering with @DreamingImLatios: having two aligned chunk archetypes where access isn't local, but not random either.
     
  15. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    And for the record, I think it is a bad idea. If your entities are so large that you need to split them, and there are enough of them that the random accesses are measurable, then you either have an extreme server-only edge case or you are doing something wrong. In the case of the latter, make sure to size your components appropriately, set your dynamic buffers' internal chunk capacities correctly, and use blobs for collections of read-only data. For the former, modify Entities to use a larger chunk size.
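
    The buffer capacity part is just the [InternalBufferCapacity] attribute per element type; a rough sketch (element names made up): give a big buffer a capacity of 0 so it always lives on the heap instead of bloating the chunk, and size a genuinely small buffer so it stays in-chunk.

    Code (CSharp):
    using Unity.Entities;
    using Unity.Mathematics;

    // A large buffer (50-100 elements of ~16 bytes): keep it out of the chunk entirely
    // so it doesn't wreck chunk occupancy.
    [InternalBufferCapacity(0)]
    public struct WaypointElement : IBufferElementData { public float4 Value; }

    // A buffer that genuinely stays tiny: let it live in the chunk.
    [InternalBufferCapacity(4)]
    public struct RecentTargetElement : IBufferElementData { public Entity Value; }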
     
    Krajca likes this.
  16. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    966
    I didn't mean exclusively for large entities. Netcode, for example, has the principle of a game entity, the ghost, and a render presentation, and they all have LocalToWorld in common, which means lots of copying around. Netcode has some optimisations in place to use memcpy, but it's still a copy of basically the same data, which could be shared. Something like ISharedComponentData doesn't really help here since it introduces new archetypes. The game entity and the ghost have more going on, so in that sense it's not the best example; the render presentation, though, does not, and is quite common in Entities projects. The best example is probably the Transform hierarchy system with children and parents. I think that's flawed design from the ground up. You can also see it when testing performance; it's pretty poor.

    For the future of Entities, something that solves this problem could be interesting: a set of archetypes bound together by one or more components shared between entities, either in the same chunk or just close in memory so the pointer access is faster. I admit it sounds complicated. :) I just believe the design Entities has now isn't completely future-proof, because any form of hierarchy is bound to end up in another chunk/archetype and random access, and not every hierarchy is preventable.
     
  17. BobFlame

    BobFlame

    Joined:
    Nov 12, 2018
    Posts:
    95
    I only make an entity reference (in your case, split components) when I have no choice. That usually means several features of the same entity need to use the same component interface of another system. Since you can't add multiple components of the same type to a single entity, you have to make a separate entity and link it to the main one. In all other cases, I keep components together; it makes your code cleaner and more optimized when multiple systems need to communicate with each other.
     
  18. MaNaRz

    MaNaRz

    Joined:
    Aug 24, 2017
    Posts:
    117
    Correct me if I'm wrong, but wasn't there a comment from Joachim somewhere that Unity is aware of the problem of huge entities and is thinking about adding some way to make the chunk size much larger for specific archetypes? Wouldn't that solve all the problems discussed here?
     
  19. Arnold_2013

    Arnold_2013

    Joined:
    Nov 24, 2013
    Posts:
    286
    Having a 16 kB chunk should be enough; even big entities will fit in there a few times. I think you can make most games without splitting up entities; it's already so fast compared to Mono/managed Unity. But if you want to really push it to the limit, you should pack the data a system uses as compactly as possible (I think/assume).

    I have a "lifetimeSystem", it reduces a float value with deltatime and when the value is negative it adds a "DestroyTag". Currently this system gets all my unit data from different components, say 100 floats -> 400 bytes -> 40 entities per chunk) including the lifetime. (local to world is already 16 floats, so 100 might not even be "big")

    The Lifetime floats are linear in memory within the chunk, so the LifetimeSystem could/should SIMD the operation of reducing the float by deltaTime. But after 40 entities it needs to pull a new chunk into the L1 cache to continue.

    If I refactor and make the lifetime a single component on a new entity, I could fit roughly 4000 entities in a chunk and the L1 cache would not need to be refilled as often. By adding an Entity reference to the main entity I could poll whether the lifetime is over, or by adding an entity reference to the lifetime component I could add the DestroyTag to the main entity (now the lifetime entity would only fit about 1333 per chunk, because of the added reference).

    Since the L1 cache is already fast and the next chunk has probably been prefetched before it is needed, I doubt there are significant gains to be made. But I do find it interesting to see what the optimal solution is, and we are in DOTS for the speed. So if a better split of entities could somehow give an X% boost, it would be good to keep in mind while designing the entity archetypes.
     
    Tharis, Guedez and Antypodish like this.
  20. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    I actually want really small chunks rather than large ones. I aggressively use ISharedComponentData to spatially split my entities. All my chunks are super empty.
     
  21. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    966
    I used this approach for quite some time, but with network programming I ditched it. Introduce a TickManager and use ticks instead. If you already know what lifetime something has, why not just store an end-tick uint? That way you get rid of the inaccuracies and don't have to subtract a float every frame for your entities.
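
    Something like this (ExpireTick, CurrentTick and the TickManager are made-up names for the sketch): write the end tick once when the thing spawns, then every frame is just a read-only compare.

    Code (CSharp):
    using Unity.Entities;

    public struct ExpireTick : IComponentData { public uint Value; }   // written once at spawn
    public struct CurrentTick : IComponentData { public uint Value; }  // singleton the TickManager bumps

    public partial class ExpireSystem : SystemBase
    {
        private EndSimulationEntityCommandBufferSystem _ecbSystem;

        protected override void OnCreate()
        {
            _ecbSystem = World.GetOrCreateSystem<EndSimulationEntityCommandBufferSystem>();
        }

        protected override void OnUpdate()
        {
            uint tick = GetSingleton<CurrentTick>().Value;
            var ecb = _ecbSystem.CreateCommandBuffer().AsParallelWriter();

            Entities.ForEach((Entity entity, int entityInQueryIndex, in ExpireTick expire) =>
            {
                // Compare only; no per-entity write every frame.
                if (tick >= expire.Value)
                    ecb.DestroyEntity(entityInQueryIndex, entity);
            }).ScheduleParallel();

            _ecbSystem.AddJobHandleForProducer(Dependency);
        }
    }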
     
    Guedez, Krajca and Arnold_2013 like this.