OK ECS works in 16K chunks but what is the instruction size limit in ECS?

Discussion in 'Entity Component System' started by Arowx, Nov 2, 2018.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    My limited understanding of CPUs is that they have separate data and instruction cache limits. Or is it just one L1 cache for both data and instructions?

    If they are separate, does ECS need to limit the size of the algorithm it uses, e.g. its system size, to 16K, or is this hardware dependent?

    For instance, could you write an ECS system that is large but works well on most hardware, because the algorithm and data fit in L1 caches that are 512 KB in size, but then performs poorly on hardware with caches smaller than this?

    Is there a way for developers/Unity to detect cache sizes, or statistics on this aspect of player hardware?*

    *It's easy to get basic hardware statistics from Steam, e.g. GHz and RAM, but finding out the exact CPU and cache sizes is not as simple, unless anyone has found a player stats system that maps well to chip types and L1 cache sizes across all Unity-compatible hardware platforms.
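
    For reference, Unity's SystemInfo exposes some CPU details at runtime (model string, core count, clock and RAM), but as far as I know no field for L1/L2/L3 cache sizes. A minimal sketch:

    Code (CSharp):
    using UnityEngine;

    public class CpuInfoLogger : MonoBehaviour
    {
        void Start()
        {
            // SystemInfo reports the CPU model, core count and clock speed,
            // but has no property for cache sizes.
            Debug.Log("CPU: " + SystemInfo.processorType);
            Debug.Log("Cores: " + SystemInfo.processorCount);
            Debug.Log("Frequency: " + SystemInfo.processorFrequency + " MHz");
            Debug.Log("RAM: " + SystemInfo.systemMemorySize + " MB");
        }
    }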
     
  2. garryguan likes this.
  3. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    This is heavily CPU architecture dependent. Some have separate caches. Some keep them together.

    Writing tighter loops over your data will reduce the number of cache lines the CPU has to chew through. Otherwise, code benefits from quite powerful hardware branch prediction, but data access doesn't get that kind of help automatically. So focus on writing code that works well with the data, and it will run well on pretty much any system.
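
    For example, a minimal sketch of the kind of tight, data-linear loop I mean: a Burst-compiled job over a contiguous NativeArray (the job and field names are just illustrative):

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;

    [BurstCompile]
    struct ScaleJob : IJobParallelFor
    {
        public float Scale;
        public NativeArray<float> Values;

        // Sequential, branch-free access over contiguous memory:
        // the hardware prefetcher can stream cache lines ahead of the loop.
        public void Execute(int i)
        {
            Values[i] *= Scale;
        }
    }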

    Unity.Jobs.LowLevel.Unsafe.JobsUtility.CacheLineSize
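
    E.g. you can read it directly; it's a constant the job system assumes for padding, not a hardware query:

    Code (CSharp):
    using Unity.Jobs.LowLevel.Unsafe;
    using UnityEngine;

    public static class CacheLineDemo
    {
        public static void Log()
        {
            Debug.Log(JobsUtility.CacheLineSize); // prints 64
        }
    }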
     
  4. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,685
    It's a constant 64 bytes and does not depend on hardware.
     
  5. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
  6. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,685
    hippocoder likes this.
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    A very in-depth video that compares Ryzen and Intel Skylake CPU instruction throughput...


    The interesting bits are around the 11-minute mark, where he looks into the instruction fetch bandwidth of the CPUs, with Skylake chips managing 16 bytes of instructions per clock cycle from the L1 cache. Ryzen manages a similar 16 bytes of instructions per clock cycle.

    Note, though, that the cores have different execution units, e.g. for integer, floating point and vector operations, some of which can run in parallel.

    So in theory, as long as you keep your ECS systems down to the minimum number of instructions/ops needed, you should be able to process multiple data items in a single clock cycle.

    However, if you run larger programs, then multiple clock cycles will be needed just to load the next set of instructions.

    So what is the optimum size of an ECS system's instruction stream, 16 bytes or less?

    Or does the overhead of swapping between systems and passing data between them make writing longer programs beneficial, where the reduced per-process speed is made up for by less data passing between systems?
     
    Last edited: Nov 4, 2018
  8. bryanmcnett

    bryanmcnett

    Unity Technologies

    Joined:
    Aug 16, 2018
    Posts:
    12
    tl;dr: There is no hard restriction on the size of code in an ECS system that is enforced by ECS itself, or by cache hardware.

    Devices that can run the Unity Player have an instruction cache that is typically separate from the data cache, at least at the L1 level (the smallest, fastest level). It is possible to write a function so large that it doesn't fit into the instruction cache, but in practice this is less of a problem than with the data cache. In both cases, using slightly more memory than is in the cache results in a gradual falloff in performance.

    The 16KB ECS data chunk size exists because data needs to be contiguous in memory up to about a hardware memory page in size, in order to get all the advantages of locality of reference. But, if data chunks were too large, then nearly-empty chunks would contain a lot of wasted space. A trade-off was made between being big enough for time-efficiency, and small enough for space-efficiency.
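
    To make the trade-off concrete, here is some back-of-the-envelope math (the per-entity size is made up for illustration, and the chunk header is ignored):

    Code (CSharp):
    public static class ChunkMath
    {
        const int ChunkSize = 16 * 1024;   // the 16KB figure from above
        const int BytesPerEntity = 48;     // hypothetical archetype size

        public static void Print()
        {
            // When full: ~341 entities packed contiguously per chunk.
            int perChunk = ChunkSize / BytesPerEntity;

            // When nearly empty: 10 live entities use 480 bytes,
            // so ~97% of the chunk is wasted space.
            float waste = 1f - (10f * BytesPerEntity) / ChunkSize;

            System.Console.WriteLine(perChunk + " entities/chunk, " +
                (waste * 100f).ToString("F0") + "% wasted with 10 alive");
        }
    }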
     
    Last edited: Nov 6, 2018
  9. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Is there any benchmark data on this gradual falloff that could help developers write faster/better ECS systems?

    Why not have a per-system configurable chunk size, e.g. 1K, 2K, 4K, 8K, 16K, as some lighter systems might waste space or be able to work with much smaller chunks?

    Smaller chunks could also allow inter-mixed batch processing for higher throughput, e.g. a batch of integer ops with a batch of SIMD ops as separate systems.
     
  10. bryanmcnett

    bryanmcnett

    Unity Technologies

    Joined:
    Aug 16, 2018
    Posts:
    12
    Bigger chunks have greater economy of scale and less overhead when almost full, and smaller chunks waste less memory when almost empty. We may someday consider supporting multiple chunk sizes, mainly for these reasons.
     
  11. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Would it make much difference, really? Seems to me you'd have to go either really big or really small, and most projects will be both.
     
  12. bryanmcnett

    bryanmcnett

    Unity Technologies

    Joined:
    Aug 16, 2018
    Posts:
    12
    This is a complicated subject, and details vary wildly by platform and device. The goal is not generally to make your data have an overall size smaller than the cache, which often isn't even possible. The goal is to minimize the amount of energy and time spent pulling data from RAM into cache, and the strategies that are effective towards this goal tend to be the same regardless of specific platform or cache size.

    The simplest advice I can give is to access no more data than you need, and in as predictable a pattern as you can. If you can use half as many bits to store data, all else being equal, processing can proceed roughly twice as quickly. And if you can access that data in a predictable order of 1, 2, 3, 4, ..., N, or something comparably simple, many platforms can accelerate the transfer of data. (See the sketch below.)

    This simple advice doesn't apply in all circumstances, but it makes for a good starting point.
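
    As a toy illustration of both points (the quantized type here is just an example, under the assumption that 16 bits of precision is enough for the data in question):

    Code (CSharp):
    using Unity.Collections;

    // Storing a 0..1 quantity in a ushort instead of a float halves the
    // bytes streamed from RAM, so twice as many values fit per cache line.
    public struct Health
    {
        public ushort Packed;                  // 2 bytes instead of 4
        public float Value => Packed / 65535f;
    }

    public static class HealthMath
    {
        public static float Sum(NativeArray<Health> data)
        {
            float sum = 0f;
            // Indices 0, 1, 2, ..., N-1: a pattern prefetchers handle well.
            for (int i = 0; i < data.Length; i++)
                sum += data[i].Value;
            return sum;
        }
    }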
     
    recursive likes this.
  13. Superheftig

    Superheftig

    Joined:
    Mar 4, 2015
    Posts:
    14
    Is it possible right now to change the 16K to another value for all chunks in all worlds?

    My problem is that I have a grid with a predefined chunk size of 64/128/256 cells that I cannot change. It's important for me that cells that are neighbors are also located close together in memory, so my goal is to put one grid chunk into one ECS chunk.

    Based on my current data layout for one cell, I can put only 52 grid cells into one 16K ECS chunk. This means that I need 2 chunks for my 64 cells and I waste more than 60% of the memory. Because the grid is super big, with more than 1M cells, this is not acceptable.

    As you pointed out, the 16K is only a trade-off. It would be nice if I could decide on my own what the best value is for my special case. I could set it to around 20K so my grid chunks would fit perfectly into my ECS chunks.
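
    (Rough numbers: 16K / 52 cells is about 310 bytes per cell, so 64 cells need roughly 64 × 310 ≈ 19.4K, which is where the ~20K figure comes from.)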
     
  14. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,780
    Chunk sizes probably need to increment in powers of two to be most efficient,
    which means 8 / 16 / 32 / 64 etc. If that is correct, however, your next suitable chunk size would be 32K, meaning you would waste much of the space per chunk anyway.
     
  15. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    I am not sure whether there are any performance implications of power-of-two memory sizes on CPUs (unlike graphics cards). The actual chunk is even allocated slightly smaller, so that the header part takes up the remaining space to reach 16K.

    The only "alignment" I know of is word-aligned memory, so that assembly code can use the aligned variants of instructions. Each chunk already ensures word alignment, and one word is just a few bytes. So I think a 20K-sized chunk is possible without costing performance. For now you may try changing the kChunkSize variable in ArchetypeManager.cs and see how that goes. (Just don't use serialization.)
     
    Antypodish likes this.
  16. Superheftig

    Superheftig

    Joined:
    Mar 4, 2015
    Posts:
    14
    I'm using Unity 2018.3.0f2 and ECS 0.0.12-preview.21.
    I searched a little bit in the DLL, and ArchetypeManager does not contain a variable kChunkSize. The class Chunk contains a constant kChunkSize = 16128 which is never used. Nevertheless, in the next few lines the number is hardcoded into two methods. So I can't change it on my own without the source code.
    I know that there is the GitHub repo with the examples, but can I also download the full source code of ECS? (And if yes, where?)
     
  17. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    Yeah, I meant changing the source code. The Chunk struct is in that file. You can go to Library/PackageCache to find a copy of the UPM package. (Your change can be overwritten by an update or by a reimport-all.)
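
    Roughly, the edit would look like this (based on the constant quoted above; the exact layout varies by package version, and per the earlier post the same number is also hardcoded in two methods, which would need the same change):

    Code (CSharp):
    // In the ECS package source, e.g. the copy under Library/PackageCache:
    public unsafe struct Chunk
    {
        // Was 16128 in entities 0.0.12-preview.21; raised so that one
        // 64-cell grid chunk (~310 bytes per cell) fits in a single chunk.
        public const int kChunkSize = 20 * 1024;

        // ...
    }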
     
  18. Superheftig

    Superheftig

    Joined:
    Mar 4, 2015
    Posts:
    14
    Works like a charm. It also seems that there is no performance penalty (I have not profiled deeply), but the memory footprint is down!!

    I copied the ECS folder into the Assets folder and adapted the dependencies to make sure it does not get overwritten when the dependencies are resolved again. Of course, updating the ECS module now needs some manual work.

    Thanks a lot for the help! I'm now much closer to what I want to achieve.
     
    5argon likes this.