OK ECS works in 16K chunks but what is the instruction size limit in ECS?

Discussion in 'Entity Component System' started by Arowx, Nov 2, 2018.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    My limited understanding of CPUs is that they have separate data and instruction cache limits. Or is it just one L1 cache for both data and instructions?

    If they are separate, does ECS need to limit the size of the algorithm it uses, e.g. its system size, to 16K, or is this hardware dependent?

    For instance, could you write an ECS system that is large but works well on most hardware, because the algorithm and data fit in L1 caches that are 512 KB in size, but then performs poorly on hardware with caches smaller than this?

    Is there a way for developers/Unity to detect cache sizes, or statistics on this aspect of player hardware?*

    *It's easy to get basic hardware statistics from Steam, e.g. GHz and RAM, but finding out the exact CPU and cache sizes is not as simple, unless anyone has found a player stats system that maps well to chip types and L1 cache sizes across all Unity-compatible hardware platforms.
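
    For reference, Unity's SystemInfo exposes some CPU details at runtime (model string, core count, clock and RAM), but as far as I know no field for L1/L2/L3 cache sizes. A minimal sketch:

    Code (CSharp):
    using UnityEngine;

    public class CpuInfoLogger : MonoBehaviour
    {
        void Start()
        {
            // SystemInfo reports the CPU model, core count and clock speed,
            // but has no property for cache sizes.
            Debug.Log("CPU: " + SystemInfo.processorType);
            Debug.Log("Cores: " + SystemInfo.processorCount);
            Debug.Log("Frequency: " + SystemInfo.processorFrequency + " MHz");
            Debug.Log("RAM: " + SystemInfo.systemMemorySize + " MB");
        }
    }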
     
  2. garryguan likes this.
  3. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    This is heavily CPU architecture dependent. Some have separate caches. Some keep them together.

    Writing tighter loops over your data will reduce the number of cache lines the CPU has to chew through. Otherwise, code benefits from quite powerful hardware branch prediction, but data access doesn't get that kind of help automatically. So focus on writing code that works well with the data, and it will run well on pretty much any system.
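
    For example, a minimal sketch of the kind of tight, data-linear loop I mean: a Burst-compiled job over a contiguous NativeArray (the job and field names are just illustrative):

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;

    [BurstCompile]
    struct ScaleJob : IJobParallelFor
    {
        public float Scale;
        public NativeArray<float> Values;

        // Sequential, branch-free access over contiguous memory:
        // the hardware prefetcher can stream cache lines ahead of the loop.
        public void Execute(int i)
        {
            Values[i] *= Scale;
        }
    }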

    Unity.Jobs.LowLevel.Unsafe.JobsUtility.CacheLineSize
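
    E.g. you can read it directly; it's a constant the job system assumes for padding, not a hardware query:

    Code (CSharp):
    using Unity.Jobs.LowLevel.Unsafe;
    using UnityEngine;

    public static class CacheLineDemo
    {
        public static void Log()
        {
            Debug.Log(JobsUtility.CacheLineSize); // prints 64
        }
    }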
     
  4. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,685
    It's a constant 64 bytes and does not depend on hardware.
     
  5. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
  6. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,685
    hippocoder likes this.
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    A very in-depth video that compares Ryzen and Intel Skylake CPU instruction throughput...


    The interesting bits are around the 11-minute mark, where he looks into the instruction fetch bandwidth of the CPUs, with Skylake chips managing 16 bytes of instructions per clock cycle from the L1 cache. Ryzen manages a similar 16 bytes of instructions per clock cycle.

    Note, though, that the cores have different execution units, e.g. for integer, floating point and vector operations, some of which can run in parallel.

    So in theory, as long as you keep your ECS systems down to the minimum number of instructions/ops needed, you should be able to process multiple data items in a single clock cycle.

    However, if you run larger programs, then multiple clock cycles will be needed just to load the next set of instructions.

    So what is the optimum size of an ECS system's instruction stream, 16 bytes or less?

    Or does the overhead of swapping between systems and passing data between them make writing longer programs beneficial, where the reduced per-process speed is made up for by less data passing between systems?
     
    Last edited: Nov 4, 2018
  8. bryanmcnett

    bryanmcnett

    Unity Technologies

    Joined:
    Aug 16, 2018
    Posts:
    12
    tl;dr: There is no hard restriction on the size of code in an ECS system that is enforced by ECS itself, or by cache hardware.

    Devices that can run the Unity Player have an instruction cache that is typically separate from the data cache, at least at the L1 level (the smallest, fastest level). It is possible to write a function so large that it doesn't fit into the instruction cache, but in practice this is less of a problem than with the data cache. In both cases, using slightly more memory than is in the cache results in a gradual falloff in performance.

    The 16KB ECS data chunk size exists because data needs to be contiguous in memory up to about a hardware memory page in size, in order to get all the advantages of locality of reference. But, if data chunks were too large, then nearly-empty chunks would contain a lot of wasted space. A trade-off was made between being big enough for time-efficiency, and small enough for space-efficiency.
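
    To make the trade-off concrete, here is some back-of-the-envelope math (the per-entity size is made up for illustration, and the chunk header is ignored):

    Code (CSharp):
    public static class ChunkMath
    {
        const int ChunkSize = 16 * 1024;   // the 16KB figure from above
        const int BytesPerEntity = 48;     // hypothetical archetype size

        public static void Print()
        {
            // When full: ~341 entities packed contiguously per chunk.
            int perChunk = ChunkSize / BytesPerEntity;

            // When nearly empty: 10 live entities use 480 bytes,
            // so ~97% of the chunk is wasted space.
            float waste = 1f - (10f * BytesPerEntity) / ChunkSize;

            System.Console.WriteLine(perChunk + " entities/chunk, " +
                (waste * 100f).ToString("F0") + "% wasted with 10 alive");
        }
    }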
     
    Last edited: Nov 6, 2018
  9. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Is there any benchmark data on this gradual falloff that could help developers write faster/better ECS systems?

    Why not have a per-system configurable chunk size, e.g. 1K, 2K, 4K, 8K, 16K, as some lighter systems might waste space or be able to work with much smaller chunks?

    Smaller chunks could also allow inter-mixed batch processing for higher throughput, e.g. a batch of integer ops with a batch of SIMD ops as separate systems.
     
  10. bryanmcnett

    bryanmcnett

    Unity Technologies

    Joined:
    Aug 16, 2018
    Posts:
    12
    Bigger chunks have greater economy of scale and less overhead when almost full, and smaller chunks waste less memory when almost empty. We may someday consider supporting multiple chunk sizes, mainly for these reasons.
     
  11. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Would it make much difference, really? Seems to me you'd have to go either really big or really small, and most projects will be both.
     
  12. bryanmcnett

    bryanmcnett

    Unity Technologies

    Joined:
    Aug 16, 2018
    Posts:
    12
    This is a complicated subject, and details vary wildly by platform and device. The goal is not generally to make your data have an overall size smaller than the cache, which often isn't even possible. The goal is to minimize the amount of energy and time spent pulling data from RAM into cache, and the strategies that are effective towards this goal tend to be the same regardless of specific platform or cache size.

    The simplest advice I can give is to access no more data than you need, and in as predictable a pattern as you can. If you can use half as many bits to store data, all else being equal, processing can proceed roughly twice as quickly. And if you can access that data in a predictable order of 1, 2, 3, 4, ..., N, or something comparably simple, many platforms can accelerate the transfer of data. (See the sketch below.)

    This simple advice doesn't apply in all circumstances, but it makes for a good starting point.
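
    As a toy illustration of both points (the quantized type here is just an example, under the assumption that 16 bits of precision is enough for the data in question):

    Code (CSharp):
    using Unity.Collections;

    // Storing a 0..1 quantity in a ushort instead of a float halves the
    // bytes streamed from RAM, so twice as many values fit per cache line.
    public struct Health
    {
        public ushort Packed;                  // 2 bytes instead of 4
        public float Value => Packed / 65535f;
    }

    public static class HealthMath
    {
        public static float Sum(NativeArray<Health> data)
        {
            float sum = 0f;
            // Indices 0, 1, 2, ..., N-1: a pattern prefetchers handle well.
            for (int i = 0; i < data.Length; i++)
                sum += data[i].Value;
            return sum;
        }
    }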
     
    recursive likes this.
  13. Superheftig

    Superheftig

    Joined:
    Mar 4, 2015
    Posts:
    14
    Is it possible right now to change the 16K to another value for all chunks in all worlds?

    My problem is that I have a grid with a predefined chunk size of 64/128/256 cells that I cannot change. It's important for me that cells that are neighbors are also located close together in memory, so my goal is to put one grid chunk into one ECS chunk.

    Based on my current data layout for one cell, I can put only 52 grid cells into one 16K ECS chunk. This means that I need 2 chunks for my 64 cells and I waste more than 60% of the memory. Because the grid is super big, with more than 1M cells, this is not acceptable.

    As you pointed out, the 16K is only a trade-off. It would be nice if I could decide on my own what the best value is for my special case. I could set it to around 20K so my grid chunks would fit perfectly into my ECS chunks.
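
    (Rough numbers: 16K / 52 cells is about 310 bytes per cell, so 64 cells need roughly 64 × 310 ≈ 19.4K, which is where the ~20K figure comes from.)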
     
  14. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,780
    Chunk sizes probably need to increment in powers of two to be most efficient,
    which means 8 / 16 / 32 / 64 etc. If that is correct, however, your next suitable chunk size would be 32K, meaning you would waste much of the space per chunk anyway.
     
  15. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    I am not sure whether there are any performance implications of power-of-two memory sizes on CPUs (unlike graphics cards). The actual chunk is even allocated slightly smaller, so that the header part takes up the remaining space to reach 16K.

    The only "alignment" I know of is word-aligned memory, so that assembly code can use the aligned variants of instructions. Each chunk already ensures word alignment, and one word is just a few bytes. So I think a 20K-sized chunk is possible without costing performance. For now you may try changing the kChunkSize variable in ArchetypeManager.cs and see how that goes. (Just don't use serialization.)
     
    Antypodish likes this.
  16. Superheftig

    Superheftig

    Joined:
    Mar 4, 2015
    Posts:
    14
    I'm using Unity 2018.3.0f2 and ECS 0.0.12-preview.21.
    I searched a little bit in the DLL, and ArchetypeManager does not contain a variable kChunkSize. The class Chunk contains a constant kChunkSize = 16128 which is never used. Nevertheless, in the next few lines the number is hardcoded into two methods. So I can't change it on my own without the source code.
    I know that there is the GitHub repo with the examples, but can I also download the full source code of ECS? (And if yes, where?)
     
  17. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    Yeah, I meant changing the source code. The Chunk struct is in that file. You can go to Library/PackageCache to find a copy of the UPM package. (Your change can be overwritten by an update or by a reimport-all.)
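
    Roughly, the edit would look like this (based on the constant quoted above; the exact layout varies by package version, and per the earlier post the same number is also hardcoded in two methods, which would need the same change):

    Code (CSharp):
    // In the ECS package source, e.g. the copy under Library/PackageCache:
    public unsafe struct Chunk
    {
        // Was 16128 in entities 0.0.12-preview.21; raised so that one
        // 64-cell grid chunk (~310 bytes per cell) fits in a single chunk.
        public const int kChunkSize = 20 * 1024;

        // ...
    }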
     
  18. Superheftig

    Superheftig

    Joined:
    Mar 4, 2015
    Posts:
    14
    Works like a charm. It also seems that there is no performance penalty (I have not profiled deeply), but the memory footprint is down!!

    I copied the ECS folder into the Assets folder and adapted the dependencies to make sure it does not get overwritten when the dependencies are resolved again. Of course, updating the ECS module now needs some manual work.

    Thanks a lot for the help! I'm now much closer to what I want to achieve.
     
    5argon likes this.