
Could you fit your entire game into the L1 cache?

Discussion in 'General Discussion' started by Arowx, Nov 9, 2021.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Modern CPUs have amazing cache sizes, e.g. 32,768 bytes for data and the same again for code.

    x86 opcodes can be as short as one or two bytes (full instructions can be longer), so at a minimum you could have roughly 16,384 instructions to run your game.
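As a back-of-the-envelope check of that figure (a sketch only: the 2-bytes-per-instruction average is an optimistic assumption, since real x86 instructions range from 1 to 15 bytes):

```python
# Rough estimate of how many instructions fit in a 32 KB L1 instruction cache.
# Assumes an optimistic 2 bytes per instruction; real x86 instructions
# range from 1 to 15 bytes, so the true count is usually lower.
L1_ICACHE_BYTES = 32 * 1024          # 32,768 bytes, as in the post
AVG_INSTRUCTION_BYTES = 2            # assumed average, not a hardware fact

max_instructions = L1_ICACHE_BYTES // AVG_INSTRUCTION_BYTES
print(max_instructions)              # 16384
```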

    What if you combined your DOTS systems into one large system? Instead of having lots of little systems interacting (and paying lots of management overhead), you write just one that fits in the CPU's cache.

    OK, this covers just your game scripts (as soon as you step into the Unity Engine code you're outside of it), but your core game systems could fit into the L1 cache, and that could give you optimal performance.

    This could be ideal for small and simple games, but does the DOTS API allow all your game data to be passed into a single system?

    It looks like all of its filtering mechanics and pre-built systems (Translation) would get in the way of using DOTS this way.

    Is anyone using DOTS with large code systems?
     
    PutridEx likes this.
  2. Timboc

    Timboc

    Joined:
    Jun 22, 2015
    Posts:
    234
    I couldn't fit my entire game into the L1 cache, no.
    Feel free to mark this post as answered :rolleyes:
     
  3. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Mostly. But you lose out on sanity, automatic dependency management, and version filtering. The latter is essential for not brute-forcing things. ;)

    My largest system is 980 lines and is a skinned mesh bindings reactive system. That's ignoring the modified HR V2 which was a spaghetti monster before I touched it.
     
    Occuros and xVergilx like this.
  4. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Just for reference, in the 1980s entire games were written to fit in less memory than our current L1 caches, including graphics.

    So it can be done.



    full size -> Commodore_Game_Ads_3.jpg (5200×5400) (telparia.com)
     
    Last edited: Nov 9, 2021
    Joe-Censored likes this.
  5. Micz84

    Micz84

    Joined:
    Jul 21, 2012
    Posts:
    436
    For me it is an academic question. Yes, you can write a game that will fit into the L1 cache, but it will probably be so simple that it doesn't matter much, at least for non-mobile games. In mobile games, the additional benefits of efficiency are less heat and lower battery usage.
     
    NotaNaN and colin_young like this.
  6. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Probably a better question is how large the assembled machine code of your core game is, and how can we check that in Unity?

    Burst Compiler, IL2CPP IL assemblies, exe size, build console log?

    PS: and remember you could keep your game on-chip in L2 (0.5 MB) or L3 (4-8 MB), giving you way more space.
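For a sense of scale, here is a quick sketch of how many 64-byte cache lines each of those levels holds (sizes taken from the post; actual sizes and line widths vary by CPU model):

```python
# How many 64-byte cache lines fit in each cache level.
# Level sizes are the ones quoted in the post (L1 32 KB, L2 0.5 MB,
# L3 4 MB); real CPUs differ, so treat these as illustrative.
CACHE_LINE = 64
levels = {"L1": 32 * 1024, "L2": 512 * 1024, "L3": 4 * 1024 * 1024}

for name, size in levels.items():
    print(name, size // CACHE_LINE, "cache lines")
```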
     
  7. OndrejP

    OndrejP

    Joined:
    Jul 19, 2017
    Posts:
    296
    This is a pointless question; if you wanted your whole game to fit into 32 KB, you'd probably write it in assembler anyway. Heard of 512-byte demos, or 4 KB demos? Those old-school "intros"?
    xoofx (the main developer behind SharpDX and Burst) used to make those as well.
    These demos were limited in code size, not actual memory usage.

    Here's an example:
    (with slides on how it was made, in the video description)


    Trying to fit your code and data into 32 KB, but run it in the context of the Unity Engine? Why?
    It won't stay in cache anyway; inside UnityEngine.dll it will jump all over memory. You gain nothing.
    Even if you fit everything into 32 KB, you won't get light-speed performance.
    Memory access is only "half" of the work; the actual ALU/FPU instructions are the second "half".

    So unless you plan to do "nothing" with the data, you'll be limited by the ALU (which is not common).

    I'd say being ALU-limited is also the goal of DOTS, and it achieves it pretty well.
    Things like linear memory access and auto-vectorization increase ALU utilization.

    To conclude:
    You don't need everything in the L1 cache to have optimal performance.
    It's enough to have it in much slower RAM and start loading it into cache in advance to prevent delays. This is what CPUs try to do, and linear memory access helps that a lot.
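The linear-access point above can be sketched as a toy example. Python hides real memory layout, so this is only an illustration of the two access patterns, not a cache benchmark: both loops do the same work, but only the first walks memory in the order a hardware prefetcher can stream ahead of use.

```python
import random

# Toy illustration: both loops compute the same sum, but the first walks
# the data in order (prefetch-friendly), while the second hops around
# unpredictably (each access risks a cache miss on real hardware).
data = list(range(10_000))

linear_sum = sum(data[i] for i in range(len(data)))    # sequential walk

indices = list(range(len(data)))
random.shuffle(indices)
shuffled_sum = sum(data[i] for i in indices)           # scattered walk

assert linear_sum == shuffled_sum                      # same work, different pattern
```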
     
    Last edited: Nov 10, 2021
  8. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    What about Project Tiny?
     
  9. unity-freestyle

    unity-freestyle

    Joined:
    Aug 26, 2015
    Posts:
    45
    What is the point dude... And even if it does fit in the cache, so what?
     
    MadeFromPolygons, apkdev and OndrejP like this.
  10. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It's the whole point of DOTS: performance. While DOTS goes out of its way to ensure alignment and cache coherence for data, every time a System runs, its code also has to be loaded from memory into the L1 cache.

    So if you have lots of little systems being popped into and out of the L1 instruction cache, the DOTS systems will become the bottleneck, not for data bandwidth but for opcode cache loading and unloading bandwidth.

    The question is how big DOTS systems are (all that boilerplate adds up) and how many you can run before you hit the L1 instruction cache limit and start paying L2, L3 and RAM access latencies.

    Could a DOTS-heavy game that uses a lot of small Systems be inherently slower than one that uses fewer, larger systems?
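The question can be framed with a quick sketch. The per-system code size here is a made-up figure purely for illustration (nothing in the thread measures it), but it shows the shape of the calculation:

```python
# Hypothetical figures only: if each small DOTS system compiled to ~2 KB of
# machine code, how many could the 32 KB L1 instruction cache hold before
# their code starts evicting each other? The 2 KB footprint is an assumed
# number for illustration, not a measured one.
L1_ICACHE = 32 * 1024
SYSTEM_CODE_BYTES = 2 * 1024     # assumed per-system footprint

systems_before_thrashing = L1_ICACHE // SYSTEM_CODE_BYTES
print(systems_before_thrashing)  # 16
```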
     
  11. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Hi, it's perfectly acceptable for people to ask questions, no matter how obtuse or difficult. Arowx is trying to create discussion around things like this. The real answer is not to suppress this voice but to move to general discussion as it's not directly working with DOTS as it is today but generally theoretical. Thanks for understanding.

    Also if you don't like a post, don't reply.
     
    Wattosan, ippdev, NotaNaN and 3 others like this.
  12. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Could you leave this one in DOTS, as I'm going to try to benchmark DOTS with lots of small systems vs fewer larger ones... (could be a while).

    My theory is that we should see a similar performance stepping graph as we do with benchmarks that show performance vs data size.

     
  13. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    No, it's moved. Theoretical discussion is a general subject. Actual discussion (questions and answers) about things belong in their relevant forums. This is why many complained this thread was spam. It is not "spam" if it is in general discussion.
     
  14. frosted

    frosted

    Joined:
    Jan 17, 2014
    Posts:
    4,044
    @hippocoder let's be honest - we should have a "shower thoughts with @Arowx" weekly thread sticky. :D
     
  15. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    I like to encourage open minded thought. There is usually a place people can do this where others can opt-in to participate rather than exclude them. That's my way :)

    It's a dangerous hubris to limit rules to only what we know.

    Further posts - please keep it on topic/constructive.
     
    MadeFromPolygons and Antypodish like this.
  16. GimmyDev

    GimmyDev

    Joined:
    Oct 9, 2021
    Posts:
    157
    NOT with unity
     
    angrypenguin and OndrejP like this.
  17. Joe-Censored

    Joe-Censored

    Joined:
    Mar 26, 2013
    Posts:
    11,847
    Yeah I was going to say the binary for an old Atari or arcade console can probably fit without emulation. Modern game engines are of course not designed for those size constraints.
     
  18. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,574
    Here is a more practical approach to squeezing a game into a small memory space, rather than theories.

    I think my embedding broke?
    Can you fit a whole game into a QR code?
    https://youtube.com/watch?v=ExwqNreocpg
     
    Last edited: Nov 11, 2021
  19. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Or maybe in DNA. Who develops the game developer? hmmmm
     
    Antypodish likes this.
  20. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,574
    This isn't that far-fetched at all, since there is research into storing data in DNA form.

    Random article on DNA data.
    https://daily.jstor.org/using-dna-as-a-memory-drive/

    Using DNA As a Memory Drive
    But I don't expect DNA to sit in the L1 cache of a CPU anytime soon :D
     
    Last edited: Nov 11, 2021
    hippocoder likes this.
  21. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,509
    Having a 32 KB cache does not mean that the CPU will read your whole 32 KB application into L1 and run it from there. It's going to depend entirely on how the CPU's caching algorithm works, and modern CPUs are clearly designed and optimised around multi-threaded execution. That's why they have those giant caches.

    Furthermore, you're going to be giving up a bunch of benefits by optimising for one piece of hardware while ignoring others. Even within your CPU itself. For example, each core typically has its own L1 cache, which means that you're going to take a performance hit as soon as you want to do multi-threading because multi-layered caches introduce significant overheads as soon as multiple cores are trying to access and work on the same parts of memory.

    A job system is usually a much better way to approach this. Your whole program doesn't have to fit in the cache. But a specific job is usually pretty small, and a job operating on data which is organised efficiently can take excellent advantage of the cache's pre-fetch algorithms for the data. If you're working at low level you can even give pre-fetch instructions which optimise this even further.

    Ultimately, the reason CPUs have those nice big caches these days is exactly this kind of parallel work pattern. Each thread can grab a chunk of data that isn't being used elsewhere, process it sequentially, and be loading ahead as it makes its way through. This minimises how often it's waiting on main memory access (a "cache miss"), which is important as main memory access is a shared resource among all cores.

    I suspect that modern CPU cache algorithms would be designed around "hyperthreading", as well. That's where each core has an extra set of registers so that instead of stalling when there's a cache miss, it can basically do an internal thread switch. I've not looked into it, but with that in mind I suspect that the giant cache is designed to be shared.
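The chunked, per-worker pattern described above can be sketched as follows. This is a minimal illustration of the idea, not Unity's Job System, which adds scheduling and dependency tracking on top; `process_chunk` is a hypothetical stand-in for per-entity work:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the pattern above: split the data into contiguous chunks,
# give each worker its own chunk to walk sequentially (cache- and
# prefetch-friendly, no two workers touching the same memory), then
# combine the partial results.
def process_chunk(chunk):
    return sum(x * 2 for x in chunk)          # stand-in for per-entity work

def parallel_process(data, workers=4):
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

data = list(range(1000))
assert parallel_process(data) == sum(x * 2 for x in data)
```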
     
    OndrejP likes this.
  22. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,509
    And before you say something like "Ah, so we should design our engines to do different work in small modules which each take full advantage of a CPU core and its cache", note that that's exactly what a Job System is, and they're nothing new. ;)
     
  23. ShilohGames

    ShilohGames

    Joined:
    Mar 24, 2014
    Posts:
    2,984
    Arowx:
    We don't actually need to get an entire game into the L1 cache. The goal is to get certain important data within the game to fit neatly within the L1 cache. For example, in a scene with thousands of items moving around, we don't need to fit the textures and 3D models into the L1 cache, but we would want to arrange position and rotation data within an array of structs, so the CPU can easily cache and read ahead that data.
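The array-of-structs idea above can be sketched like this. The field layout (3 position floats plus a 4-float quaternion) is illustrative only, not Unity's actual Translation/Rotation component layout:

```python
import struct

# Sketch of "array of structs": position (3 floats) and rotation
# (4-float quaternion) packed contiguously per entity, so a linear pass
# over entities is a linear pass over memory. Layout is illustrative.
ENTITY_FORMAT = "3f4f"                        # px py pz qx qy qz qw
entity_size = struct.calcsize(ENTITY_FORMAT)  # 28 bytes per entity

L1_DCACHE = 32 * 1024
entities_in_l1 = L1_DCACHE // entity_size
print(entity_size, entities_in_l1)            # 28 bytes each, 1170 fit in L1
```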
     
    unity-freestyle and OndrejP like this.
  24. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    With hyperthreading, higher register counts, and the ALU, FPU, MMX and SSE logic blocks in cores, can the CPU ensure it maximises logic-block utilisation, or would developers, compiler writers, Unity or OS developers need to combine algorithms for best performance? E.g. method A uses some SIMD whereas method B uses floats and the ALU, and maybe the really optimal way to run the code is to combine A+B on one core.

    DOTS has done amazing things for cache flow (no pun intended), but if a core is sitting there using only a fraction of its processing bandwidth, although very fast, could we utilise more of our cores' potential?

    And if you can keep more of your code and data in cache, then surely you are maximising the bandwidth available to you on the CPU.
     
    Last edited: Nov 12, 2021
  25. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
    Could you / Is it possible? Probably.

    Should you / Should it be done? Definitely not.

    This is also the copy + paste answer to 99% of Arowx threads :p

    I do enjoy the discussion points though ;)
     
    OndrejP and unity-freestyle like this.
  26. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Is poor code utilisation of cores the reason we are seeing CPU manufacturers opt for chips with mixed big and little cores?
     
  27. GimmyDev

    GimmyDev

    Joined:
    Oct 9, 2021
    Posts:
    157
    That's LITERALLY how it's done:
    - developers care about the O(x) algorithm implementation and possibly data alignment
    - the compiler looks at the code pattern and optimizes it using a library of code optimizations, eventually vectorizing what it can (see the Burst compiler for a domain-specific version)
    - the CPU looks at the instruction cache and schedules instructions based on dependencies to optimize throughput.

    Here is a tangential video about a similar issue.

    He shows that premature optimization is bad and that the compiler already covers you.
    On top of that, branch prediction is rather good in modern CPUs, as per this benchmark:
    https://blog.cloudflare.com/branch-predictor/

    If you are curious, you can use https://godbolt.org/ to see how any compiler optimizes snippets of code; you will recognize automatic vectorization and alignment, and potentially where one compiler does things better than another.

    Trivia: assembly is kind of no longer low-level code; it's a machine-specific "high-level code" that is compiled down to microcode by the CPU itself. You no longer have access to the metal, so there is probably a limit to how clever you can try to be. Verilog on an FPGA is the new metal.

    This is a small subset of the problem, but it hints at the bigger picture.
     
  28. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194

    This simple cube-moving benchmark shows how performant DOTS is at larger data sizes (green bars, ms, lower is faster), but it also highlights that the performance comes with an overhead which is only overcome when you have more than 128 entities to process.

    So is there room for smaller, faster, within-cache DOTS systems that process less data but at amazing speeds?
     
  29. GimmyDev

    GimmyDev

    Joined:
    Oct 9, 2021
    Posts:
    157
    Probably the paging: they use 4 KB or 8 KB chunks for data, and 128 entities with a 4x4 matrix each is 8 KB, so that seems to check out.
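The arithmetic behind that checks out, assuming a 4x4 matrix of 32-bit floats per entity:

```python
# Checking the arithmetic above: 128 entities, each carrying a 4x4 matrix
# of 32-bit floats (64 bytes), fills exactly 8 KB.
MATRIX_BYTES = 4 * 4 * 4          # 4x4 floats, 4 bytes each
entities = 128
total_bytes = entities * MATRIX_BYTES
print(total_bytes)                # 8192 bytes = 8 KB
```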
     
  30. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I thought it was 16 KB chunks for DOTS. It does highlight what happens when unstructured data exceeds the CPU caches, though: huge bumps in FPS going from 1k to 2k entities.

    I thought that my batch-based structured data would perform better, though, and the Jobs benchmark is disappointing; they are all Burst-enabled, so I'm probably doing something wrong here.
     
  31. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,131
  32. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194

    Worked on the batch / Jobs rendering using Graphics.DrawMeshInstanced().

    DOTS still not rendering yet, and now Jobs is slower than batch (same rendering process)?
     
    Last edited: Nov 13, 2021