Floating Point Errors and Large Scale Worlds

Discussion in 'World Building' started by Phelan-Simpson, Apr 15, 2018.

  1. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Each frame, the graphics pipeline multiplies all of the transforms in the Scene hierarchy above each object, into a total transform to position it. Adding one transform on top is not a significant change.

    Furthermore, if I move towards an object or I move the object towards me, the relative motion is identical: it is the same number of polygons moving in the same direction at the same time. In other words, there is no measurable difference in performance.

    The idea that it would give a large performance hit only comes from algorithms that loop over every object to move them, instead of using the transform hierarchy.
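As a rough illustration of the relative-motion idea above, here is a small Python sketch (not Unity code). The `Node` class and `world_positions` helper are hypothetical and handle translation only. They show that a single write to the root changes every descendant's composed position, without a loop that edits each object.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene node that stores only a local translation (x, y, z)."""
    local: tuple
    children: list = field(default_factory=list)


def world_positions(node, parent_world=(0.0, 0.0, 0.0)):
    """Compose each node's local offset with the offsets of its ancestors."""
    world = tuple(p + l for p, l in zip(parent_world, node.local))
    result = [world]
    for child in node.children:
        result.extend(world_positions(child, world))
    return result


root = Node((0.0, 0.0, 0.0), [Node((1.0, 0.0, 0.0)), Node((0.0, 2.0, 0.0))])

# Relative motion: instead of moving the player +5 on x, move the root -5 on x.
root.local = (-5.0, 0.0, 0.0)  # one write, however many children exist
moved = world_positions(root)
print(moved)
```

The composition pass still visits every node. The claim in the post is that the engine performs that pass each frame anyway, so this sketch says nothing about how expensive the pass is in practice.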

    If by "physics world" you are referring to Rigid Body objects, then they are the same in this regard as other objects: RBs are a component attached to objects and the transform of an RB is the transform of the parent object. Actual physics is done by mathematical models, not polygons.

    The current CFO demo already has a bunch of green RB boxes that are transformed with the rest of the scene and they behave normally when something moves through/interacts with them. The Avatar is also shown colliding normally with objects and gravity also working. All this is done under the relative motion system I described.

    I will soon be making a demo with more physics/action that also moves everything via relative motion. That may be more convincing for you.
     
  2. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    The current CFO asset is half price and has the code - very simple.

    Sorry, I did not answer the projectile question. I have not tried it, so I don't know what changes may be required. Logically, all the relative motion is the same - unless I am missing something!
     
  3. alex2008wxyz

    Joined:
    Jul 1, 2020
    Posts:
    5
    This is exactly what happens in Minecraft: at every power of 2 of distance from (0, 0, 0), rendering and the game's behaviour get worse and worse due to 32-bit or 64-bit precision error.
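To make the "every power of 2" remark concrete, here is a small Python sketch that measures the gap between neighbouring 32-bit floats at a few distances. The helper name `float32_spacing` is made up for this example, and it says nothing about how any particular game stores its coordinates.

```python
import struct


def float32_spacing(x: float) -> float:
    """Gap between x (as a positive float32) and the next larger float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current


# The spacing doubles each time the distance crosses a power of two.
for distance in (1.0, 1024.0, 1048576.0, 16777216.0):
    print(distance, float32_spacing(distance))
```

At a distance of 2^24 the spacing is already 2 whole units, so a 32-bit position that far out cannot represent a move of one unit.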
     
    cosmochristo likes this.
  4. Fribur

    Joined:
    Jan 5, 2019
    Posts:
    136
    Having read this entire thread multiple times, I still do not get that. Also, I have a sense this might be true for "Non-DOTS"/Monobehaviour, as there the to-be-updated transforms will never be arranged so that they fit nicely into CPU cache lines. Updating transforms in the Monobehaviour world is therefore bottlenecked by random RAM access (and not by arithmetic such as multiplications). In contrast, "Data Oriented Design" (see the DOTS forum) basically says the opposite of the quote above: updating millions of independent transforms (that are nicely arranged in chunks) happens in an instant, as it hardly needs any RAM access, whereas walking a transform hierarchy for sequential multiplications (with random access to child transforms in RAM) will take forever, as each access potentially waits hundreds of CPU cycles until the data has been fetched from DRAM. Have a look into
    com.unity.entities@0.50.0-preview.24\Unity.Transforms\LocalToParentSystem.cs


    Code (CSharp):
    void ChildLocalToWorld(float4x4 parentLocalToWorld, Entity entity, bool updateChildrenTransform)
    {
        [...]
        if (ChildFromEntity.HasComponent(entity))
        {
            // Recurse into each child, passing this entity's local-to-world matrix down.
            var children = ChildFromEntity[entity];
            for (int i = 0; i < children.Length; i++)
            {
                ChildLocalToWorld(localToWorldMatrix, children[i].Value, updateChildrenTransform);
            }
        }
    }
    As you can see, starting from a given root transform (matrix), all children are (randomly(!)) accessed in a recursive fashion. So the advice from Unity (somewhere in the forum) was to keep hierarchies as flat and wide as possible so that the transform system can blast through the updates. Having all transforms under one root would be the worst-case scenario.
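To show the shape of such a recursive pass, here is a simplified Python sketch (one-dimensional offsets, hypothetical `propagate` helper, not the Entities implementation). It compares a flat, wide hierarchy with a deep chain: both visit every node below the root once, and the recursion depth follows the depth of the hierarchy. It does not model memory layout or cache misses, which is the cost discussed above.

```python
def propagate(children_of, node, parent_world, local, world, depth=1):
    """Recursive local-to-world pass; returns the deepest recursion level reached."""
    world[node] = parent_world + local[node]
    deepest = depth
    for child in children_of.get(node, ()):
        deepest = max(
            deepest,
            propagate(children_of, child, world[node], local, world, depth + 1),
        )
    return deepest


n = 200
flat = {0: list(range(1, n + 1))}       # one root with n direct children
chain = {i: [i + 1] for i in range(n)}  # every node has exactly one child
local = {i: 1.0 for i in range(n + 1)}  # each node is offset by 1 unit

flat_world, chain_world = {}, {}
flat_depth = propagate(flat, 0, 0.0, local, flat_world)
chain_depth = propagate(chain, 0, 0.0, local, chain_world)

print(len(flat_world), flat_depth)    # nodes visited, recursion depth (flat)
print(len(chain_world), chain_depth)  # nodes visited, recursion depth (chain)
```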
     
    Last edited: Mar 29, 2022
  5. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Hi @Fribur, those are good points and this is an area (performance and caching) that I spend a lot of time on.

    Note that all my comments, implementations and examples relate to single-thread, standard-pipeline code: something anyone can use no matter what other API or pipeline they have. They also apply at the top application level and are not restricted to some lower level close to or inside a pipeline. I do not use other things like HDRP or DOTS (and have no objections to them). One day I will extend to multi-thread and parallel versions, if I ever have time :).

    I do not see the reason for the apparent assertion that if I am not talking about DOTS then my comments are somehow not valid to the discussion. DOTS or not, general-purpose algorithm complexity, code optimisations, analysis and implementations are always relevant to the application.

    The simplest continuous floating origin implementation constantly updates the root transform, in the monobehaviour.
    I would be very surprised if this was not very cache persistent.

    More sophisticated versions have more roots and a flatter hierarchy, but overall a very small number of root nodes compared to the number of scene nodes, and should therefore also be very cache persistent.

    For n=<number of Scene children/transforms> the above recursively traverses the entire Scene tree. I would estimate the entire execution is order n at best (based on commonly cited tree traversal complexity), although there may be some optimisations that, on average, improve that.

    The simplest CFO example is constant order: one transform executed by the mono per player move. For more sophisticated solutions it is order n, where n is the number of root nodes. Since the number of root nodes in my system is << the number of scene entity transforms, it is always going to be much faster just based on the complexity figure. In addition, I do not have the overhead of a recursive algorithm's method calls, like ChildFromEntity.HasComponent(entity).

    Furthermore, I never saw the point in telling Unity to update a child that I have not modified! The graphics pipeline will process all the transforms per frame, so doubling up on that is a waste of CPU resources.

    My approach is therefore invariant (or very close) to the scene complexity.

    The code example above slows down with scene complexity (size of n): no invariance.

    As for recursion! For large n, at the full recursive depth, the stack resources would be, quite frankly, staggering. "Hardly any RAM access"? If you define stack as not RAM, perhaps. It is also most unlikely to all fit in cache.

    Anyway, in my testing, my approach produces very good frame rates for the complexity, in my view, but perhaps the approach you describe above somehow defies all my logic and does better. Or maybe it uses multiple threads. I am working on single thread only until I have exhausted all the optimisations I can do with that.

    One example test: in the link below I demonstrate 1 million billboards in a Scene, averaging 24 fps or more, on a 2.3 GHz laptop (2019-2020 16" MacBook Pro), single thread. I think this is pretty good, or am I wrong?
     
    Last edited: Mar 29, 2022
  6. Fribur

    Joined:
    Jan 5, 2019
    Posts:
    136
    Many Thanks for the swift and extensive reply. Regrettably I still do not "get it" how attaching the entire scene to one root cannot have a negative performance impact (as appears to be the case for most others in this thread here).

    In particular I do not understand this at all:
    What exactly does the graphics pipeline do, and where, in relation to moving transform hierarchies? Could you maybe point that out in the attached profiler screenshots?

    Anyhow, I whipped up a quick test, which regrettably appears to confirm a huge performance impact from putting 270,000 sprites under one root, for both Monobehaviour and DOTS. I am more than happy to make you co-author of the repository: perhaps you can make some changes that would demonstrate your finding of no performance impact? Here are the editor statistics:
    1. DOTS 270,000 sprites, no update of transforms at all: 470 FPS
    2. DOTS 270,000 sprites, 270,000 transform updates per frame: 140 FPS
    3. DOTS 270,000 sprites, 1 root transform update per frame: 45 FPS

    1. MONO 270,000 sprites, no update of transforms at all: 6 FPS
    2. MONO 270,000 sprites, 270,000 transform updates per frame: 3.2 FPS
    3. MONO 270,000 sprites, 1 root transform update per frame: 1.6 FPS
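For readers who prefer frame times to frame rates, here is a short Python sketch that converts the figures listed above into milliseconds per frame. The helper name `frame_ms` is made up for this example. The numbers are the ones quoted in the lists, and the conversion is plain arithmetic, not a new measurement.

```python
def frame_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


# DOTS: scenario 2 (270,000 transform updates) vs scenario 3 (1 root update)
dots_extra_ms = frame_ms(45) - frame_ms(140)

# MONO: scenario 2 vs scenario 3
mono_extra_ms = frame_ms(1.6) - frame_ms(3.2)

print(round(dots_extra_ms, 1))  # 15.1
print(round(mono_extra_ms, 1))  # 312.5
```

The DOTS difference of roughly 15 ms per frame is consistent with the "added 15ms" noted for the profiler capture of scenario (3).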

    BTW: we are in agreement here: random access (= recursive walking of the hierarchy) is always terrible in ECS and in MONO world (= "HasComponent" in ECS world should be false, see 140 FPS vs 45 FPS above):
    Profile of DOTS Scenario (2):
    DOTS270000updates.png
    Profile of DOTS Scenario (3) (notice the added 15ms?):
    DOTS1root_update.png


    (If you would like to play around with my test project: the Monobehaviour is attached to the GameObject "root" and can be enabled/disabled there, and you can change the spawnUnderSceneRoot bool there. For ECS, just comment out the entire file SpriteSpawner.cs to disable it. You would need to edit the bool spawnUnderSceneRoot directly in the script.)
     
    Last edited: Mar 29, 2022
  7. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Hey @Fribur, this is good stuff. I would very much like to play with these tests and gain an "education" on DOTS.

    "BTW: we are in agreement here: random access (=recursive walking of hierarchy is) is always terrible in ECS and in MONO world (="HasComponent" in ECS world should be false," thanks for the clarification.

    Re cache and stack: I always think about cache when trying to optimise. The main thing I was trying to say about recursing to great depth is that the CPU/core will automatically cache the stack in use, so deep recursion will take up a lot of cache, and this is a resource that cannot be hogged without bad consequences. So, depending on cache size, depth of recursion and how many other parts of the application use the cache, I expect there to be some size n where there are a lot of cache misses caused by the recursive method competing with the rest of the application.

    I know there are smart circuits that mitigate cache misses and do write back and stuff - I don't understand it all. So for single thread stuff I think the unbounded recursion will cause random bad performance issues.

    My statements about single root node performance relate to my testing of single-core, single-thread apps. I have so far no data on parallel versions.
    Does your DOTS code increase the number of threads and cores with the number of root nodes? I am guessing the difference in frame rates relates to parallel execution.
    Does Unity itself automatically allocate multiple threads to root nodes? (This was suggested by someone a long while back but I could find no verification.)

    I love recursive algorithms: elegant, compact, very good for achieving a lot while sitting in cache. However, if the depth of recursion is not limited and predictable, I believe it would not be good in practice. For this reason I only use it in strongly constrained situations and generally never for the main runtime.

    Anyway, I have always wanted to try exploiting parallel execution but I am still very busy with the focus on single thread stuff. Thanks for the co-dev offer, I am more than happy to add a version of my Continuous Floating Origin and Dynamic Resolution Spaces (dynamically uses multiple roots) to the project (no cost) for joint performance exploration. I will have to concentrate for about a week on the next changes to my multiplayer CFO first.
     
  8. Fribur

    Joined:
    Jan 5, 2019
    Posts:
    136
    Nope, all benchmarked transform updates of the 270,000 sprites in the tests above are purely single-threaded jobs: that it takes only 0.5 ms to update 270,000 transforms stems only from the fact that all of them are stored consecutively in memory and can be fetched for processing by the CPU in very few reads into L1 cache lines. And it takes 15 ms to update them when attached to 1 root!!! Therefore I currently implement my planet-scale Floating Origin System by a mass update of all scene transforms every 5000 units (and not by rooting them to one parent) in the blink of an eye… but I am open to changing to CFO if it is truly better/simpler/"optimal"… I am really trying to understand your approach, but have so far failed to understand the fundamental benefit/lack of negative performance consequences, and I am not convinced.

    DOTS does indeed also get big performance gains from running jobs in parallel, processing the entities of a given job in parallel, the Burst compiler etc., and you can see in the screenshot that the hybrid renderer system as well as the TRSToLocalToWorld system do a lot of that (see the green walls across all cores)… but this does nothing to explain the performance difference between DOTS scenarios (2) and (3): that difference is purely single-threaded stuff.
     
    Last edited: Mar 30, 2022
  9. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Ah I see and that is perfectly reasonable. This single-thread cache persistence optimisation is really valuable and I will have to start analysing the trade-offs we are discussing. I will need your help for that because I know nothing about DOTS and really don't have the time/resources right now.

    I will start by setting up a test using my billboards experiment setup.

    Before doing so, can we make an agreed joint statement to test against? Here is my version:

    Note: I have already edited the text to be more specific about the meaning of caching, to distinguish it from caching of code/objects/data in code, where the programmer takes a reference to something and stores it in a variable that is used at high frequency (e.g. per frame). An implication of such programmer caching is that such variables will be auto-cached into the CPU cache by the CPU/CU.

    Assertion (edited): For a single-thread Unity application:
    Applying a translation operation to the entire World scenegraph of objects, via a single root node transform, is slower than applying a translation operation to each transform of a flattened hierarchy of objects, where the transforms are organised as a contiguous block of memory. The reason for the difference in performance is primarily that the block of transforms is automatically put into CPU cache, by the CPU, and remains cache-persistent.

    Also I would like to know if your floating origin code is using a continuous reverse translation, where the player never moves (from the origin), or is a threshold based origin rebasing method, that allows the player to move away from the origin up to a threshold distance and then snaps everything back?
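The two approaches named in the question above can be sketched side by side. This is a minimal one-dimensional Python illustration under stated assumptions: `THRESHOLD`, `threshold_rebase` and `continuous_floating_origin` are hypothetical names, and neither function is taken from either poster's code.

```python
THRESHOLD = 5000.0  # hypothetical rebasing distance, in world units


def threshold_rebase(player, world_offset, move):
    """Player moves away from the origin and snaps back once past THRESHOLD."""
    player += move
    if abs(player) > THRESHOLD:
        world_offset -= player  # shift the world by the player's offset
        player = 0.0            # put the player back at the origin
    return player, world_offset


def continuous_floating_origin(world_offset, move):
    """Player never leaves the origin; the world takes the reverse motion."""
    return 0.0, world_offset - move


p, w = 0.0, 0.0    # threshold method: player position, world offset
cp, cw = 0.0, 0.0  # continuous method: player position, world offset
for _ in range(3):
    p, w = threshold_rebase(p, w, 3000.0)
    cp, cw = continuous_floating_origin(cw, 3000.0)

# Both methods agree on the player's position relative to the world.
print(p - w, cp - cw)
```

After three moves of 3000 units, the threshold method leaves the player 3000 units from the origin, while the continuous method keeps the player at 0. The position relative to the world is the same in both cases.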
     
    Last edited: Mar 31, 2022
  10. Fribur

    Joined:
    Jan 5, 2019
    Posts:
    136
    This gives a pretty good understanding on DOTS:
    https://learn.unity.com/course/dots-best-practices

    https://forum.unity.com/threads/dots-development-status-and-next-milestones-march-2022.1253355/

    Rewording the assertion would effectively be rewriting the DOTS intro linked above and motivate why Unity is on a mission making DOTS an integral part of the engine in the years to come, so I will refrain from that.

    To be clear: I am not on a mission to convince anybody to use DOTS, or to compare Mono against DOTS. I am simply trying to figure out whether I would benefit from implementing a relatively minor/trivial change in my floating origin system: currently the player can move up to a threshold and then everything snaps back. I am contemplating whether to continue snapping back up to 200,000 entities every now and then, or to root them and snap back the root only, or even to do the snap back on every user input ("continuous FO", reverse transform). But by now I am pretty certain this is too wasteful.

    Plus, my benchmarks above indicate that it makes no difference whether DOTS or MONO is used: relatively speaking, moving one scene root always appears inferior performance-wise to moving all objects individually. And that is the only thing I am interested in: did I do something wrong? My repository should be pretty self-explanatory, I hope. (The Monobehaviour part is very similar to what you posted in this thread, except that I spawn 1000 instances each of 270 different sprites/meshes, and not just 1, to be a bit closer to reality.)
     
    cosmochristo likes this.
  11. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Ah - I completely misunderstood that part.

    That is actually not what floating origin ever did: floating origin never moves the player, and so it should not be called floating origin. I am ever so grateful that the Epic/Unreal devs called this origin rebasing: a name that reflects the actual algorithm. I started putting "continuous" before the term to distinguish it from the falsely named shifty methods (and published it as such in Healing Cracks in Cyberspace (2019)) because the false naming was misleading developers as to what the method really did. Can you please use a more accurate name?

    I realised this morning that I already had a CFO mass object test program that made heavy use of contiguous transforms cached in the code, which should get good CPU cache consistency. So I am going to upgrade that to Unity 2019 (from 2018) and measure performance on that. Hopefully I can measure some differences in having 1 and more than 1 hierarchy under CFO, and maybe I won't be too lazy to make a version with no hierarchy :-/

    I doubt it, although that does not correspond with my experience I did not compare the two.
    I will test against this specific thing with my test program and report back.
     
  12. Crossway

    Joined:
    May 24, 2016
    Posts:
    509
    Doesn't work with light probes. I can't move the light probes too :(
     
    Marcos-Elias likes this.
  13. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Hi @Crossway What doesn't work, DOTS, CFO? I am not familiar with probes.
     
  14. ruudvangaal

    Joined:
    May 15, 2017
    Posts:
    27
    cosmochristo likes this.
  15. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    To quote someone from my discord "you bake the lighting with all static flag's set, once the baking is done you can unstatic it". I am not sure about probes.
    Yes periodic/shifting methods have all kinds of problems. In my view they are not a general solution.
    For most applications, you are not transforming the whole scene, because parts of the scene that are not visible/active can be turned off/ignored or not even parented. For large complex worlds you could have most of the World not involved in any one player's field of view and interaction at any one time.
    At any rate, I do not get performance issues with continuous floating origin, it offers performance improvements in areas where shifting solutions don't, and I have not even applied all intended optimisations yet.

    I am curious, what are these vertical issues in HDRP?
     
  16. Velctor

    Joined:
    Dec 5, 2020
    Posts:
    3
    If you are making a space game/software, please give up everything related to light baking: almost everything will be dynamic/realtime.
     
    cosmochristo likes this.
  17. Wilhelm_LAS

    Joined:
    Aug 18, 2020
    Posts:
    55
    If I understood correctly, the conclusion is that using correct CFO (Continuous Floating Origin), by reverse transforming a parent node (a World Transform node on top of everything), is the literal solution for SJ (Spatial Jittering).

    I made a test that spawns exactly 10,000 colliders inside a sphere, exactly the same as what you did.
    Using SetActive does not seem efficient. It gives better results when I don't use the SetActive trick, in Unity 2022.3.7f1.

    I also read your article about fixing SJ (Spatial Jittering). What you said in that article is that most developers have somehow misunderstood the Floating Origin methods. I hope I understood correctly, and thank you for that article; it was really helpful to a young developer like me.
     
    Last edited: Oct 8, 2023
  18. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Thanks @Wilhelm_LAS , useful to know your results and thanks for testing it. I question everything myself so I welcome people like you doing objective independent tests.

    I'm not sure what the SetActive is referring to, unless it is an old post of mine where I once found better results on an old laptop. I later found it did not have any benefit on better hardware and more recent Unity. So I don't use it now, except for some LOD-style things. But one has to avoid calling it too frequently.

    Yes, your conclusions are correct: most developers use the old Unity wiki-style shifty system that also has loops to explicitly move all objects, physics etc. It is mystifying, unless it simply comes from not having enough depth of knowledge about scenegraphs and graphics pipelines.
     
    Wilhelm_LAS likes this.
  19. cosmochristo

    Joined:
    Sep 24, 2018
    Posts:
    250
    Some important things about this:
    1. It is not so much that using the scene root is the solution, but that it is an efficient implementation of 100% relative motion for a stationary observer (player/viewpoint) that never moves from the origin.
    2. There are two reasons why this is guaranteed to be better than an observer that moves by changes in absolute coordinate values:
      • The resolution of floating point space is most dense at the origin, and therefore all observations (calculations) like rendering, and physics, are most accurate there.
      • Relative motion is proven math. Hawking points this out repeatedly. He also paraphrases Einstein's law of motion, effectively: "for all moving observers ... only the relative information" is important. They were both talking about physics. I realised it also applies directly to computation.
    3. These laws apply to all position information: not just space but time, rotation, you name it. I use temporal floating origin as well: the player is always at time=0.
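The point about resolution being densest at the origin can be checked with a small Python sketch that emulates 32-bit floats. The helper `to_f32` and the chosen distances are assumptions made for this illustration. It applies the same 1 mm step to a coordinate at the origin and to one 100 km away.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


step = to_f32(0.001)  # hypothetical 1 mm move, in metres

# Observer at the origin: the step is kept almost exactly.
near = to_f32(0.0 + step)

# Observer 100 km from the origin: the same step is rounded away entirely.
far = to_f32(100000.0 + step) - 100000.0

print(near)  # very close to 0.001
print(far)   # 0.0
```

At 100,000 the spacing between neighbouring 32-bit floats is 2^-7 (about 0.0078), so any move smaller than half of that is lost when the result is stored.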
     
    Wilhelm_LAS likes this.