Burst / DOTS Systems Performance Behaviour / Documentation

Discussion in 'Burst' started by jashan, Aug 6, 2020.

  1. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    I'm currently learning DOTS and doing some experiments to get a feeling for the performance behavior with different numbers of entities and different layouts for components. I'm using Unity 2019.4.5, Entities 0.11.1.

    In my current experiment, I create 1,000,000 entities and have a system that checks whether each entity (or the one component it has) is currently active, by checking Time.ElapsedTime against a time interval stored in the component.

    In the Entity Debugger's Systems pane, with 1 million entities, that system takes about 0.02ms. So, to put some "load" on the system, I wrapped the actual code in a loop. Doing it 100 times changed almost nothing (still about 0.02ms). At 1000 iterations, it suddenly went up to 20ms; 500 gave 10ms. So far, so good, even if the jump from 100 to 500 (0.02ms to 10ms) seems odd; probably something "breaks" somewhere between 100 and 500, and from 500 to 1000 it scales linearly. My guess would be that loop vectorization is possible at 100 but no longer at 500 (that's apparently not it, though).

    Then, I realized I had a bug and replaced an if-statement with a bool assignment. Time went back down to 0.02ms (still with the 500 loop, and even when I increased that loop to 1000 iterations).

    So I thought "hm, okay, branching in burst-compiled code is very bad for performance". Except when I added the if-statement back in, time stayed at 0.02ms. Removed the bool assignment (but kept the if-statement), and BOOM, back to 20ms.

    That's a factor-1000 performance hit from removing a simple assignment.

    It turns out that when I include `gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive;`, I get only one "loop not vectorized" warning in the Burst Inspector, on the line of the for-statement. When I remove it, I get two of these messages, both on the line of the if-statement. Changing the number of iterations (100, 500, 1000) makes no difference in that regard.

    Now, I could probably spend the next two weeks trying to figure out all the possibilities, but ... is there documentation that explains which kinds of statements / coding constructs have which performance impact? I did find the Burst User Guide, but from reading it, I don't understand why adding this line improves performance / changes vectorization the way it does. I even tried assigning gem.IsActive != nowActive to a temporary bool that I then use in the if-statement, but that doesn't seem to change anything.

    Here's the code of the system:

    Code (CSharp):
    protected override void OnUpdate() {
        double time = Time.ElapsedTime;
        Entities
            .ForEach((ref GameplayEventMovement gem) => {
                for (int i = 0; i < 1000; i++) {
                    gem.CurrentTime = time;
                    bool nowActive = gem.SpawnTime < time && time < gem.DestroyTime;

                    // *removing* this => performance 1000 times *worse*
                    gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive;

                    // keeping or removing this makes no difference, if above statement is present
                    if (gem.IsActive != nowActive) {
                        gem.SpawnOrDestroyThisFrame = true;
                    }

                    gem.IsActive = nowActive;
                }
            })
            .WithName("CheckGameplayEventActivityJob")
            .ScheduleParallel();
    }
     
  2. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,554
    It would be great if we could see the struct declaration of `GameplayEventMovement`.

    But in this case, I am guessing it may be a bug in the vectorizer when it tries to vectorize a loop whose loop variable `i` is never used. Since every iteration operates on the same `gem`, the correct result should be that the 1000-iteration loop cannot be vectorized, because each iteration may read data that the previous one wrote (gem.IsActive - though time is frozen, so the result must come out the same every iteration).

    Or maybe the vectorizer got into trouble analyzing the inner loop and decided that even the outer loop (the implicit loop you can't see in the source, which iterates over each `gem`; those iterations should be mutually independent) can't be vectorized?

    PS: I think the best way is to keep both versions - the faster one and the slower one without the assignment - write 2 perf tests with https://docs.unity3d.com/Packages/com.unity.test-framework.performance@latest, and just send the project to Unity, so you don't have to debug what is potentially a compiler bug yourself.
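    For reference, such a perf test could look roughly like this (a minimal sketch; the `Measure.Method` usage follows the package docs, but `RunSystemOnce` is a hypothetical helper you'd fill in with your own world/entity setup):

    Code (CSharp):
    using NUnit.Framework;
    using Unity.PerformanceTesting;

    public class GemActivityPerfTests {

        [Test, Performance]
        public void CheckActivity_WithCompareAssign() {
            Measure.Method(() => RunSystemOnce(useCompareAssign: true))
                .WarmupCount(5)        // discard warm-up runs (Burst compilation etc.)
                .MeasurementCount(20)  // then sample 20 measured runs
                .Run();
        }

        [Test, Performance]
        public void CheckActivity_IfOnly() {
            Measure.Method(() => RunSystemOnce(useCompareAssign: false))
                .WarmupCount(5)
                .MeasurementCount(20)
                .Run();
        }

        // Hypothetical helper: create a test World, spawn the test
        // entities, run the system once and complete its jobs.
        static void RunSystemOnce(bool useCompareAssign) { /* ... */ }
    }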
     
    Last edited: Aug 6, 2020
    jashan likes this.
  3. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    That would be

    Code (CSharp):
    public struct GameplayEventMovement : IComponentData {
        public double SpawnTime;
        public double ImpactTime;
        public double DestroyTime;

        public double CurrentTime;

        public bool IsActive;
        public bool SpawnOrDestroyThisFrame;
    }
    CurrentTime is only for testing / debugging purposes (makes things a little more convenient for me in the Entity Debugger).

    Maybe adding that loop for testing purposes wasn't the greatest idea, because it introduced "funny stuff". At least I learned a little about loop vectorization in Burst, but I do wonder what other oddities I'll run into.

    I guess the best approach for learning DOTS is really taking just a single step at a time, profiling on all target devices (this will end up on Windows, Quest, PS4 and probably Linux), and only then taking the next single, small step?
     
    Last edited: Aug 6, 2020
  4. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    Hey jashan, long time...


    One recommendation I have for getting started with DOTS is to not over-optimise.

    With DOTS you get very good performance if you just follow the basic rules:
    * All game code is Burst-compiled (no managed classes in any of the game code, wherever possible)
    * Write parallel code where you can
    Using Entities.ForEach().ScheduleParallel() for most of your code is a really good starting point.

    This gets you really solid performance. It seems like you are able to handle a million entities. Maybe that's a good place to take a step back and say that's totally fine, and you don't need to go much further as a first step? It's probably a ~100-200x speedup compared to what you are used to getting out of the box.

    It looks like the code you posted here is doing that already...


    Now if you want to go further with optimization, that's cool; with DOTS you can go all the way to the limit of the hardware. I think after following just the basic rules, the next step is to really understand how the hardware actually works and to write your code accordingly. To really get a good handle on it, it helps to understand how SIMD actually works.

    This talk is a good intro on how to use the lowest-level APIs that talk to the hardware directly:

    [embedded video]

    Writing with intrinsics isn't necessary to get great performance, but understanding what the hardware does helps a lot.
    The most important thing in relation to your code above is to avoid branches at all cost. If you have branches, the compiler can't unroll / vectorize your code.

    In your code, one way of doing this is like this:


    Code (CSharp):
    gem.CurrentTime = time;
    bool nowActive = (gem.SpawnTime < time) & (time < gem.DestroyTime);

    // *removing* this => performance 1000 times *worse*
    gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive;

    // keeping or removing this makes no difference, if above statement is present
    gem.SpawnOrDestroyThisFrame = math.select(gem.SpawnOrDestroyThisFrame, true, gem.IsActive != nowActive);

    gem.IsActive = nowActive;
     
    Last edited: Aug 6, 2020
  5. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    I would expect that the reason for the 1000x difference is that Burst is extremely good at detecting code that has no side effects and eliminating it completely. I am going to bet that the version with the SpawnOrDestroyThisFrame line is so much faster because Burst can prove that running the loop 1000 times is unnecessary: it can just calculate the value once and skip the loop...
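    In other words (a hand-written illustration of the idea, not actual compiler output): nothing in the loop body depends on `i`, and the inputs don't change between iterations, so the optimizer is free to collapse the loop into a single pass.

    Code (CSharp):
    // What the source says: do the work 1000 times.
    for (int i = 0; i < 1000; i++) {
        gem.CurrentTime = time;
        gem.IsActive = gem.SpawnTime < time && time < gem.DestroyTime;
    }

    // What the optimizer may legally emit instead: every iteration
    // writes the same values, so one pass produces an identical result.
    gem.CurrentTime = time;
    gem.IsActive = gem.SpawnTime < time && time < gem.DestroyTime;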
     
    charleshendry, Shinyclef and jashan like this.
  6. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    A good way of finding out is to open the Burst Inspector and look at the actual generated disassembly. It can help you understand what the compiler actually did with your loop.
     
    Shinyclef and jashan like this.
  7. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    Lastly... using doubles is something you should avoid wherever you can, for performance reasons. That's good baseline advice when writing simple game code.
     
    Shinyclef and jashan like this.
  8. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    Hey Joachim, thanks for chiming in! I remember the conversation we had about this a looong time ago, and it's really cool to finally get to play with it!

    So my initial thought that branching can be a problem in DOTS was right ... and I must never add fake loops just to slow things down a little, because that makes very bad things happen ;-)

    Is my assumption correct that when I stay within ECS/DOTS, instantiating and destroying entities more or less randomly is fine (unlike in the GameObject world, where you really need to pool things or GC will eventually bite you)? In my use case, I'm talking about maybe hundreds of objects at any given time, usually with a lifetime of a few seconds.
     
  9. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    Right now the answer is: it depends...

    If you instantiate lots of the same prefab in one batch, that's very fast. You can instantiate around 50k instances in 2ms:
    var array = EntityManager.Instantiate(prefab, 50000);

    If you instantiate one at a time, that can still add up. It's still orders of magnitude faster than game objects, but it's worth noting that right now the batched path is much better. There is still plenty we can and will do to make the single-instantiate fast path much faster. It's definitely a core goal to make worrying about instantiate cost a non-thing.

    Also important to note that instantiating entities doesn't allocate GC memory and doesn't allocate unsafe memory either. It all goes through pooled memory internally.
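    As a sketch of the batched path (assuming an Entities version of this era, where the batched overload takes an allocator and returns a NativeArray<Entity>; `prefabEntity` stands in for your converted prefab entity, and the code runs inside a system where EntityManager is available):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Entities;

    // One call, one structural change for all 50000 instances.
    NativeArray<Entity> instances =
        EntityManager.Instantiate(prefabEntity, 50000, Allocator.Temp);

    // ... initialize per-instance data here if needed ...

    instances.Dispose();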
     
    NotaNaN likes this.
  10. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,554
    @jashan About your struct: it may or may not be related, but your struct is 8*4 + 2 (a bool is 1 byte) = 34 bytes, which is an unfortunate size, since the common cache line size is 64 bytes. If you could shave off just 2 bytes, you would get two `gem`s in one read. Suppose Burst optimizes nothing and the logic runs exactly as written: the first read is gem.CurrentTime, and if ECS aligns the data, the fetch goes back to the beginning of `gem` (SpawnTime) and pulls in 64 bytes, getting this entire `gem` plus almost all of the next one, but missing its final 2 booleans. So on the next outer iteration, the last few bools of that `gem` are missing, because the previous fetch couldn't reach them. Separating the last 2 bools into a new component type might help if you are keeping the double data type: then one read gets you two `gem`s, and another read gets you many pairs of booleans.
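    A possible split along those lines (component names are just for illustration; ECS stores each component type in its own tightly packed array within a chunk, so after the split the doubles and the bools no longer share cache lines):

    Code (CSharp):
    using Unity.Entities;

    // 3 doubles = 24 bytes: two of these fit in a 64-byte cache line
    // with room to spare.
    public struct GameplayEventTimes : IComponentData {
        public double SpawnTime;
        public double ImpactTime;
        public double DestroyTime;
    }

    // Debug-only value, kept out of the hot data.
    public struct GameplayEventCurrentTime : IComponentData {
        public double Value;
    }

    // The two flags, packed together in their own stream.
    public struct GameplayEventFlags : IComponentData {
        public bool IsActive;
        public bool SpawnOrDestroyThisFrame;
    }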

    About the slowdown: the best way might be to just stare at the assembly, so I tried pasting your job in and viewing it. Looking at the code with the 1000-iteration loop in both versions, both have a routine that assigns 999 and then `dec`s it down one by one, with no logic in between at all, until it can jump out. So we can conclude that in both versions the actual work has been hoisted out of the loop; we can remove the loop and try again.

    [image: b-faster1.png]
    [image: b-slower1.png]

    (Note that the +32 and +33 offsets here are likely the accesses to your two booleans.)

    After removing the loop, we can compare 3 cases:

    - The one with both the compare-assign and the `if` (faster)
    - The one with only the compare-assign (you didn't mention it, but I wanted to try)
    - The one with only the `if` (only the compare-assign commented out; this is the 1000x slower one)

    In all 3 cases the inspector still reports that the loop (the outer one, since we removed the 1000x inner one) could not be understood by the vectorizer. The reason is likely that the contiguous data is too big (4x double + 2x bool) while you assign to only part of it (the last double and the booleans), so it has a hard time using the vector versions of the mov instructions to work on just those parts. It is probably better to separate the final double CurrentTime into its own component, and the 2 bools into another. Maybe it could vectorize then.

    Ignoring the vectorization problem (note that the addressing is for some reason inverted: the earlier screenshots had rax+33, and now rax-9 lands on your 4th double, rax-1 on your bool, and so on):

    [image: upload_2020-8-6_18-37-53.png]

    [image: upload_2020-8-6_18-46-47.png]

    Cases 1 and 2 produce the same code; only the yellow text differs. I think the yellow text just picks some example source lines out for you. `rax` is the last bool, IsActive, where the `setne` happens; `rax-1` is SpawnOrDestroyThisFrame, which is assigned with a `mov`. There is no yellow text mentioning SpawnOrDestroyThisFrame in the first case, but the `mov` is definitely assigning to that variable. So we can conclude that in case 1, the `if` with the `true` assignment inside has been dropped, because it is subsumed by the compare-assign. Case 2 only has the compare-assign, so it produces code identical to case 1. The `add 40` proves that even this faster version is still not vectorized, since it steps to the next entity one at a time. And it is 40 instead of 34, I think, because 34 is an awkward size, so the data after your 2 bools is padded up to a multiple of 8. Anyway, you should do something to get it to 32 bytes or below.

    [image: upload_2020-8-6_18-38-14.png]

    In the last one, I think you effectively forced it to compile the `if` path, because you removed the better compare-assign version. It is longer, and that is likely the source of the slowdown you see. There are tons of `j` (jump) instructions doing crazy stuff; the nowActive clause even appears twice in the yellow text, because it seems to try to be smart by computing it once early and once more inside a loop that advances rax by 40 one entity at a time. (Note that the slowdown can't be a vectorization issue, since it said it could not vectorize the outer loop either way; it must simply be the longer assembly.)

    The first and second cases also have 3 `j` instructions, which is still suspicious, since there is no `if` left once it uses the compare-assign; there should only need to be 1 `je` for the +40 routine to know when to stop. Why are there a `ja` and a `jmp` as well? I found that replacing && with & in the nowActive calculation line, as Joachim said, eliminates them down to the 1 expected jump (!). Together with rearranging your struct to be lean enough to vectorize, I think it will then be fast.
     
  11. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,102
    @jashan I remember you were one of the first people who made an actual working game with Unity, which Unity showcased a bit as well. It was a Qix-like game, IIRC :) Anyway, for what it's worth, we are making a mostly crafting-based big-world game with DOTS, and we have a show system where users create firework-like shows with lots, and I mean lots, of colorful balls. For now they don't have lights, but as soon as URP adds its deferred renderer, they'll have lights too.

    At the moment we create around 50k entities per second, and destroy about 50k as well, with single instantiate calls in command buffers inside Entities.ForEach().ScheduleParallel(), and our frame rate is 30 FPS on a relatively old Core i5 in the editor with safety checks enabled. And we haven't heavily optimized anything yet. All balls have triggers, and there is a job which actually processes all of the triggers and so on.

    I'll write a showcase post soon, and hopefully some generic tutorial-type things on our dev blog and on Gamasutra, but what I wanted to say is that, as Joachim said, it is already really fast without attempting many optimizations.

    The beauty is that once you get used to using native containers to feed one system's output in as another's input, and use linear queries (i.e. ForEach) where possible instead of calling ComponentDataFromEntity too much, it is fast as hell!
     
    charleshendry and jashan like this.
  12. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    Ok, so then it's probably best to instantiate them all when a session begins and just hide them until needed, and then destroy them when the session is over. This is super-helpful to know. The nice thing is that in my use-case, a lot of what is going on is quite deterministic. I also have fairly specific needs when it comes to physics, so I'm also looking at Unity Physics for that.

    This is so cool. I still have some pretty crazy, ugly stuff going on with pooling in my old approach (I've had quite a few sporadic bugs that occurred due to state not being properly reset) ... now I just have to be careful to not be too cautious when rebuilding all this stuff on the new tech stack ;-)
     
  13. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    In fact, I can go much lower than that: This component is really just about figuring out if at the current time, a given Entity should be visible. So SpawnTime, DestroyTime and IsActive are actually enough. And IsActive will also go to another component (see below why I believe that's how I should do it).

    In my old approach, I had a sorted list and just checked whether the next item was ready to be spawned, then incremented the index until all items that needed spawning were spawned, and the items were responsible for their own destruction when their time had come (good old OOP, I will miss you). That was elegant, especially in OO thinking, until I realized I need to move time back and forth (looping, rewinding, jumping to a random point in time).

    What I'm working on now is a system where that kind of "random time access" works. There's a bit more complexity there, because in the case of looping, I'll need to keep track of two times - but that's another story ;-)

    So even ImpactTime is probably irrelevant for that component, let alone the positions. And the idea behind SpawnOrDestroyThisFrame is probably history already: the rationale was to set it while iterating over the items, and then instantiate or destroy items as needed in any given moment. After what Joachim wrote, I'll probably just need IsActive (which will then also determine which of these items I need to calculate positions, and things like effects, for).

    Oh, and also, I changed double to float. I do like double for the precision (and TimeData.ElapsedTime comes in double, and so do my audio times, which must be double for precise looping) - but for the rendering, float should be fine and simplifies a lot of things, plus is much more compact in memory.

    Hehe, I like that. That's also me when working with shaders. I stare at them, and sometimes, they give in. But more often, I do ;-)

    I might actually get away completely without any "ifs" there. In the code I posted above, it was actually "buggy thinking" but I will then need to filter for IsActive in a later step. But for that purpose, it seems that Shared Component data is what I need. I need to check IsActive for every Entity each frame, even though it will only change twice in each session (become active, then become inactive) - unless there's jumping back and forth in time.

    But then, there's a lot of stuff to be done for the (comparatively few) entities that have IsActive set to true.

    I have also read the rest of your posting and appreciate you laying it out like that a lot. I'll admit that it would take me much longer to understand everything you wrote fully - but I picked up a lot of useful information, so that's super-cool. I'm very happy! Thank you!
     
    5argon likes this.
  14. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    Haha, yeah - I remember you, too! It was originally called "JC's Unity Multiplayer TRaceON" and then just "Traces of Illumination". There were much better (and more complete) games made in Unity even at that time, though. I still did a mobile version with some 3rd-party UI system, IIRC ... but never completed all twelve levels (what Valve has with 3, I have with 12, it seems ;-) ). I believe that game got stuck at Unity 2.5 ... or was it 3.5 ... probably 3.5? Then I wrote a book based on the same game (but with only one level, HA!) and realized everything I had done very wrong in my original approach ... that book was finished with Unity 5 (it was one of the two books on Unity in German at that time).

    Very cool!

    Yeah, not over-optimizing is definitely very sound advice! But I feel I really need a basic understanding of what I'm doing in the DOTS approach. I very much loved the abstract world of OO, and I guess one could say my mind pretty much works in OO ... so at the moment, getting into the data-oriented mindset takes a lot of effort.

    But I had quite a few epiphanies even today, so it's fun :)

    Sounds cool :)
     
    Ashkan_gc likes this.
  15. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    Hm ... maybe not? I just found https://gametorrahod.com/everything-about-isharedcomponentdata/ and believe that may have the answer ... but haven't quite digested all of it, yet ;-)
     
  16. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,102
    The way it started to work for me was:
    Don't think of the problem as "what objects from the real world, with what properties and behaviors, should I use to model this?" Instead, think of it as "what is the data I need (i.e. a set of entities and their instantiation and destruction times), and what do I need to do to it to get the result I want? How does it interact with other data transformations?"

    Watching a few of Mike Acton's data-oriented design talks helps a lot too. Imagine he is screaming them in your face :) Then don't resist; dive in and reflect on it as you move forward.

    Regarding Burst: take a look at Intel's x86 manual or some assembly tutorials to learn (or re-learn) the properties of processors and how they work, if needed. I really did read a good portion of Intel's manual.

    The OOP vs DOD approach is kind of like pure rational philosophy vs experimental science: in philosophy you try to create a mental model of the world, and if you are sane, you check how it holds up in the real world and how much the real world lends itself to it; experimental science just observes what happens in the world and is only concerned with what affects what and how to improve it.

    By looking at the real things, you get these advantages:

    1- There is no mental model to learn other than the actual thing happening.
    2- It is easier to modify, since it doesn't rely on any additional models unrelated to the actual problem.
    3- It has higher perf, because it is aligned with the real world/hardware/running environment.
    4- It is easier to maintain because of the above: when you explore something for the first time (or after a month), you need to understand how it can go wrong and how it can be changed for the better, and not having to decode an additional model on top makes that less scary and more controllable.

    I wrote this in case you needed more encouragement :)
     
    SenseEater, NotaNaN and jashan like this.
  17. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    If I have understood you correctly here, I think that `ISharedComponentData` is not the right tool. As I understand it, you are talking about storing a single bool's worth of information: an isActive flag on each entity. I know of two main approaches to this.

    The first is a tag component, which is just an empty struct: `struct IsActive : IComponentData {}`. You add this component when you want the value to be true, and remove it when you want it to be false. This gives you zero-cost querying: your systems can do `Entities.WithAll<IsActive>()` and `Entities.WithNone<IsActive>()` to filter. The act of adding and removing the tag does have some cost, though; these "structural changes" require a sync point.

    The other is a simple `struct IsActiveValue : IComponentData { public bool Value; }`. This doesn't allow efficient querying, so every system has to process every entity and check the bool. That's a lot of extra processing, but it allows you to change the value without a sync point.

    The first way is usually better, because it allows systems to accurately query for only the data they need to touch, and it is a common pattern. It's the semantically beautiful approach. The second way may be better in unusual circumstances: where you're changing the flag very rapidly, where the flag is almost always true so querying wouldn't filter many entities out, or where you're consuming the flag from so few systems that the structural-change overhead isn't justified. Personally, I would always default to the first option, and consider the second an optimisation for special circumstances, when needed.

    We're all still working out the best practices, though.
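    To make the two patterns concrete, here is a minimal sketch of what the consuming systems could look like (the system names and the Translation-based work are illustrative, assuming the IsActive and IsActiveValue structs above):

    Code (CSharp):
    using Unity.Entities;
    using Unity.Transforms;

    // Tag pattern: the query filters at the archetype level, so the
    // lambda only ever runs on entities carrying the IsActive tag.
    public class TagConsumerSystem : SystemBase {
        protected override void OnUpdate() {
            Entities
                .WithAll<IsActive>()
                .ForEach((ref Translation t) => { t.Value.y += 0.01f; })
                .ScheduleParallel();
        }
    }

    // Bool pattern: every entity is visited, and the flag is checked
    // per entity inside the job.
    public class BoolConsumerSystem : SystemBase {
        protected override void OnUpdate() {
            Entities
                .ForEach((ref Translation t, in IsActiveValue active) => {
                    if (!active.Value) { return; }
                    t.Value.y += 0.01f;
                })
                .ScheduleParallel();
        }
    }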
     
    jashan likes this.
  18. exiguous

    exiguous

    Joined:
    Nov 21, 2010
    Posts:
    1,749
    Joachim Ante said in another thread that it's important to avoid branches whenever possible. So this bool has to be checked in a system, which leads to slower code by default. Maybe it is better to avoid this "pattern" altogether and find a better solution to the problem?

    That's the "problem" here: as early adopters, this "burden" is placed on our backs. But it's also interesting and exciting. It just requires a lot of time and effort to study.
     
    jashan likes this.
  19. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    Right, this is part of why the bool pattern is generally slower than the tag pattern. Empirically there are some situations where the bool comes out ahead, but generally it's best to use tags. If you can merge the structural change into an existing ECB sync point, there's very little cost to it. I believe the ECS has some specific optimisations to make tags fast. They're the standard approach, and are used throughout the engine.

    It's worth noting that it's sometimes possible to do the bool pattern without branches. For example, an `IncrementIfEnabled` job body could be written as `data.Value = select(data.Value, data.Value + 1, data.IsEnabled)`.
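    Fleshed out into a complete system, that could look like this (a sketch; the Counter component and the system name are made up for the example):

    Code (CSharp):
    using Unity.Entities;
    using Unity.Mathematics;

    public struct Counter : IComponentData {
        public int Value;
        public bool IsEnabled;
    }

    public class IncrementIfEnabledSystem : SystemBase {
        protected override void OnUpdate() {
            Entities
                .ForEach((ref Counter data) => {
                    // math.select(falseValue, trueValue, condition):
                    // both candidates are computed, one is picked,
                    // and no branch is emitted.
                    data.Value = math.select(data.Value, data.Value + 1, data.IsEnabled);
                })
                .ScheduleParallel();
        }
    }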

    That's the burden we all volunteer for when choosing to play with preview features. I do agree that better centralised documentation would be good for everybody. A lot of what I know I only learned from stumbling across an obscure forum thread. Unity would probably get better feedback if people were more able to use these features without such a burden.
     
    jashan and exiguous like this.
  20. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    That sounds good. At the moment, at least on my PC, it looks like I might actually be best off, performance-wise, keeping it simple and just processing everything all the time. The reason is that the number of entities hardly seems to make any difference (with 200 or with 1,000,000 entities, the difference is between 0.03ms all the time and 0.03ms most of the time, occasionally going up to 0.04ms, and very rarely to 0.05ms).

    If, on the other hand, I split the processing into two jobs (the first determines which items are currently active, the second does the actual processing), then even with only 200 entities I usually get 0.05ms, sometimes 0.04ms, sometimes 0.06ms. And when I ramp it up to 1,000,000 entities, I seem to get the exact same performance (and I'm not even filtering the items in the second step - just noping out if an item isn't active via if (!item.IsActive) { return; }).

    But this is on a very powerful desktop. So I'm afraid I really need to move quickly to testing this also on mobile and consoles. Also, at the moment, the processing is still quite simple, and it will eventually become a little more complex.

    So one question about the tag component: Is there an easy way to add/remove the component "naively"? In other words, can I just add the component and if it's already there, it will be fine? Or remove the component, and if it's not there, all will be fine? Or do I have to

    a) find out is it active or not
    b) find out if it currently has the component attached or not
    c) proceed accordingly

    Since c) will happen in a command buffer, I guess that already answers the question because the fewer commands that command buffer has, the better, right?

    But in that case, I wonder if adding an IsActive property back for quicker tests (at the cost of larger chunks) might be reasonable. Ok, forget it - just thinking out loud, it's actually much simpler than that:

    a) for all inactive items (WithNone): are there any that need to become active?
    b) for all active items: if no longer active, drop out and remove active in command buffer, otherwise, do processing

    So in that case, as long as the overhead of having two jobs outweighs the cost of the actual processing, I'll just run over all items; and when the actual processing becomes too costly, I'll fall back to the a/b approach, where the second step only runs over the items that actually need processing.
     
  21. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    La la la. That's the case when I look at what's going on in the main thread, which is apparently what the Entity Debugger shows for the systems as well. I kind of expected the Entity Debugger to show me the actual work ;-)

    So, as a feature request, I'd love to have the Entity Debugger show the following (I'm very happy that we already have these via the Timeline view in the Profiler):

    a) Main thread time (the only thing it shows right now)
    b) Max thread time (i.e. the time the "slowest" thread consumes)
    c) Actual work time (the time of all threads combined)

    All three of these values are quite important, for different reasons. I can kind of figure out b) by looking at the Timeline in the regular Profiler, and using the Timeline I get a reasonably good idea of c) as well. [EDIT: Actually, the Timeline shows c) already.]

    Oh well - and it even says "main (ms)" in the Entity Debugger - so forgive my ignorance.

    So, the actual work goes down to 0.00 when I only have 200 entities. At 1 million, I get about 0.6ms of max thread time with two jobs (but not optimized, so both jobs run over all entities), and about 0.5ms when using just a single job. That's on my PC. On PS4, I get 2.5ms with two jobs, 1.42ms with a single job.

    While I'll probably never have to deal with 1 million entities in my project, it's probably better to eat the overhead and process only what's necessary.
     
    Last edited: Aug 7, 2020
  22. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    So, while I'm at it ... I did 100000 entities with my "single job" approach and compared my PC and the Oculus Quest.

    One really interesting thing is that the Quest is "all over the place" in terms of performance. This is a super-simple test project without any rendering, just the ECS stuff, and on the main thread I get 0.33ms, 0.03ms, 0.09ms, 1.05ms: values spanning a factor of 30, without any rhyme or reason (at least none I could parse from the profiler).

    In the worker threads, it's a little more consistent: the total time is usually around 1ms. There's significant variation (2x) in the longest thread per job (I get 0.43ms, 0.71ms, 0.351ms), but that variation makes a lot of sense, because each frame seems to have a wildly different thread distribution (IIRC, sometimes it's just 2 threads, sometimes three).

    One thing that's unfortunate for DOTS is that the Quest only has four cores. I really wish they hadn't created such an underpowered device ... but the Burst compiler and "performance by default" will certainly still help a lot there.

    Anyway, on the PC, I get my usual 0.03ms on the main thread, and 0.06-0.07ms on the worker threads (max time in one frame over all jobs ... though I'm not sure I always caught the max). The total time there is 0.607ms, so interestingly not that much faster than the Quest, but spread over 16 workers instead of 4.
     
  23. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    This is the right approach. Query for inactive things that may become active and check whether they do; query for active things that may become inactive and check whether they do. Both of these should write to the same ECB so that you only have one sync point. Try to merge this into an existing sync point if you can - things can often be delayed until the end of the frame.
     
    jashan likes this.
  24. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    This is probably premature optimisation at this point, but I should add that there are also hybrid approaches. For example, you could check once per second to find all of the entities which will expire within that second, tag those, and then iterate over all of them each frame with the branch (see the sketch below). That makes the structural changes rare but still lets you query for a much reduced subset of the data.

    You probably shouldn't consider adding this complexity unless you know that you need it. Just do the semantically sensible thing that produces clean code.
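    A rough sketch of that hybrid, reusing the GameplayEventMovement fields from earlier in the thread (the ExpiringSoon tag, the one-second window and the surrounding system are all illustrative; `time` is the current elapsed time captured outside the lambdas, and `ecb` is assumed to be an EntityCommandBuffer.Concurrent from an ECB system):

    Code (CSharp):
    // Runs once per second: tag everything that will expire within
    // the next window, so structural changes stay rare.
    double windowEnd = time + 1.0;
    Entities
        .WithNone<ExpiringSoon>()
        .ForEach((Entity e, int entityInQueryIndex, in GameplayEventMovement gem) => {
            if (gem.DestroyTime < windowEnd) {
                ecb.AddComponent<ExpiringSoon>(entityInQueryIndex, e);
            }
        })
        .ScheduleParallel();

    // Runs every frame, but only over the small tagged subset,
    // with the branch on the actual expiry time.
    Entities
        .WithAll<ExpiringSoon>()
        .ForEach((Entity e, int entityInQueryIndex, in GameplayEventMovement gem) => {
            if (time >= gem.DestroyTime) {
                ecb.DestroyEntity(entityInQueryIndex, e);
            }
        })
        .ScheduleParallel();

    // Elsewhere: public struct ExpiringSoon : IComponentData {}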
     
    jashan likes this.
  25. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    I'm not sure about having just one sync point: if my understanding is correct, it would mean that entities that just became active might only get their position updated in the next frame, which would be trouble.

    Here's my current system ... I believe I should probably change this to BeginSimulationEntityCommandBufferSystem, because everything I'm doing here basically "sets the stage", and everything else, including physics, happens later (I might eventually have to put this into the "physics loop", but that's another story):

    Code (CSharp):
    public class GameplayEventsActivitySystem : SystemBase {

        private EntityCommandBufferSystem barrier
            => World.GetOrCreateSystem<EndSimulationEntityCommandBufferSystem>();

        protected override void OnUpdate() {
            var time = (float) Time.ElapsedTime;
            var commandBuffer = barrier.CreateCommandBuffer().ToConcurrent();

            Entities
                .WithNone<GameplayEventIsActive>()
                .ForEach((Entity entity,
                          int entityInQueryIndex,
                          in GameplayEventLifeTime gelt) => {
                    bool isActive = gelt.TimeSpawn < time && time < gelt.TimeDestroy;

                    if (!isActive) { return; }

                    commandBuffer.AddComponent<GameplayEventIsActive>(entityInQueryIndex, entity);
                })
                .WithName("GameplayEventCheckActiveJob")
                .ScheduleParallel();

            barrier.AddJobHandleForProducer(Dependency);

            Entities
                .WithAll<GameplayEventIsActive>()
                .ForEach((Entity entity,
                          int entityInQueryIndex,
                          ref GameplayEventMovement gem,
                          in GameplayEventLifeTime gelt) => {

                    bool isActive = gelt.TimeSpawn < time && time < gelt.TimeDestroy;

                    if (!isActive) {
                        commandBuffer.RemoveComponent<GameplayEventIsActive>(entityInQueryIndex, entity);
                        return;
                    }

                    float timeFrac = ((time - gelt.TimeSpawn) / (gem.ImpactTime - gelt.TimeSpawn));
                    gem.PositionCurrent = math.lerp(gem.PositionStart, gem.PositionImpact, timeFrac);
                })
                .WithName("GameplayEventMoveJob")
                .ScheduleParallel();

            barrier.AddJobHandleForProducer(Dependency);
        }
    }
    I have just done a test with 100000 entities, of which 400 are "active" at any given time, and it looks like I'm good: even on the Quest, I'm only getting 0.13ms max in the main thread, < 0.3ms for the CheckActiveJob, and < 0.013ms for the MoveJob.

    One thing I did think about, similar to what you suggested in your second posting, was to reduce the total number of entities by instantiating / destroying timeframes. But that would add a lot of complexity and potentially even cause noticeable hiccups during gameplay ... and I did my tests with 100000 entities, while I don't think I'll often go over 10000 entities in actual gameplay.

    One thing that is kind of funny: the total time (work from all threads combined, per frame) is actually noticeably lower on the Quest. I'd wager this is due to having significantly fewer cores / threads available, so there's a lot less overhead. But still interesting.
     
  26. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    At the absolute minimum, if you add IsActive to some entities and remove it from others in the same frame, that should be a single ECB. After that, just find the latest ECB that works for your needs. What you've got there looks like it's probably okay. It's sometimes reasonable to do it at the end of the frame and introduce a single frame of delay - often it doesn't matter for gameplay if an entity despawns 1/60th of a second later than scheduled.
     
    jashan likes this.
  27. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,631
    I'm going to provide a bit of an opposing opinion here - lots of tag components are not as good as you might think in a large real-world project.

    The problem is that they increase the archetype count. As the archetype count increases, the cost of querying goes up noticeably, and it actually becomes a pretty significant real-world problem. We were seeing very bad degradation of performance over time because of this.

    I finally got around to spending some time analyzing the problem, and I just spent 2 weeks doing nothing but reducing the archetype count in our project (from around 20k to 7k). The results?

    Before:
    PC
    FPS at start: ~140
    FPS after 20min: ~80

    Xbox One
    FPS at start: ~38
    FPS after 20min: ~20

    After removing a bunch of archetypes
    PC
    FPS at start: ~140
    FPS after 20min: ~125

    Xbox One
    FPS at start: ~38
    FPS after 20min: ~34
     
    Last edited: Aug 7, 2020
    SenseEater, Nyanpas, Timboc and 2 others like this.
  28. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,631
    Since I love data, I managed to find the profiles I took for this. Here is a comparison of the cost of systems simply querying whether they should run, 1 min into the game vs. 20 min in, before the archetype optimization:

    [image: upload_2020-8-8_8-7-32.png]
     
    jashan likes this.
  29. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,554
    I wonder whether ECS at full release will guarantee complete freedom in declaring component types, or whether this is inevitable. I get that nothing is unlimited, but we came so far in making performance visible while programming with DOD, and yet we get this new, invisible regression risk.
     
    SenseEater and jashan like this.
  30. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    That's fascinating @tertle, I've never heard of this problem. To clarify, is this proportional to the total archetypes live at any one time, or to the total archetypes that have ever existed?

    If you just mean that having many tags on one entity often leads to them diverging and splitting into many archetypes, then I completely understand that. If there's some other hidden effect to do with the ECS index here then that's new to me.

    If there are any other threads on the topic of that slow down I'd be very interested to see a link. That looks like it warrants its own discussion.
     
  31. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,631
    Ever existed. The problem is that archetypes are never cleaned up. Some basic garbage collection of archetypes on Unity's side would make this less of a factor (but probably not solve it).

    Here are the results of my work (note that this covers only 280s, not the 20 min stated above).

    ## BEFORE
    Archetypes, 4421, ElapsedTime, 10.2148427963257
    Archetypes, 7293, ElapsedTime, 20.2054958343506
    Archetypes, 9295, ElapsedTime, 30.1635055541992
    Archetypes, 11696, ElapsedTime, 60.1769142150879
    Archetypes, 14328, ElapsedTime, 120.147079467773
    Archetypes, 15204, ElapsedTime, 180.135467529297
    Archetypes, 15882, ElapsedTime, 280.127105712891

    ## AFTER
    Archetypes, 2274, ElapsedTime, 10.1827993392944
    Archetypes, 3533, ElapsedTime, 20.114933013916
    Archetypes, 3974, ElapsedTime, 30.1094818115234
    Archetypes, 4728, ElapsedTime, 60.1242599487305
    Archetypes, 5515, ElapsedTime, 120.16480255127
    Archetypes, 5804, ElapsedTime, 180.167953491211
    Archetypes, 6088, ElapsedTime, 280.145294189453

    -edit- the process

    The first thing I did was make sure entity creation set proper archetypes; I also code-genned the archetypes for each of our actors and merged some components, etc.

    #AFTER
    Archetypes, 3370, ElapsedTime, 10.3936109542847
    Archetypes, 6203, ElapsedTime, 20.1221904754639
    Archetypes, 7422, ElapsedTime, 30.114049911499
    Archetypes, 9959, ElapsedTime, 60.1656608581543
    Archetypes, 11965, ElapsedTime, 120.112945556641
    Archetypes, 12713, ElapsedTime, 180.11003112793
    Archetypes, 13362, ElapsedTime, 280.147125244141

    An improvement, but still not great.

    The biggest change I made was removing the specifically bad tag components and polling instead in certain cases (mostly related to pathfinding, as I realized those tags existed in basically all states, effectively multiplying archetypes).

    #AFTER
    Archetypes, 2527, ElapsedTime, 10.1928024291992
    Archetypes, 4153, ElapsedTime, 20.1456832885742
    Archetypes, 5122, ElapsedTime, 30.1286010742188
    Archetypes, 6742, ElapsedTime, 60.1411361694336
    Archetypes, 7806, ElapsedTime, 120.105430603027
    Archetypes, 8268, ElapsedTime, 180.141937255859
    Archetypes, 8457, ElapsedTime, 280.140655517578

    Then, for the rest, I wrote my own ECB-like structure (using my event system) that can apply a combination of Add/Remove components in a single archetype change, and applied it to our AI states.
     
    Last edited: Aug 8, 2020
  32. SamOld

    SamOld

    Joined:
    Aug 17, 2018
    Posts:
    329
    Wow. Definitely make a dedicated thread to discuss this, @tertle! I obviously haven't put much thought into what can be done about it yet, but that kind of degradation over time seems like it could be a massive problem. I would hope it can be treated as a performance bug! It's been a while since I've thought about the low-level ECS implementation, but maybe some kind of cleanup or compaction process could keep this under control over time.
     
    jashan likes this.
  33. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,102
    Guys, Unity is adding an enabled flag to all components, which should help a lot with this archetype issue.

    @tertle do you also instantiate your entities without using prefabs and add components to them one by one? I'm asking because you said you generated code for it.

    What I think Unity can and will do is minimize the impact of empty archetypes, by moving them to the end of the archetype list or keeping a higher-performance index of archetypes for queries at release. They had perf issues when enabling/disabling things via tag components in the open-world third-person shooter demo, as @Joachim_Ante said in a thread, and that's when they added the enabled flag to components.

    Btw, thanks again @tertle for sharing the info.
     
    jashan likes this.
  34. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,631
    This project is quite old; it pre-dates Entities but was partially converted about 1.5-2 years ago when the scope expanded (note, though, that I was only brought on 8 months ago, so I was not involved in that process/decision or any of the early development of the project).

    Because of its age, scope and early adoption, it has a lot of quirks. The client uses a hybrid solution with a lot of legacy code (which I mostly avoid and let others deal with ;)), but the server is pure Entities.

    However, even these pure entities are generated from settings files rather than prefabs, due to the tooling that existed from before. From what I was told, the team at the time thought that the scene conversion workflow was only a stopgap or something.

    Anyway, these entities weren't generated purely by adding components one at a time. Instead, actors shared a base archetype (health, networking components, etc.), and any component specific to an actor was added depending on settings. Just to remove pointless archetypes, I recently changed this to one-click code-genned archetypes based on the current settings files in the project, so no components need to be added afterwards.

    (This is definitely not how I recommend people use Entities in their own projects - all my personal projects use subscenes/conversion/etc. - it's just the quirks of dealing with old projects.)

    Yeah, we were kind of hoping for this, but unfortunately, since we're releasing sooner rather than later and want to stick to 2019.4, it won't help our existing project until at least a post-release update.
     
    Last edited: Aug 8, 2020
    florianhanke and jashan like this.
  35. exiguous

    exiguous

    Joined:
    Nov 21, 2010
    Posts:
    1,749
    This is not a bad thing at all. We are all trying to learn, experiment with, and adapt to a new technology, so new insights are always valuable!

    Do you mean over the runtime of the game, or over development time?
    What I'm wondering is whether the queries can't somehow be cached internally by Unity. If a certain archetype did not match a certain (cached) query in the last frame, how could it match now? At least queries which look for certain components could somehow reduce the number of archetypes they have to look at. Or do I understand the issue completely wrong?
    But as you say, you do a lot of adding and removing of (tag) components. I was thinking more of a scenario where archetypes are largely predefined/static and not EVERY combination of components is valid. And that unused archetypes are never cleaned up is concerning. I wonder whether that is by design for some reason, or just an oversight which will be fixed some day.

    Did you just remove some tags, or did you replace them with another pattern/workaround to maintain their functionality?

    Do you know why the FPS degrades? Do you believe it is solely down to the archetype count from your tags, or are there other reasons?

    When you say they should be cleaned up, do you mean some archetypes are unused by now? But if their chunks are empty, it shouldn't affect search performance much, since Unity presumably checks whether there are any entities there at all. So I wonder if this is really the reason for your performance drops, at least at that magnitude.

    AFAIK archetypes are defined per World. Would it help to do such stuff (pathfinding) in another World and destroy that World from time to time? Would this reset the archetype count?

    Anyway, this is really a great discussion. Thanks to everyone contributing to it.
     
  36. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,631
    Runtime.

    The cost is just querying whether there are any entities in each archetype.

    https://forum.unity.com/threads/bur...-behaviour-documentation.946743/#post-6181533

    It was almost purely the archetype count. Changing no logic except reducing archetypes fixed the issue.

    That's the thing, though. If you look at this picture, the cost of simply checking whether any entity exists is what goes up.
    https://forum.unity.com/threads/bur...-behaviour-documentation.946743/#post-6181437

    The act of pathfinding doesn't get slower at all. It's the overall time to execute systems; it's just that the pathfinding components were creating a huge number of additional archetypes.
    Side note: I've actually scheduled pathfinding and avoidance in a way that makes them pretty much free in our project; they complete entirely while waiting on the GPU to render the frame. I can schedule more than 10x as many entities for pathfinding without changing FPS at all.

    -edit-

    Actually, you've reminded me: there was also a bug in Addressables I found while profiling this that hurt performance on consoles a little over time (it barely affected PC), so you could probably add 1-2 FPS to the degraded results to be fairer. I don't recall whether I ever reported this to Unity, so I should probably go file a bug report for it.
     
    Last edited: Aug 8, 2020
    NotaNaN, exiguous and jashan like this.
  37. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,102
    @tertle How much do the specific idiosyncrasies of your setup add to the archetype count? I get that if you have tags for every possible state, then with 5 tags you theoretically have up to 2^5 = 32 archetype combinations, but I want to know how much of this was specific to your setup and how much of it can really happen to others. Again, I'm not saying the problem doesn't exist, but your case seems a bit extreme. I want to find out how much I should be afraid of this in our own game. So far we are nowhere close: even with, say, 10 tag components on the player, there are fewer than 20 valid combinations, so it doesn't grow that fast.
     
    SamOld likes this.
  38. jashan

    jashan

    Joined:
    Mar 9, 2007
    Posts:
    3,306
    Turns out this is not feasible in Unity 2019.4 when using the legacy render pipeline, because Hybrid Renderer V1 has no way of switching off renderers, and V2 does not work with the legacy render pipeline. I read somewhere that eventually there will be an "enabled" state for entities, which would be nice. But for now, I have to instantiate / destroy those entities.

    This is probably the better way anyway, because otherwise there's a lot of stuff going on, also with all the transform operations (each "renderer" is a hierarchy of objects, there's physics ...), so only keeping alive the ones that are currently needed is the reasonable thing to do regardless.

    There's no noticeable overhead from instantiating / destroying them. I'm currently testing with 100000 total entities, spawning 400 per second, each with a lifetime of one second, so 400 are visible at any given time.

    On the Quest, I still get 72 FPS; the CPU is at 20%; the check-active job (which iterates over all 100000 entities) takes 0.43ms per thread, 1.1ms total; and the move job (400 entities) is kind of hard to find in the timeline, taking 0.011ms per thread, 0.026ms total.

    The entity command buffer that handles my instantiation / destruction does take 0.25-0.3ms per frame, so I guess that's my overhead ... but creating/destroying 400 entities per second is much more than I'll ever need. And that's on the Quest, the lowest-end hardware. Also, the GPU is at 90% in that test scenario (two meshes per entity, 400 entities), so there's my real limit.

    When I only do 50 per second (which is still a lot more than I'll usually need), the GPU goes to 40%, which is more reasonable, and the command buffer goes down to 0.02ms. Also, the move job goes down to 0.001ms, and the max total time is around 0.01ms.

    Funny thing I ran into: Unity 2019.4 with the old VR approach can't do Single Pass Instanced on the Quest. The new VR approach can only do Single Pass Instanced (and not Single Pass). In this simple test project, I can do both on PC, but I can't measure a performance difference (I looked at the GPU and at Camera.Render on the main thread and render thread ... there might be about a 0.05ms difference on the main thread, but it's hard to say ... it's usually around 0.45ms there, and might go up to 0.5ms with Single Pass compared to Single Pass Instanced).
     
  39. Nyanpas

    Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    Whoa, this was unexpected. Good to know, as I have started the transition to DOTS for my game. I assumed that what Joachim Ante stated earlier about GC and pooled memory would hold true for this as well. (I tried to find the post but can't seem to find it...)
     
  40. exiguous

    exiguous

    Joined:
    Nov 21, 2010
    Posts:
    1,749
    this one?
     
    Nyanpas likes this.
  41. Nyanpas

    Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    exiguous likes this.
  42. exiguous

    exiguous

    Joined:
    Nov 21, 2010
    Posts:
    1,749
    I did not "search" for it. I just have a text file where I store links and quotes from posts that I find important for learning ECS. It's just frustrating when I remember that I read something, somewhere, at some point about a topic, but can't recall the specifics. So this text file serves as an "extension" of my brain. ECS is just too much information to keep in my head for now. Especially the new syntax is difficult for me, so it's better to look things up.
    Anyway, I'm glad it helped you.
     
  43. Nyanpas

    Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    I have a text file too, at 57 pages now, but somehow this was not in it... It's probably because I also have several notes in it stating that DOTS/ECS is still experimental and that a lot of the links in there describe already-deprecated ways of doing things, and that I need to clean it up.