Comparing different approaches for Events in DOTS

Discussion in 'Entity Component System' started by PhilSA, Apr 14, 2022.

  1. Krajca

    Krajca

    Joined:
    May 6, 2014
    Posts:
    347
I think even the Nordeus demo has at most a couple of thousand events per frame, which isn't that much. But if you gain 1-2ms per frame, that means you can cram more features on top of what you already have. We should also consider multiplayer servers: players spread across different parts of the map multiply the amount of work to be done, and computation on the server side isn't cheap.
So yeah, maybe we're overthinking the event system here, but it really depends on our use cases.

    @PhilSA Thank you for your work. It's very inspiring. It provides much-needed samples of how to do something in DOTS that isn't "create an entity for it".
     
    JesOb and lclemens like this.
  2. lclemens

    lclemens

    Joined:
    Feb 15, 2020
    Posts:
    714
My game isn't open-source, but I'd be more than happy to send you some of the relevant classes and pass on some of the things that worked well and didn't (I tinkered with a couple of architectures that failed miserably). When you're ready, shoot me a PM.
     
    Opeth001 likes this.
  3. toomasio

    toomasio

    Joined:
    Nov 19, 2013
    Posts:
    195
Just popping in to say this thread was very helpful for me. I have a really hard time wrapping my head around creating an efficient "Reactive Event Loop" where the entire event chain can be processed in one frame, multithreaded and Bursted, in a DOD environment... without passing through multiple "EventSystems" doing the exact same thing. Similar to a modular "Visual Scripting" environment, where the modular pieces won't always be in the same order. The only thing I have seen so far that hacks around this is @PhilSA 's poly components. The only other solutions I can think of are massive struct components and possibly referencing blobs via IDs. If anyone here who is hardcore DOD can come up with a solution for this that doesn't involve a next-frame queue, please let me know! :)
     
  4. lclemens

    lclemens

    Joined:
    Feb 15, 2020
    Posts:
    714
    I'm far from an expert so I'm just taking a stab in the dark here, but if you absolutely need the reactions to events in the exact same frame that generated them, perhaps you could do something like what I did with my splash damage system?

[attached image: upload_2022-4-22_17-37-7.png]

    I made a new system group that happens AFTER the end-simulation event queue plays back... then my ImpactSplashSystem will see those events and do whatever it needs to do. In the case of "damage events", if you put the system that handles damage events (or other systems that need to react to an event) into that After-Sim group, the events would be queued and processed all in the same frame. I think it would work with either ECBs or the NativeStreams. One downside is that if your systems require special ordering it could get messy and you might end up in a catch-22 where a system needs to run both before and after the queue playback.
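As a rough sketch of that grouping (all type names here are illustrative, not from an actual project, and the attribute placement is approximate for Entities 0.x-era APIs), the idea might look something like this:

```csharp
// Hypothetical sketch: a group that updates after the end-of-simulation ECB
// playback, so its systems see this frame's queued events in the same frame.
[UpdateInGroup(typeof(SimulationSystemGroup))]
[UpdateAfter(typeof(EndSimulationEntityCommandBufferSystem))]
public class AfterSimulationEventGroup : ComponentSystemGroup { }

// Systems that react to events (e.g. damage events) go in that group:
[UpdateInGroup(typeof(AfterSimulationEventGroup))]
public partial class ImpactSplashSystem : SystemBase
{
    protected override void OnUpdate()
    {
        // Event entities created via the end-simulation ECB earlier in the
        // frame have already been played back by now, so reacting to them
        // happens in the same frame they were generated.
        Entities.ForEach((in DamageEvent damageEvent) =>
        {
            // ... spawn splash effects, apply area damage, etc. ...
        }).Schedule();
    }
}
```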

    Another idea... tertle has a version 2 of his event systems that can apply events in the same frame, but I think it requires the sender to be in the same system group as the receiver.

    I suspect that no matter what approach is taken, trying to generate and process events in the same frame will come with some restrictions on system ordering. But that's just a theory... I don't have proof and I could easily be mistaken.
     
    toomasio likes this.
  5. toomasio

    toomasio

    Joined:
    Nov 19, 2013
    Posts:
    195
Yeah, this is essentially the furthest I have gotten in solving this issue... besides using poly components: having a processor group and a reaction group running the systems needed. Unfortunately, this does not reliably work if you are trying to do a designer-first approach, where your team might need to move modular puzzle pieces around and the event reactions might loop all the way back to the initial event type again. For example, a damager causing damage... then the reaction to taking damage is to reflect damage back to the sender. You'd have to make multiple copies of the same systems if there was ever a looping situation. Maybe my brain is thinking too OOP to figure this out.
     
  6. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,926
    EDIT: as I kept expanding on my reply, it became less and less a direct reply to the point you made, and more and more a bunch of random thoughts on attribute systems. Sorry about that
    ___

    I think the difficulty is when the processing of one event can lead to the creation of a new event. And that chain reaction can go on up to X amount of times, but it must all be done in the same frame. Depending on the amount of different jobs responsible for event processing, the # of jobs you need to schedule for a worst-case scenario could rapidly go out of control.
    • Ex: you have 20 event-processing jobs, and some events in there can create other events. You need to put your 20 jobs in an EventUpdateGroup, and run that group X times, where X is the maximum number of times an event can create another event in a frame. You end up having to schedule X*20 jobs "just in case"
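One way to picture that worst-case scheduling (purely illustrative; not code from the thread):

```csharp
// Hypothetical sketch: a ComponentSystemGroup whose contents are re-run a
// fixed number of times per frame to cover the worst-case event chain depth.
public class EventUpdateGroup : ComponentSystemGroup
{
    // X: the maximum number of times an event can spawn another event.
    const int MaxEventChainDepth = 4;

    protected override void OnUpdate()
    {
        // With 20 event-processing systems in this group, this schedules
        // up to MaxEventChainDepth * 20 jobs "just in case".
        for (int pass = 0; pass < MaxEventChainDepth; pass++)
        {
            base.OnUpdate();
        }
    }
}
```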
    It's the kind of problem you'd face when making a "RPG Attributes with Buffs system", where the value of an attribute can depend on the final buffed value of another attribute (MagicResist is calculated based on the buffed Intelligence value). I often bring up this RPG Attributes use case in these discussions, because I really think it's a good example of a problem that is very hard to solve with a typical ECS mindset. This problem involves a mix of ordering, reactive behaviour & polymorphism. Each one of these things can be tricky to deal with in ECS when we have an OOP background, but when we have all 3 at the same time, it's a whole new level of difficulty

    I can think of two solutions in this situation. For each approach, assume each attribute has its own entity (because it must hold a buffer of buffs affecting it, and it must be referenceable) :

Have a pre-determined order of attribute types for recalculation (Constitution, then Defense, then DamageReduction, etc.). Then schedule one attribute-recalculation job for each attribute type, in that order. Each job uses change filtering on an "AttributeMustRecalculate" component to determine whether we recalculate or not. So if you have an ActionRPG-style game, you'll likely be scheduling 40-60 change-filtering jobs per frame; one for each attribute type in the game. And if attribute recalculation is part of network prediction (which it will be in most games that use prediction), you'll have to multiply this by 2-10 times depending on network conditions. So you'll be scheduling up to (10 * 60) = 600 change-filtering jobs per frame. That means probably ~0.5-1ms of your frame budget will be gone just on change polling for attributes, even when no attributes need recalculation.

Overall, this solution would make me uncomfortable. Especially since its cost will scale with the number of attribute types, while attribute recalculation will not be that frequent. So it's an enormous constant price to pay for such a rare occurrence.

    To make matters even worse, I've encountered situations where I needed to trigger an attributes recalculation multiple times per frame:
    1. before weapon updates, in order to have up-to-date weapon firing rate attributes
    2. after projectile prefab spawning, in order to make sure newly-spawned projectile attributes are calculated before they detect hits (potentially in the same frame they were spawned)
    3. sometimes even a third time when the projectile spawns something else on hit, that also has attributes that must be recalculated before it runs some logic
    That means we multiply our 600 jobs by 2-3, for a ridiculous grand total of 1200-1800 change-filtering jobs per frame. At this point, your target framerate better be 30fps or else you'll be in trouble :D. However I won't use this as a main argument against this approach, because many would just choose to have weapon shooting be 1 frame late in this case. It's still a sacrifice to make, though

    Keep a single global NativeQueue of attribute entities that must be updated (in other words; a "RecalculateAttribute" events queue). Then have a single job that iterates on them and recalculates them in order. If a recalculated attribute knows it has other attributes that depend on it, it simply adds those attribute entities to the end of the queue. So no matter how many attribute-recalculation chain reactions we have, it can all be done in one job, in one frame. The cost of recalculating attributes will be slightly higher than approach 1 because of having to use CDFE, but we eliminate the massive constant cost of scheduling & change-polling for 600 jobs. In this case, I think it's a great tradeoff.

    This solution is pretty much an example of "events creating other events upon being processed", but all done in one job.
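A minimal sketch of that single-job queue drain, assuming each attribute entity holds its value plus a buffer of dependent attribute entities (all names here are hypothetical; the era-appropriate ComponentDataFromEntity/BufferFromEntity lookups stand in for what the post calls CDFE):

```csharp
[BurstCompile]
public struct RecalculateAttributesJob : IJob
{
    // The "RecalculateAttribute" events queue: attribute entities to update.
    public NativeQueue<Entity> ToRecalculate;
    public ComponentDataFromEntity<AttributeValue> AttributeValues; // CDFE random access
    [ReadOnly] public BufferFromEntity<AttributeDependent> Dependents;

    public void Execute()
    {
        // Cap total processed entries to guard against infinite dependency loops.
        int processed = 0;
        const int MaxProcessed = 10000;

        while (processed++ < MaxProcessed && ToRecalculate.TryDequeue(out Entity attribute))
        {
            // ... recompute this attribute's final buffed value here ...

            // Chain reaction: enqueue every attribute that depends on this
            // one, so arbitrarily deep chains resolve within this single job.
            DynamicBuffer<AttributeDependent> dependents = Dependents[attribute];
            for (int i = 0; i < dependents.Length; i++)
            {
                ToRecalculate.Enqueue(dependents[i].AttributeEntity);
            }
        }
    }
}
```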

    The cost is directly proportional to the # of attributes to recalculate. If no attributes need to be recalculated, this approach has a near-zero cost. You could call for attributes recalculation 50 times per frame if you wanted to, and the impact would barely be visible if no attributes need a recalculation

    No need for manual definition of attribute update orders with this. You can have a spell that makes your Strength scale with the Agility of a target, or your Agility scale with the Strength of a target. There is a way to prevent infinite loops, by keeping track of how many "passes" we've gone through in the queue and imposing a maximum. In the case of an infinite loop, the attribute value will simply keep growing every frame without blocking the program. I think it's an outcome that theoretically makes sense, but whether that's a good thing or a bad thing is up to you to decide

    In both cases however, you'll face the challenge of how to efficiently apply all the different types of buffs on a specific attribute entity when the time comes to recalculate it. Since the values of a buff can depend on the final buffed value of another attribute (like a buff whose multiplier value scales with Intelligence of caster, even after being applied), buff effects must be calculated on a per-individual-attribute basis; not globally per buff type. This is the main reason why I've made the PolymorphicStructs tool. Trying to avoid that strategy will likely result in having to do tons of sync points with structural changes (or tons of change polling), and having to schedule tons of additional buff-related jobs per frame. But this post is already too long so I'll spare the details
     
    Last edited: Apr 24, 2022
    Krajca, lclemens and toomasio like this.
  7. toomasio

    toomasio

    Joined:
    Nov 19, 2013
    Posts:
    195
    Thanks @PhilSA for the very detailed answer. Glad to know there are others who already tested some solutions to this problem. I think I am going to use your poly structs pattern...although it does feel like "cheating" for some reason. Maybe Unity will come up with some magic "generic" container struct (if even possible) that will solve this issue in the future...because I feel like this is a pretty large problem when it comes to the ECS pattern.
     
  8. CodeSmile

    CodeSmile

    Joined:
    Apr 10, 2014
    Posts:
    4,011
FYI I ported the project to make use of Unity TestRunner with the Unity.PerformanceTest package, as an exercise in using both with ECS code.

Here are the first results. I haven't verified that it actually does the same thing as playing the scene, but the numbers seem reasonable at first look:

[attached image: upload_2022-4-26_20-9-6.png]

    I'll try to manage sending a pull request in a couple days.
     
    Last edited: Apr 27, 2022
  9. CodeSmile

    CodeSmile

    Joined:
    Apr 10, 2014
    Posts:
    4,011
    @PhilSA Pull request is sent.

    I made a couple observations while running the tests and playing around with DOTS/Jobs/Burst options (Ryzen 9-3900 12 cores):
    • Jobs => Burst => Safety Checks => Off
      • Has a noticeable effect on some tests, even changing the order of results (when sorted by median). This may need to be considered in the summary. If I remember correctly, in release builds those safety checks are off, right?
    • added [BurstCompile(OptimizeFor = OptimizeFor.Performance)] to all Jobs
      • Some tests benefit from this, others don't. Those that do are between 0.5-1.0 ms faster (median of: A, D, G, H, I)
    With above optimizations, here are my results for Edit mode:
[attached image: upload_2022-4-27_13-15-33.png]

    ... and for Play mode (median more than 0.3 ms faster: C, F, H, J):
[attached image: upload_2022-4-27_13-19-38.png]
     
    Last edited: Apr 27, 2022
    JesOb and PhilSA like this.
  10. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,926
    merged it! Thanks for this contribution, it's gonna make testing a whole lot easier

     
    Last edited: Apr 27, 2022
  11. rivFox

    rivFox

    Joined:
    Nov 16, 2016
    Posts:
    6
    @PhilSA , can you edit your first post to fix `System Code` link? Now all files have a prefix. Thank you ;)
     
    PhilSA likes this.
  12. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    908
I now have a pretty good solution. @DreamingImLatios would be proud, as it follows his suggestion and way of thinking about the problem.
I had already posted the naive version, which ran pretty slowly. What it did was write "events" to a NativeMultiHashMap, then iterate over the chunks of entities with Health, looking up whether there's a key/event and applying all the events. In my project I'm using 2 NMHMs: one that has the source as key and one where the target is the key.

    The problem with the solution was the slowness of writing to the NMHM. So with the help of @tertle I managed to develop a pretty good solution.
    3 extensions methods are needed:
    - the first extension converts the key and value pointers from a NMHM to a NativeList
    - the second extension reserves a whole block when adding to a NativeList, so the Interlocked on the length isn't stalling worker threads
    - the third extension calculates the buckets once writing to the key/value arrays is done (this is FAST, like 0.8ms for 500k keys; my machine is a factor of 2 slower than PhilSA's, all his tests run twice as fast, hehe :) )
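The actual extensions aren't shown in this thread, but the block-reservation trick in the second one can be sketched roughly like this (illustrative only, not the real code):

```csharp
// Instead of one Interlocked increment per element, each writer reserves a
// block of indices at a time, so the shared length counter is contended far
// less often. Partially-filled blocks leave empty slots, which matches the
// downside discussed in this post.
public unsafe struct BlockReservingWriter<T> where T : unmanaged
{
    const int BlockSize = 64;

    [NativeDisableUnsafePtrRestriction] public T* Buffer;       // backing array
    [NativeDisableUnsafePtrRestriction] public int* SharedLength;

    int blockStart;
    int blockUsed;

    public void Add(in T value)
    {
        if (blockUsed == 0 || blockUsed == BlockSize)
        {
            // One atomic operation per block instead of one per element.
            blockStart = System.Threading.Interlocked.Add(ref *SharedLength, BlockSize) - BlockSize;
            blockUsed = 0;
        }
        Buffer[blockStart + blockUsed++] = value;
    }
}
```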

    The downside with this solution is:
    - Enough memory has to be allocated for the NMHM up front, as no further memory can be allocated in parallel. Overflowing would lead to lost events (pretty terrible), but when you're mindful about how many events an entity can fire, it's okay. And this is only really a problem when ALL entities fire events in a single frame; in a real game environment, the chance of running into this is pretty much 0.
    - The NativeList extension's setup runs for every entity in a chunk's Execute. On the first add, a block of a certain size is reserved. The fewer entities fit in a chunk, the more blocks will be reserved, and some elements in these blocks can be left empty. There's another method that fills these empty elements, but it certainly would be nice if this didn't happen. I have no solution for it though, as the number of fired events is unknown.
    - There's an overhead when no events are happening, because the entities still have to be polled. This could be improved with a ChunkComponentData, although it's hard to say if it's worth it.

    This method scales really well and has the benefit of being extendable and flexible enough for managing such events.
For example, I need to manage TriggerMaps, IsCombatFlag, casting pushback when damaged, and several effects like absorb. Events on targets have different code than on sources. All this can now be split up very conveniently, and one-time events that don't need to be applied every time (like IsCombatFlag) can be optimized. Furthermore, much data can be calculated on the stack and written back at the end, instead of operating on the IComponentData and writing back all the time.

Overall, this is faster than any of the tests written here, except for Interlocked, which is still king in this particular case.
Although I still want to get CPU times closer to Interlocked's, we have to be mindful that Interlocked doesn't scale that well for every piece of code; some code can't be translated to it that well, or at all.

For very specific, fixed, well-known use cases/problems that operate on basic data types, I'd still use Interlocked without any question. Really, the amount of work needed to compete with Interlocked is NOT worth it.
In the more general case of ours, where damage can do a number of things, Interlocked can get in the way as complexity rises, and having a general race-condition problem is never good.
     
    Last edited: Apr 27, 2022
    toomasio and PhilSA like this.
  13. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,926
    If I understand correctly, this approach would mostly be for when events are happening on most frames for most entities, correct? I suppose there's no way to do any sort of "change filtering" here in order to not have to check each entity every frame.

    EDIT: Or maybe our event writers could also set a byte to 1 on a "IsDirty" component on the affected entity, and do the change filtering on that
     
    Last edited: Apr 27, 2022
  14. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    908
Correct, there's certainly an overhead when no events are happening. I'll add this to the downsides.
I should add that it's possible to iterate over the NMHM instead. This would make the no-events case faster, but as it would involve random access to the entities, it's a lot slower under stress.
Also something to consider: when no keys are in the NMHM, the lookup is ridiculously fast, and so is the iteration over the chunks with just reads. In my project, running with 250k casters and no events, the "polling" job runs at 0.26ms.
     
    Last edited: Apr 27, 2022
    toomasio likes this.
  15. CodeSmile

    CodeSmile

    Joined:
    Apr 10, 2014
    Posts:
    4,011
    @PhilSA What are your system specs, CPU/Memory in particular?
    Tests E and J - which are already the fastest ones - are notably faster in your measurements while F (the slowest) is even a bit slower.

    My specs:
    AMD Ryzen 9 3900X 12-Core 3.8 GHz
    32 GB (2x16) DDR4-3600 G.Skill CL16
     
  16. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,926
    • Intel Core i9-12900K (16 cores, 24 threads, 3.2 GHz)
    • 32GB ram
    • RTX 3080 (not that it matters here)
     
  17. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    It sounds interesting. I'm not sure how it makes writing to NMHM faster from your description.

    Anyways, during my break away from this conversation, it became much more apparent what you guys want. And while I disagree with it, I can at least provide a solution that I think you guys may like and would be interesting to add to the performance comparison (assuming you understand it). Unfortunately, I don't have a bread-and-butter example of this, since every time I employ it I take advantage of other assumptions of the problem which results in different tweaks. This is effectively my workaround for some of the issues I have with NMHM. The pattern goes like this:
    1. Write key-value pairs to blocklists (NativeStream or similar) in parallel.
    2. Copy the blocklist to a NativeList.
    3. Sort the NativeList.
    4. Allocate a second NativeList and store start offset and count pairs for matching keys
    5. Iterate by key in parallel.
    The main benefits of this technique:
    • No capacity constraints
    • With the right algorithm, the sorting step makes the behavior deterministic, something you don't get with NMHM.
    • Steps 2, 3, and 4 do not depend on ECS and can run during a sync point.
Sorting might sound really slow, but it isn't if you pick a good algorithm. And it kills two birds with one stone, in that it solves both the grouping problem and the determinism problem.
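The five steps can be condensed into a sketch like this (plain illustrative types, not code from the linked systems; the parallel write of step 1 and the parallel consume of step 5 are elided):

```csharp
// Key-value pair sorted by key. For determinism with tied keys, extra
// tiebreak fields would be needed, as the binding-system example explains.
public struct KeyValue : System.IComparable<KeyValue>
{
    public int Key;   // e.g. the target entity's index
    public int Value;
    public int CompareTo(KeyValue other) => Key.CompareTo(other.Key);
}

public struct KeyRange { public int Start; public int Count; }

// Steps 2-4: after merging the blocklists into one NativeList (step 2),
// sort it and record (start, count) ranges per unique key. This doesn't
// touch ECS data, so it can run during a sync point.
public static NativeList<KeyRange> GroupByKey(NativeList<KeyValue> merged, Allocator allocator)
{
    merged.Sort();                                        // step 3
    var ranges = new NativeList<KeyRange>(allocator);     // step 4
    for (int i = 0; i < merged.Length; )
    {
        int start = i;
        int key = merged[i].Key;
        while (i < merged.Length && merged[i].Key == key) i++;
        ranges.Add(new KeyRange { Start = start, Count = i - start });
    }
    return ranges; // step 5: each range can be handed to a parallel job
}
```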

    My first example is in a binding system, where I need to bind skinned meshes to their skeletons. The skinned meshes have a reference to the skeleton, but the skeletons need to know about the bindings and do some setup work. Here's the behemoth system that does this work:
    https://github.com/Dreaming381/Kine.../Systems/SkeletonMeshBindingReactiveSystem.cs

The key variables are at lines 80-84. The jobs scheduled at lines 88-118 are step 1. Steps 2-4 happen at 174 and step 5 happens at 206. 247 defines the binding operation. Here you can see I am using IComparable, as for this implementation I am using Unity's sorting algorithm. I sort primarily by the target entity, which is the main key, but for ties I then sort by the operation type, since that gives free batching of operations in the job that consumes all the bound meshes. And because Unity's sort is unstable (which breaks determinism in the case of ties), I use the source mesh as a tiebreaker, since that is guaranteed to be unique. Steps 2-4 run during a sync point. And you can see the implementation of that job at 478. The step 5 job definition is at 765.

    Also in that system is a similar pattern recording new, changed, and removed mesh blobs. For this one, steps 2-5 all happen in a single IJob which also maintains allocations in a GPU buffer and reference counts. Both sorts run while SystemState components are added to new entities on the main thread, and they both finish before the main thread does, essentially making them "free".

    Another use case is how I do instantiations, which I wrote a full article on here: https://github.com/Dreaming381/Lati...tion Adventures/Part 4 - Command Buffers 1.md

    There, I need to batch prefab entities together, but each prefab entity reference also has a block of component values that need to be assigned to the resulting instantiated entity. So what I do is I split the key (prefab and sortKey) and value (components to assign) into two separate blocklists. Then I sort the keys as well as pointers to the values so that I don't waste memory bandwidth moving values around and that keeps the sort fast. Because keys aren't guaranteed to be unique (in fact they often aren't), I need a stable sorting algorithm for determinism, so I use a radix sort. I actually end up sorting the values a second time to group the pointers by chunk so that I can chunk-iterate the newly instantiated entities.

    And finally there's this example, where I need to scatter chunks into lists based on a chunk component mask:
    https://github.com/Dreaming381/Kine...xtremeHierarchy/ExtremeLocalToParentSystem.cs

    Lines 71-87 are the scheduling, and lines 204-262 are the jobs. For this, determinism wasn't important, so there is no sort step. Instead, I end up having an array of blocklists and an array of NativeLists for the scattering, because I have a limited number of mask bits and need to schedule an independent job for each.
     
  18. lclemens

    lclemens

    Joined:
    Feb 15, 2020
    Posts:
    714
    Hey everyone, sorry to bug you all, but now that 1.0 is out, I'd like to get your opinions on how it affects your approaches to "events".

    One quick thought - From the benchmarks we can see that most of the approaches @philsa-unity profiled have a minor cost of "polling" for changes even when no events are being sent. While small, this might add up if there are say 20 systems all listening for an event and there are lots of entities. An enableable component will turn the polling from a poll-per-entity to a poll-per-chunk.

    Another idea I had is that global events are possible now by putting a collection in a singleton. I saw that tertle is using that now.

    What other thoughts do you have on the topic? Has anything in 1.0 caused you to switch things up?
     
  19. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,626
    Since you mentioned it, my little implementation for this is here:
    https://gitlab.com/tertle/com.bovinelabs.core/-/tree/0.15.5/BovineLabs.Core/SingletonCollection

It's basically a many-to-one collection singleton. Each system creates its own native container and stores it on the singleton, and one system does the work from this.

Basically, think of how EntityCommandBufferSystem works, except instead of only supporting EntityCommandBuffer, it supports NativeArray, NativeList, NativeQueue, NativeHashMap, NativeParallelHashMap, NativeParallelMultiHashMap and NativeEventStream (my old event system container).

Also, the reader system can work on it in a job by using a double rewind allocator (unlike ECBs, which do their work on the main thread).
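A rough illustration of that many-to-one shape (not the BovineLabs implementation; all names hypothetical):

```csharp
// Writer systems each create their own container and register it on the
// singleton; one reader system later drains every registered container.
public struct DamageEventSingleton : IComponentData
{
    internal UnsafeList<NativeQueue<DamageEvent>> Registered;

    // Called once per writer system; the returned queue is owned by that
    // system and is safe for its own jobs to write to.
    public NativeQueue<DamageEvent> CreateEventQueue(Allocator allocator)
    {
        var queue = new NativeQueue<DamageEvent>(allocator);
        Registered.Add(queue);
        return queue;
    }
}
```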

    Again though, I still really only use this for debugging tools. My current implementations are on my DrawSystem and AI debugging.
     
    Last edited: Jul 6, 2023
    lclemens and philsa-unity like this.
  20. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Enableables allow for a few new tricks, but they are particular to very specific situations (mostly strict 1-1 relationships). Far from any silver bullet, and don't really help the generalized desires this thread was seeking to fulfill.

    Singletons don't add anything extra other than slight performance improvements when the "event containers" need to cross system boundaries, and probably a lot less code to accomplish the same tasks.

    But if you can't tell, I haven't really changed my opinion at all on this. If you really care about the performance, you have to really define your problem and the things you can assume about it, and then pick a data flow that minimizes the amount of memory work required. If you don't care about performance quite to that extent, just pick the simplest single-threaded technique you can think of and move on.
     
    lclemens and philsa-unity like this.
  21. lclemens

    lclemens

    Joined:
    Feb 15, 2020
    Posts:
    714
    I was thinking that writers could append to the buffer/list/stream and then enable a component to indicate that the buffer size is > 0. Then the readers would check for the existence of the enableable component and only run when it exists (thus polling per chunk instead of per entity).
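Sketched with Entities 1.0 names (the event and flag types here are hypothetical):

```csharp
// Flag enabled by the single writer whenever the buffer becomes non-empty.
public struct HasPendingEvents : IComponentData, IEnableableComponent { }
public struct DamageEvent : IBufferElementData { public float Amount; }

public partial struct DamageEventReaderSystem : ISystem
{
    public void OnUpdate(ref SystemState state)
    {
        // The query only matches entities whose HasPendingEvents flag is
        // enabled; chunks with no enabled bits are skipped wholesale, which
        // is the per-chunk (rather than per-entity) polling described above.
        foreach (var (events, enabled) in
                 SystemAPI.Query<DynamicBuffer<DamageEvent>,
                                 EnabledRefRW<HasPendingEvents>>())
        {
            for (int i = 0; i < events.Length; i++)
            {
                // ... react to events[i] ...
            }
            events.Clear();
            enabled.ValueRW = false; // drained; stop matching the query
        }
    }
}
```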
     
    Last edited: Jul 7, 2023
  22. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Not thread-safe.
     
  23. lclemens

    lclemens

    Joined:
    Feb 15, 2020
    Posts:
    714
    I think it would work with Approach B (a single writer thread). It would also work in an event system where an ECB is used to append events.
     
  24. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    I suppose those are both true. I didn't really think of that because almost always there are better options available if performance matters that much.
     
    lclemens likes this.