
Official Professional Service DOTS-Porting Postmortem

Discussion in 'Entity Component System' started by victorluishongchau, Feb 25, 2021.

  1. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Professional Service DOTS-Porting Postmortem

    At Unity Professional Services, we help our users become more successful with their projects made with Unity. Often we help them achieve their performance goals through consulting; often we embed in a user’s team and write some subsystems for their project; and sometimes we even write the majority of a project’s code. Here, we will share some interesting technical details from one such project, where we produced a proof of concept by porting a vertical slice of a Unity project to the Data-Oriented Tech Stack (DOTS).

    Unity has a very easy-to-learn object-oriented tech stack where Transforms and MonoBehaviours create a framework for building games, Rigidbodies and Colliders are used for physics simulation, and MeshRenderers and SkinnedMeshRenderers are used for rendering. These features allow users to build and prototype projects quickly and help developers focus on creating their vision without having to dive deep into the code.

    However, we knew we could put even more power into users’ hands by utilizing the full capability of the hardware. DOTS is a set of technologies that aims to help developers achieve “performance by default” by embedding data-oriented principles into our new game engine framework. The Unity Professional Services team explored this commitment to more performant code by refactoring an existing Unity project to use DOTS, in the hope that we would see significant performance gains.

    In this post, we’ll share some technical lessons we learnt as we developed our DOTS project. In particular:
    • We first discuss our skinning system as an introduction to writing performant scripts in DOTS.
    • Then, we dive into our implementation of a character modification system to explore resource management in DOTS.
    • Using the same example, we discuss the value of leveraging lower-level DOTS technologies, such as Native Collections, Burst and Jobs, alongside the core Entity Component System (ECS) framework.
    • Finally, we explore various approaches to efficiently perform structural changes with ECS through some benchmarking tests we wrote for our character modification system.
    Note that DOTS is still in active development and that best practices might change as it evolves. For reference, our project was developed on Unity 2020.1 with the following ECS package versions:
    • "com.unity.entities": "0.11.1-preview.4"
    • "com.unity.rendering.hybrid": "0.5.2-preview.4"
    • "com.unity.physics": "0.4.1-preview"
    Before reading further, it may also be helpful to first get familiar with ECS concepts, some example DOTS code, and the DOTS best practices guide (written by a colleague in our team!).
     
    Last edited: Mar 23, 2021
  3. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Skinning

    When our project began, the built-in DOTS skinning system was in the DOTS Animation package, which was still in the very early stages of development. Since our project’s simulation is completely driven by physics - even the characters are all active ragdolls - and we needed to minimize technical risk from dependencies, we decided to avoid dependency on DOTS Animation and write our own simple DOTS skinning system.


    The system is fast because it is very simple:
    • In the vertex shader, the skinning matrices are fetched from a Compute Buffer. Multiplication of vertices by skinning matrices and pose blending are performed on the GPU.
    • The Compute Buffer is uploaded every frame from the CPU to the GPU, as sketched after this list. If the buffer were large, this could be a performance concern, because CPU-to-GPU data transfer is relatively slow and the GPU may stall waiting for the data to be uploaded. In our case, however, the buffer is small, so this is fine.
    • For each bone, its skinning matrix is calculated from a number of parameters - including the bone’s LocalToWorld transform - and written to the Compute Buffer’s CPU-side memory. This is implemented with ECS Systems.
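
    As an illustration, here is a minimal sketch of the per-frame upload, with a hypothetical uploader class, bone budget, and shader property name; in our project the matrices are written by ECS Systems, but the upload step looks roughly like this:
    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;
    using UnityEngine;

    // Hypothetical sketch: re-upload the CPU-side skinning matrices every frame.
    public class SkinningMatrixUploader : MonoBehaviour
    {
        ComputeBuffer buffer;
        NativeArray<float4x4> matrices; // written by the ECS Systems each frame

        void Start()
        {
            const int boneCount = 256; // project-specific bone budget (assumption)
            matrices = new NativeArray<float4x4>(boneCount, Allocator.Persistent);
            buffer = new ComputeBuffer(boneCount, 64); // 64 bytes per float4x4
            Shader.SetGlobalBuffer("_SkinningMatrices", buffer); // property name is an assumption
        }

        void LateUpdate()
        {
            // The buffer is small, so a full re-upload each frame is cheap.
            buffer.SetData(matrices);
        }

        void OnDestroy()
        {
            buffer.Release();
            matrices.Dispose();
        }
    }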
     
    Last edited: Mar 17, 2021
  4. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Updating skinning matrices

    When constructing the skinned mesh in DOTS, the most intuitive thing to do is to let the skinned mesh Entity keep a DynamicBuffer of the bone Entities that animate the skinned mesh.


    If we structured our Components in this way, we’d have to update the skinning matrix roughly as below:
    Code (CSharp):
    struct SkinningMatrixStorage : IComponentData
    {
        public int offset;
    }

    struct BoneElement : IBufferElementData
    {
        public Entity Value;
    }

    public class BoneUpdateSystem : SystemBase
    {
        // Writes one skinning matrix into the Compute Buffer's CPU-side memory.
        // (Bodies and remaining parameters elided in this post.)
        void StoreSkinningMatrix(int offset, in float4x4 value) {}

        float4x4 CalculateSkinningMatrix(in float4x4 boneLocalToWorld, …) {}

        protected override void OnUpdate()
        {
            // Read-only random-access lookup of any Entity's LocalToWorld.
            var getLocalToWorld = GetComponentDataFromEntity<LocalToWorld>(true);

            Entities
                .WithReadOnly(getLocalToWorld)
                .ForEach((Entity entity,
                          in SkinningMatrixStorage skinningMatrixStorage,
                          in DynamicBuffer<BoneElement> bones) =>
                {
                    for (int i = 0; i < bones.Length; ++i)
                    {
                        StoreSkinningMatrix(skinningMatrixStorage.offset,
                            CalculateSkinningMatrix(getLocalToWorld[bones[i].Value].Value, …));
                    }
                }).ScheduleParallel();
        }
    }
    The problem with this approach is that we are accessing the bone’s LocalToWorld data via ComponentDataFromEntity. This means that we have no guarantee that adjacent calls to read the bone’s LocalToWorld are reading data close to each other in physical RAM. Random access like this increases the CPU cache miss rate and has a significant negative impact on performance. (*)

    (*) It is true that a 4x4 LocalToWorld matrix is 64 bytes, which is the size of a full cache line on most CPUs these days, so data locality is not quite as important here on current-gen CPUs. However, linear access would still be better than random access, because it makes it more likely for data-prefetch optimizations from the compiler or hardware to operate successfully. Moreover, ComponentDataFromEntity in fact involves two indirections: one to look up the pointer to the data from the Entity Index, and one to read the actual data. Not having to call ComponentDataFromEntity for a wide variety of Entity Indices to look up memory pointers is also a significant saving. We omit these details for a simple introduction to DOTS scripting - let’s pretend we have a CPU with a very large cache line here.

    Since the most common operation in skinning is iterating over many bone Entities and reading their LocalToWorld component, we want to prioritize making this iteration fast. To achieve this, instead of a skinned mesh Entity keeping a DynamicBuffer of bone entities, let’s make each bone Entity keep a DynamicBuffer of those skinned mesh Entities that the bone Entity is animating.


    Then, we can iterate through each bone, calculate its skinning matrix, and store the result for the skinned mesh Entity this bone is animating - as below:
    Code (CSharp):
    struct SkinningMatrixStorage : IComponentData
    {
        public int offset;
    }

    struct SkinningBoneTarget : IBufferElementData
    {
        public Entity Value;
    }

    public class BoneUpdateSystem : SystemBase
    {
        // (Bodies and remaining parameters elided in this post.)
        void StoreSkinningMatrix(int offset, in float4x4 value) {}

        float4x4 CalculateSkinningMatrix(in float4x4 boneLocalToWorld, …) {}

        protected override void OnUpdate()
        {
            // Random access is now limited to the few skinned mesh Entities (read-only).
            var getSkinningMatrixStorage =
                GetComponentDataFromEntity<SkinningMatrixStorage>(true);

            Entities
                .WithReadOnly(getSkinningMatrixStorage)
                .ForEach((Entity entity,
                          in LocalToWorld localToWorld,
                          in DynamicBuffer<SkinningBoneTarget> skinningTargets) =>
                {
                    for (int i = 0; i < skinningTargets.Length; ++i)
                    {
                        StoreSkinningMatrix(
                            getSkinningMatrixStorage[skinningTargets[i].Value].offset,
                            CalculateSkinningMatrix(localToWorld.Value, …));
                    }
                }).ScheduleParallel();
        }
    }
    To explain why this is faster, let’s first recap some fundamental ECS concepts. An Entity is a handle that provides access to a collection of Components. Every unique combination of component types is called an Archetype. Component data for Entities belonging to the same Archetype are put in contiguous memory locations - a list of Chunks. Thus, ECS ensures that iterating over Component data of Entities of the same Archetype is cache-efficient.

    By adding a DynamicBuffer<SkinningBoneTarget> to every bone Entity, we force the bone Entities’ data to be in the same collection of Chunks. Entities.ForEach will iterate over these Chunks linearly, so we ensure reading the bone’s LocalToWorld matrix has a high cache hit rate.

    It is true that we are now using GetComponentDataFromEntity on the skinned mesh Entities instead, which makes this part of the code slower. However, in a typical scene there are only a few skinned mesh Entities, so all their data should fit in the CPU cache, while there are usually far more bone Entities. (*)

    (*) Careful readers may also ask: “If the only reason we’re accessing the SkinningBoneTarget Entity is to access its SkinningMatrixStorage component to get the offset, couldn’t we just store the offset instead of an Entity when we set everything up, and save the indirection?” In our project, we actually also need the LocalToWorld matrix of the SkinningBoneTarget, since our skinning needs to happen in the object space of the SkinningBoneTarget - otherwise DOTS camera frustum culling would not work. We hid this detail for clarity of exposition.
     
    Last edited: Feb 25, 2021
  5. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Accelerating iterations with Components

    Generalizing our skinning example, when you iterate over a collection of Entities fulfilling certain criteria, it is often worth adding a common Component to these Entities. In this way, we help ECS “categorize the data better”.

    Let’s say some of the Entities you are iterating over have Components A, B, C, and D. Their Component data may be interleaved with the data of other A, B, C, D Entities that you do not care about. Adding a Component E to the Entities you care about forces them to be stored in separate Chunks, away from the Entities that are irrelevant to your iteration. This change of memory layout ensures that your iteration will have a higher cache hit rate.
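
    As a minimal sketch of this idea (names hypothetical), an empty “tag” Component plus a query restricted to it ensures we only visit the Chunks of the tagged Archetypes:
    Code (CSharp):
    using Unity.Entities;
    using Unity.Transforms;

    // Empty tag Component: adding it moves the tagged Entities into their own Chunks.
    public struct ImportantTag : IComponentData {}

    public class ImportantOnlySystem : SystemBase
    {
        protected override void OnUpdate()
        {
            // Only Chunks of Archetypes containing ImportantTag are iterated,
            // so the LocalToWorld data we read is densely packed.
            Entities
                .WithAll<ImportantTag>()
                .ForEach((in LocalToWorld localToWorld) =>
                {
                    // ... work on the Entities we care about ...
                }).ScheduleParallel();
        }
    }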


    Whilst adding Components to group Entities together can be a useful optimization, it's important not to over-use this technique. Having a large number of unique Component combinations at runtime results in a large number of Archetypes. Since each Chunk only contains Entities of a single Archetype, we’ll get a large number of Chunks, each containing a small number of Entities. This situation is called "Chunk fragmentation". Because Chunks are currently always a fixed size of 16KB, Chunk fragmentation can result in wasted memory. Moreover, iterating over a group of Entities spanning many different Chunks may result in lots of cache misses and less efficient CPU usage.


    Note that just having a large theoretical number of Archetypes is not a problem in itself, since the actual distribution of Entities may be concentrated in only a few Archetypes. But if, using the Entity Debugger, you observe that the Chunk Utilization metric of an expensive System update is low, there may be room to “defragment your Chunks” and you may get a performance boost.


    If you do encounter Chunk fragmentation, the following approaches may help alleviate the situation:
    • If you were using a Component without data (a tag Component) to distinguish Entities, instead distinguish them by checking a bit-flag in a Component present on all the Entities. Here is an example of how this could be achieved (see also the sketch at the end of this post). This approach makes the iteration using the bit-flag slower, but increases Chunk occupancy, so other Entities.ForEach iterations may be faster.

      Looking ahead, the DOTS development team plans to support disabling Components in future releases. When that feature is ready, we can simply add disabled data-less Components to all Entities and use their Enabled state to distinguish them - achieving a similar outcome more neatly and efficiently.
    • Instead of using a Component with some data to distinguish Entities, you can maintain a singleton Entity with a DynamicBuffer of “Entity reference plus some data”. Then, we can access these Entities’ Component data using GetComponentDataFromEntity. Here is an example. This makes the iteration using GetComponentDataFromEntity slower, but increases Chunk occupancy, so other Entities.ForEach iterations may be faster. (*)

      (*) By “Entity reference”, we mean having a field of type Entity, which is a handle to some Component data. It is not to be confused with references in C#. We’ll always use “reference” in the general sense of the word rather than the C#-specific sense.

      Although using GetComponentDataFromEntity may be slow by DOTS standards, random access by indirection like this is very common in object-oriented programs. Therefore, it may not be too bad - especially if your DynamicBuffer is relatively short.
    • Combining two Systems into one and getting them to share a Component is also a way to address this. Again, the combined System will be slower, but it keeps other Systems fast.
    • Finally, consider splitting an Entity with many Components into a number of Entities with fewer Components. This makes iterating over two Components originally on the same Entity slower, but could help keep Chunk Utilization higher.
    In other words, when Chunk fragmentation happens, it is worth trying to make some “less important” Systems slower to make “more important” Systems faster. There isn’t a “perfect solution” without tradeoffs. Therefore, it’s advisable to keep profiling to ensure progress is made with each refactor.
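
    As a sketch of the first approach above (names hypothetical), the bit-flag check replaces the separate Archetype:
    Code (CSharp):
    using Unity.Entities;

    // One Component present on all Entities; a bit-flag distinguishes them,
    // so no extra Archetype (and no extra Chunks) is created.
    public struct CharacterFlags : IComponentData
    {
        public const uint NeedsUpdate = 1u << 0;
        public uint Value;
    }

    public class FlagFilteredSystem : SystemBase
    {
        protected override void OnUpdate()
        {
            Entities.ForEach((in CharacterFlags flags) =>
            {
                // Branching here is slower than iterating a dedicated Archetype,
                // but Chunk occupancy stays high for every other System.
                if ((flags.Value & CharacterFlags.NeedsUpdate) == 0)
                    return;
                // ... do the work ...
            }).ScheduleParallel();
        }
    }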
     
    Last edited: Feb 25, 2021
  6. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Character modification

    The player character is often modified at runtime. A chain (simulated by physics) may be tied to the character’s feet; the character may wear a cape on their back; accessories such as glasses may be attached to the character’s head. To serve these purposes, our character modification system needs to focus on 3 areas:
    • The assets for character modification, such as Meshes, Materials and Prefabs, need to be appropriately stored in and loaded from disk. We need a resource management system for DOTS Unity like the AssetBundle system for object-oriented Unity.
    • The characters are skinned meshes, and we may attach extra skinned meshes on top. Without hidden surface removal of mesh attachments close to the character’s body surface, there will be visual artefacts as the character is animated. Therefore, we need a system handling dynamic hidden surface removal.
    • After the assets are loaded and hidden surface removal is completed, we need to remove existing modifications, instantiate new Entities, and attach them to the existing character. Structural changes such as instantiating Entities, adding Components, and destroying Entities are often bottlenecks in DOTS project performance. We need to find a way to perform structural changes in a fast and maintainable fashion.
     
    Last edited: Feb 25, 2021
  7. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Resource management

    Modification command Prefabs

    Character modification data starts as Prefabs in our project’s Asset folder. They share the bone hierarchy structure of the characters and have various additional GameObjects attached. For example, a “wear banana suit Prefab” would be the whole character’s skeletal hierarchy (with no Component attached) plus a SkinnedMeshRenderer referencing the skeletal hierarchy. Let’s call these the “character modification Prefabs”.

    In the original project, we compare the hierarchy of these “character modification Prefabs” to the character’s bone hierarchy and merge the additional GameObjects into the character at runtime.

    During our port, we optimized the project by converting each “character modification Prefab” into a number of “modification command Prefabs”. Each “modification command Prefab” is composed of one Prefab instantiation followed by some kind of attachment to the character in a simple and explicit manner, such as:

    “Instantiate the glasses and make it the child of the nose bone; with this LocalToParent transform”

    “Instantiate this cape and skin the mesh using spine skeleton bone 1, 2, and 3; with these bind pose matrices”

    “Instantiate this chain and attach the root to the character’s left foot using this joint”


    Since we handled the complexity of merging at build time, at runtime our modification system simply has to run through each “modification command Prefab” and perform instantiation and attachment one after the other. On modern hardware, memory latency is often a greater performance bottleneck than CPU cycles; our optimization avoids unnecessary queries of data structures at runtime and ensures the character modification process is as fast as possible.

    Entity Scene usages

    Next, we have to store these “modification command Prefabs” on disk and load them.

    In object-oriented Unity, assets are stored in AssetBundles. In a DOTS project, the parallel is Entity Scenes. Another important concept in DOTS is BlobAssets, which is a little like ScriptableObject in object-oriented Unity. A great introduction to both topics is this talk from Unite Copenhagen. This talk from Unity LA explains how DOTS asset-loading works in a little more detail.



    To briefly recap: Conversion Systems take GameObjects and generate Entities and BlobAssets. Components on Entities may still refer to some UnityEngine.Objects, such as Meshes and Textures; these references are managed internally by the ECS framework. When building an Entity Scene, the Entities’ Component data in Chunks and the BlobAsset data are stored on disk together as binary data, while the referenced UnityEngine.Objects’ data is built into a number of supporting AssetBundles.
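
    As a small illustration of the conversion step (the authoring type and Component are hypothetical), a MonoBehaviour can declare how its GameObject converts to an Entity:
    Code (CSharp):
    using Unity.Entities;
    using UnityEngine;

    public struct Health : IComponentData { public float Value; }

    // During Entity Scene building, the conversion world calls Convert on the
    // authoring MonoBehaviour and the result is serialized into the Entity Scene.
    public class HealthAuthoring : MonoBehaviour, IConvertGameObjectToEntity
    {
        public float Value = 100f;

        public void Convert(Entity entity, EntityManager dstManager,
                            GameObjectConversionSystem conversionSystem)
        {
            dstManager.AddComponentData(entity, new Health { Value = Value });
        }
    }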

    At runtime, loading BlobAssets is very fast, involving no deserialization. Loading Entities is just as fast, except that we need to remap the Entity Indices, in case the Entity Indices in the Entity Scene data on disk overlap with Entity Indices in the current World. Loading referenced UnityEngine.Objects from AssetBundles is relatively slower, since it may involve extra decompression and deserialization.

    Currently, the concept of DOTS Prefab is quite simple. It is a LinkedEntityGroup of Entities with Prefab Components attached. The presence of Prefab Components ensures DOTS Prefabs do not get returned from EntityQueries unless “explicitly asked to”.

    Some projects may bake GameObject Prefabs with ConvertToEntity Components attached into AssetBundles. Instantiating these GameObject Prefabs will trigger GameObject-to-Entity conversion at runtime, and DOTS Entities will indeed get loaded in the end. However, this is not a recommended approach, because GameObject-to-Entity conversion should happen as much as possible at build time rather than at runtime. Be sure to use Entity Scenes in your DOTS project instead, otherwise you may get wasted memory and CPU spikes.

    Our “modification command Prefabs” are stored in Entity Scenes in the following fashion:



    BlobAsset is very flexible and lets us build arbitrarily nested arrays of data structures. We can thus encode a collection of “modification commands” into one Modification BlobAsset. For example, “attaching a chain to the left foot” and “wearing a pair of glasses” are contained in the same Modification BlobAsset for “modification into a short-sighted criminal”.

    As mentioned, each “modification command” is associated with a Prefab to instantiate. However, BlobAsset cannot contain Entity references. To resolve this, we attach a Component referencing our Modification BlobAsset to an Entity with a DynamicBuffer of Prefabs. Then, our BlobAsset can refer to Prefabs by index in that DynamicBuffer.
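
    A minimal sketch of this layout (all names hypothetical) might look as follows:
    Code (CSharp):
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Mathematics;

    // One command: which Prefab to instantiate and how to attach it.
    public struct ModificationCommand
    {
        public int PrefabIndex;        // index into the PrefabElement DynamicBuffer
        public float4x4 LocalToParent; // attachment transform
    }

    public struct ModificationBlob
    {
        public BlobArray<ModificationCommand> Commands;
    }

    // The DynamicBuffer that the BlobAsset indexes into.
    public struct PrefabElement : IBufferElementData
    {
        public Entity Value; // the Prefab to instantiate
    }

    public struct ModificationRef : IComponentData
    {
        public BlobAssetReference<ModificationBlob> Blob;
    }

    public static class ModificationBlobBuilder
    {
        public static BlobAssetReference<ModificationBlob> Build(ModificationCommand[] commands)
        {
            using (var builder = new BlobBuilder(Allocator.Temp))
            {
                ref var root = ref builder.ConstructRoot<ModificationBlob>();
                var array = builder.Allocate(ref root.Commands, commands.Length);
                for (int i = 0; i < commands.Length; ++i)
                    array[i] = commands[i];
                return builder.CreateBlobAssetReference<ModificationBlob>(Allocator.Persistent);
            }
        }
    }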

    The DOTS Physics system also uses BlobAssets to store Joint and Collider data. When attaching an instantiated Prefab to an existing Entity using joints, we need to create a Joint Entity at runtime, which requires Joint data; Collider data is useful during hidden surface removal. These extra BlobAssets are also kept in DynamicBuffers, and our main BlobAsset refers to them by index.

    Entities and BlobAssets are much faster to load than UnityEngine.Objects because they do not need to go through deserialization. The price of this zero-deserialization loading is that they are sometimes not as easy to handle as UnityEngine.Objects: Entities can keep DynamicBuffers to structure data, but it’s not possible to nest any further within the same Entity; BlobAssets can nest easily, but they cannot contain Entity references.

    In a way, these constraints are good, since they encourage us to think about the data and avoid over-abstraction. After some data-oriented design, we have found that a good mixture of Entities, Prefabs and BlobAssets in an Entity Scene is flexible enough to encode most types of “resource data”.

    Inter-Prefab references

    A final issue we needed to address is inter-Prefab references. Let’s imagine we have a full-body banana costume. We’ll first need to instantiate some extra Entities and attach them to the character via joints for physics simulation; then we’ll need to attach a banana skinned mesh animated by both the character’s existing bones and the extra Entities we just instantiated. This means the second “modification command” needs to refer to Entities from the first.



    To express references to particular Entities inside another Prefab, each Prefab root keeps a DynamicBuffer of those Entities in the Prefab that external references point to. This DynamicBuffer thus translates the indices of its elements (which can be safely referred to from outside) into the appropriate Entity references inside the Prefab.

    When a Prefab is instantiated, a clone of the LinkedEntityGroup is created. Then the clone’s references to Entities inside the original Prefab are automatically remapped to their corresponding Entities in the clone. Thus, internal references held by our DynamicBuffer in a Prefab are appropriately remapped during instantiation.
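
    A sketch of the buffer element and its use (names hypothetical):
    Code (CSharp):
    using Unity.Entities;

    // Kept on each Prefab root: the Entities inside the Prefab that external
    // references may point to, addressed by a stable index.
    public struct ExposedEntity : IBufferElementData
    {
        public Entity Value; // remapped to the clone automatically on Instantiate
    }

    // After instantiation, an external index resolves against the clone:
    //   var exposed = EntityManager.GetBuffer<ExposedEntity>(instanceRoot);
    //   Entity target = exposed[externalIndex].Value;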

    This pattern has been great for gracefully handling inter-Prefab references in our project.

    Speaking of remapping during instantiation, one should be careful about instantiating very large Prefabs. When performing remapping during instantiation, the Entities package iterates through the Entity references in Components inside the Prefab and searches for any appropriate values to remap. Currently, this operation may scale more than linearly, so care should be taken when instantiating large Prefabs in DOTS. Fortunately for us, our Prefabs are always small.
     
    Last edited: Feb 25, 2021
    EduardoLauer, NotaNaN and Shinyclef like this.
  8. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Hidden surface removal

    Ray-casting with PhysicsWorld

    To perform hidden surface removal, we simply ray-cast from each triangle vertex across a small distance to detect collision with mesh colliders, and then remove triangles with any collisions. Fortunately, our project’s models do not have a large polygon count, which makes this approach feasible. Otherwise, an extra acceleration structure, such as a Signed Distance Field, may be needed, and part of the process may be better carried out on the GPU using a Compute Shader.

    In the original project, this process involves a lot of calls to Physics.Raycast on the main thread, and to put on a (say) banana suit would take around 27ms on a good PC. Using ray-casting from DOTS Physics over multiple worker threads, we can manage nearly 100x the workload within the same 27ms:



    DOTS Physics is in fact mostly written in Burst-compilable C# code on top of only Native Collections and Jobs; the majority of code in DOTS Physics is independent of the ECS framework in the Entities package. Every frame, Component data is read from the ECS framework and used to build a PhysicsWorld, the simulation runs, and the result is exported back to the ECS world.

    Since we already have physics simulation happening in our World, and since the ray-casting for hidden surface removal is unrelated to that simulation, we used a separate PhysicsWorld instance just for ray-casting purposes.

    When calling CollisionWorld.CastRay, you can provide an ICollector<RaycastHit> implementation. Through it, we can handle each RaycastHit as it occurs and make incremental changes; we can also early-out of the ray-cast when needed. This is useful for our project, since our hidden surface removal has complex rules over whether “X can occlude Y”.
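
    For illustration, a minimal custom collector might look like this (our actual occlusion rules are project-specific and omitted):
    Code (CSharp):
    using Unity.Physics;

    // Sketch: early-out on the first hit that passes our occlusion rules.
    public struct OcclusionCollector : ICollector<RaycastHit>
    {
        public bool Occluded;

        public bool EarlyOutOnFirstHit => true;
        public float MaxFraction => 1f; // consider the whole ray length
        public int NumHits { get; private set; }

        public bool AddHit(RaycastHit hit)
        {
            // Project-specific "can X occlude Y" checks would go here.
            Occluded = true;
            NumHits += 1;
            return true; // accepting the hit ends the query early (EarlyOutOnFirstHit)
        }
    }

    // Usage: var collector = new OcclusionCollector();
    //        collisionWorld.CastRay(rayInput, ref collector);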


    Ray-casting and structural change

    Ray-casting is faster in our DOTS port because we are doing it over many threads, and because our Burst-compiled ray-casting Jobs run more cache-efficient instructions utilizing SIMD. However, the most interesting part of our hidden surface removal system is its compatibility with DOTS structural changes.

    DOTS can be very fast and parallel when reading / writing component data. However, when a Component is added or when an Entity is instantiated / destroyed, the data structure supporting our fast Entities.ForEach needs to be updated. This process cannot happen while other Jobs are accessing ECS data. Therefore, often when structural change happens, Unity’s worker threads are idle.

    Since our hidden surface removal occurs in a PhysicsWorld separate from the main gameplay simulation, we are scheduling Jobs independent of any ECS data. Therefore, while structural changes happen and our Entities.ForEach code cannot run on worker threads, it is safe to run our ray-casting Jobs. In a way, we are doing our ray-casting on worker threads “for free”.

    In our project, we remove and add character modifications / perform structural changes on the main thread. At the same time, we perform hidden surface removal / ray-casting on worker threads. Finally, we sync on the main thread to submit the needed changes to our Mesh.
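
    A minimal sketch of this overlap pattern (the Job body is a stand-in for our ray-casts; all names hypothetical):
    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Jobs;

    // A Job touching only Native Collections - no ECS data - so it can keep
    // running on worker threads while structural changes happen.
    [BurstCompile]
    struct NonEcsWorkJob : IJobParallelFor
    {
        public NativeArray<float> Results;
        public void Execute(int index) => Results[index] = index; // stand-in for a ray-cast
    }

    public static class OverlapExample
    {
        public static void Run(EntityManager em, Entity oldAttachment, Entity prefab)
        {
            var results = new NativeArray<float>(4096, Allocator.TempJob);
            // 1. Schedule the non-ECS work onto worker threads.
            var handle = new NonEcsWorkJob { Results = results }.Schedule(4096, 64);

            // 2. Perform structural changes on the main thread meanwhile.
            em.DestroyEntity(oldAttachment);
            em.Instantiate(prefab);

            // 3. Sync, then consume the results (e.g. submit the Mesh changes).
            handle.Complete();
            results.Dispose();
        }
    }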


    Generalizing our ray-casting example, when one performs a complex simulation, it is often faster to use specialized algorithms and data structures. In these cases, one can use ECS memory as the source of truth and maintain these custom data structures accordingly using Native Collections. For example, as mentioned above, DOTS Physics maintains its own collision data structures (including a bounding volume hierarchy) for collision detection.

    With an algorithm working on memory in Native Collections outside of ECS, it is often a good idea to run it in Jobs during ECS structural changes - for example, during EntityCommandBuffer.Playback.
     
    Last edited: Mar 17, 2021
  9. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Structural change handling

    In our project, we maintain a NativeQueue which is filled with character modification commands via NativeQueue.ParallelWriter. Then, just before BeginSimulationEntityCommandBufferSystem updates, we go through the NativeQueue and execute structural changes directly via EntityManager APIs such as EntityManager.Instantiate. This approach is simple and fast enough for our needs.
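
    Roughly, the pattern looks like this (names hypothetical):
    Code (CSharp):
    using Unity.Collections;
    using Unity.Entities;

    public struct ModificationRequest
    {
        public Entity Character;
        public Entity CommandPrefab;
    }

    // Drains the queue and applies structural changes directly via EntityManager.
    [UpdateBefore(typeof(BeginSimulationEntityCommandBufferSystem))]
    public class ModificationApplySystem : SystemBase
    {
        // Gameplay Jobs enqueue requests through Requests.AsParallelWriter().
        public NativeQueue<ModificationRequest> Requests;

        protected override void OnCreate() =>
            Requests = new NativeQueue<ModificationRequest>(Allocator.Persistent);

        protected override void OnDestroy() => Requests.Dispose();

        protected override void OnUpdate()
        {
            while (Requests.TryDequeue(out var request))
            {
                var instance = EntityManager.Instantiate(request.CommandPrefab);
                // ... attach `instance` to `request.Character` ...
            }
        }
    }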

    Since structural change performance is an important topic in DOTS, we performed a post-mortem investigation to find out more about various approaches to handling this type of composite structural change.


    We wrote a "weapon changing benchmark” here. Let’s assume that every frame, for each character, we need to:
    • Destroy the old weapon
    • Instantiate the new weapon
    • Add Parent and LocalToParent Components to the new weapon
    • Call SetComponent to keep an external handle to the newly instantiated weapon
    We profiled an IL2CPP release build with 4096 weapon-changing characters on an Alienware R15 R3 with a 3.1 GHz Octa-Core Intel Core i9 CPU. (*) (**)

    (*) In the benchmark, we are calling AddComponent twice on the same Entity to add 2 different Components. It is true that using EntityManager.SetArchetype will be faster than calling EntityManager.AddComponent twice. But we omit this detail for clarity of benchmark comparisons.

    (**) The theme is “minions with lightsabers” : )

    Reorder and batch

    The standard approach to performing structural changes is through an EntityCommandBuffer: one encodes the structural changes to perform into the EntityCommandBuffer, and later, on the main thread, the buffer is played back. Our first approach, Approach0, uses EntityCommandBuffer, and the structural change itself takes 8.03ms on average.

    If there are many structural changes in a frame, it is much faster to perform them in a single batch. For example, calling EntityManager.DestroyEntity on single Entities 1000 times is significantly slower than calling EntityManager.DestroyEntity once with a NativeArray of 1000 Entities. Approach1 collects all commands to change weapons using a NativeQueue, and then destroys old weapons and adds Components to new weapons in batches using EntityManager APIs. Structural change time is reduced to 3.87ms.

    The performance improvement of Approach1 mainly comes from our ability to reorder and batch structural change commands. In general, if structural changes are a performance bottleneck, it may be worth attempting to reorder and batch the changes in user code before submitting commands to EntityManager or EntityCommandBuffer.
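
    The core of Approach1’s batching, sketched (helper names hypothetical):
    Code (CSharp):
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Transforms;

    public static class WeaponBatcher
    {
        // One EntityManager call per batch instead of one call per Entity.
        public static void Apply(EntityManager em,
                                 NativeArray<Entity> oldWeapons,
                                 NativeArray<Entity> newWeapons)
        {
            em.DestroyEntity(oldWeapons);                       // batched destroy
            em.AddComponent(newWeapons, typeof(Parent));        // batched AddComponent
            em.AddComponent(newWeapons, typeof(LocalToParent)); // (SetArchetype would be faster still)
        }
    }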

    EntityManager API overhead

    However, the EntityManager API is in general slower than the equivalent EntityCommandBuffer API. Approach2 collects commands to change weapons using a NativeQueue, and then executes each structural change command one by one using EntityManager APIs. The average structural change time for this approach is 11.28ms - much slower than Approach0, which uses an EntityCommandBuffer to do the same thing.

    Before and after the actual structural change in APIs like EntityManager.AddComponentData, some book-keeping is required. For example, we need to ensure all Jobs accessing Component data are completed before the structural change; we also need to ensure that each EntityQuery’s cache of “Chunks to be iterated over” is still correct after the structural change. Using an EntityCommandBuffer, this happens once, before and after the whole EntityCommandBuffer playback; but using the EntityManager API, this “preprocessing” and “postprocessing” happen much more frequently - once for every call into the EntityManager API.

    Moreover, EntityCommandBuffer calls into a single Burst-compiled playback function that performs all the recorded structural changes one after the other. In comparison, although the core structural change code of the EntityManager APIs is also Burst-compiled, the code iterating over commands from a NativeQueue is not; and we are calling into Burst-compiled code of a smaller scope numerous times, once for each structural change.


    To verify that these are the reasons for the performance difference between the EntityManager and EntityCommandBuffer APIs, Approach3 modifies the Entities package to eliminate these overheads from Approach2. Our structural change time drops to 7.55ms.

    As expected, the structural change times of Approach3 and Approach0 become comparable. In fact, Approach3 is slightly faster, possibly because EntityCommandBuffer returns a temporary Entity Index when the user calls EntityCommandBuffer.Instantiate, and there is overhead fixing up subsequent references to this temporary Entity Index during playback.

    Our finding about EntityManager overhead means Approach1 can be made even faster. Approach4 is essentially Approach1 but with the EntityManager API overhead removed. Structural change time drops to just 2.85ms.

    Balancing maintainability with performance

    Although Approach4 is nearly 3 times as fast as our original method, Approach0, it requires changes to the Entities package, which is not very maintainable. In the future, EntityCommandBuffer (ECB) will support taking a NativeArray of Entities as an argument for APIs such as AddComponent. When this is available, a good compromise between maintainability and performance may be the following:
    • Receive character modification commands from NativeQueue.ParallelWriter
    • Reorder and batch the commands on the main thread every frame
    • Record the batched commands into the ECB on the main thread
    • Let the ECB play back the batched structural changes immediately afterwards
    Here is a sketch of this method - it’s just a sketch, since we are relying on a yet-to-come API. Note that it is important to Burst-compile the code performing the reordering / batching and the code writing to the EntityCommandBuffer to ensure the best performance.
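
    Along those lines, a rough sketch - the NativeArray overloads on EntityCommandBuffer are the assumed future API, so treat this as pseudocode:
    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Transforms;

    // Burst-compiled recording of pre-batched commands.
    [BurstCompile]
    struct RecordBatchedCommandsJob : IJob
    {
        public NativeArray<Entity> OldWeapons; // already reordered and batched
        public NativeArray<Entity> NewWeapons;
        public EntityCommandBuffer Ecb;

        public void Execute()
        {
            Ecb.DestroyEntity(OldWeapons);               // hypothetical batched overload
            Ecb.AddComponent<Parent>(NewWeapons);        // hypothetical batched overload
            Ecb.AddComponent<LocalToParent>(NewWeapons); // hypothetical batched overload
        }
    }

    // The ECB is then played back immediately on the main thread:
    //   ecb.Playback(EntityManager);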
     
    Last edited: Feb 25, 2021
  10. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Conclusion

    We briefly explored scripting and resource management in DOTS. We also highlighted 4 coding patterns we found useful as we developed our project:
    • Design the Archetypes and EntityQueries to get the most data in contiguous memory - but be careful not to fragment your Chunks too much.
    • Use an appropriate mix of BlobAssets, Prefabs and Entities in Entity Scenes for resource management. DynamicBuffer is useful to map array index (persistent) to Entity Index (not persistent), and helps glue everything together.
    • Structural changes cannot happen at the same time as Jobs are accessing ECS data. Therefore, this period is a great opportunity to schedule Jobs working on memory outside of ECS - for example, acceleration structures built using just Native Collections.
    • Structural changes are relatively slow (by DOTS standards), but performance can be improved by using a Structural Change Handler pattern that restructures the work in a frame to batch common operations.
    In the end, we did manage to get close to 100x speed-up with our character modification system. In the same 27ms that the previous version took to modify one character, our new version managed to modify more than 90 characters.

    We also benchmarked our skinning system. Skinning 1000 characters on a 3.1 GHz Octa-Core Intel Core i9 CPU, our DOTS skinning system achieves a frame time of ~1.8ms, while the equivalent system using Unity’s built-in GPU skinning takes ~6.1ms. The speed-up is mainly due to more parallelism. Comparing total CPU usage, our DOTS skinning system is only slightly more efficient, taking ~5.5ms vs ~6.5ms for Unity’s built-in GPU skinning. Presumably, this smaller gain is because the comparison here is with a built-in feature written in C++ inside Unity, rather than with user-side C# scripts based on MonoBehaviour.

    Finally, let’s remind ourselves that one should not get carried away with premature optimization. If your DOTS project encounters performance bottlenecks with structural changes, hopefully our analysis is helpful for your optimization efforts; otherwise, using EntityCommandBuffer or the EntityManager API “the natural way” may be the best way forward.

    DOTS is a great platform for engineers with an interest in algorithms and data-structures to quickly refactor and optimize programs to their heart's content. We look forward to seeing the technology evolve further, and to leveraging it with more customers to realize their creativity and vision.

    We're looking forward to your feedback and questions. For bigger discussions or topics not immediately related to the content in this post, please start new threads! : )
     
    Last edited: Mar 23, 2021
    tarahugger, LTK, Haneferd and 12 others like this.
  11. Shinyclef

    Shinyclef

    Joined:
    Nov 20, 2013
    Posts:
    505
    What a fantastic write-up. I have read this all a couple of times at least!

    I'm fascinated by your structural change analysis section in particular. My takeaway is that for performance-critical changes, command buffers are not up to the job? Sounds like we should fill our own 'buffers' (NativeQueues etc.) of grouped commands and batch them with EntityManager right next to an existing sync point?

    This, together with your modification of the Entities package, really goes to show how deep the optimisation rabbit hole can go. Are these modifications going to make it into a future release?
     
    victorluishongchau likes this.
  12. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    Very interesting! Changing the skin matrices and meshes from a scatter to a gather is really clever and likely helps a lot in GPU-bound projects (which is nearly every ambitious project I ever tackle). Since you avoided the animation package, what are you using for animation sampling?

    You might be interested in this, where I explored batching using a similar but slightly different approach and saw similar speedups: Custom Command Buffers Optimization Article
     
  13. Shinyclef

    Shinyclef

    Joined:
    Nov 20, 2013
    Posts:
    505
    Wow that was an interesting read. I will be bookmarking that and giving it a second read when I need to look at sync points in future.
     
  14. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    Command buffer is reasonably good, but the approach outlined above is essentially "making our own specialized command buffer", so it is bound to be faster, as long as the "reordering and batching" does not follow criteria so complex that the reordering and batching itself becomes the dominant cost. The price to pay is more engineering complexity - the game may change so that the reorder-and-batch's assumptions no longer hold, and the code then needs to be modified correspondingly. Like other optimizations, it makes the code base somewhat more rigid. : /

    The modification example here inserts user-specific reorder-and-batch logic into the Entities package, so this exact change definitely will not go into a future release. I'm not aware of any plan to let users implement their own custom command buffers either. However, there are plans to allow command buffers to perform batched commands. So the answer for now is going to be as we mentioned at the end of that section:

    Thanks for your nice comment, it means a lot to get a friendly first reply : )
     
    Last edited: Mar 24, 2021
    NotaNaN likes this.
  15. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    There are no animation curves in our project because it is completely physics-driven. It is, however, not uncommon for users to make their own animation curve implementation using BlobAssets. Often the cost of sampling an animation curve is dominated by "searching for the key-frames before and after time T" rather than by the interpolation itself, so optimizing that search is probably the focus of implementations like this. Merging the key frames of all animation curves into a single track ordered by time and advancing incrementally (remembering information from the last frame) might be a good starting point. This does not help random playback at time T, though; for that, some acceleration structure on top of a binary search may be good.
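
    For illustration, a BlobAsset-backed curve with a binary search for the surrounding key pair could be sketched like this (all names hypothetical, linear interpolation only):
    Code (CSharp):
    using Unity.Entities;

    public struct CurveKey
    {
        public float Time;
        public float Value;
    }

    public struct CurveBlob
    {
        // Keys sorted by strictly increasing Time.
        public BlobArray<CurveKey> Keys;

        public float Sample(float t)
        {
            int hi = Keys.Length - 1;
            if (t <= Keys[0].Time) return Keys[0].Value;
            if (t >= Keys[hi].Time) return Keys[hi].Value;

            // Binary search for the key pair surrounding t.
            int lo = 0;
            while (hi - lo > 1)
            {
                int mid = (lo + hi) / 2;
                if (Keys[mid].Time <= t) lo = mid; else hi = mid;
            }

            float u = (t - Keys[lo].Time) / (Keys[hi].Time - Keys[lo].Time);
            return Keys[lo].Value + u * (Keys[hi].Value - Keys[lo].Value);
        }
    }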

    I'll have to read this in more detail, but the optimization adventures blog looks impressive - also, it's cool how you are blogging on GitHub - kudos +1
     
    NotaNaN and DreamingImLatios like this.
  16. Sab_Rango

    Sab_Rango

    Joined:
    Aug 30, 2019
    Posts:
    121
    Hey! This new DOTS tech illustrates the future!
    But I would say Unity should also improve the physics engine together with the ML-Agents team, since creating physics-based ML agent characters requires knowledge of both.
    So I think this ECS tech should be validated alongside the AI technology that is changing games: physics-based AI agents :)
     
    FilmBird and Ruchir like this.
  17. Resshin27

    Resshin27

    Joined:
    Apr 21, 2018
    Posts:
    31
    I am new to the DOTS tech, and this Resource Subscene concept clears up many of my doubts regarding asset management in DOTS.
    BTW, thanks for sharing this illuminating breakdown!
    We need more of these kinds of real-world applications and insights. For a novice like me, articles like this help a lot!!
     
    victorluishongchau likes this.
  18. victorluishongchau

    victorluishongchau

    Unity Technologies

    Joined:
    Feb 7, 2017
    Posts:
    13
    I think currently a lot of physics for AI is done with classic Unity physics, judging for example from this post (Prototype your industrial designs using Unity's new ArticulationBody feature (unity3d.com)). There are indeed experiments with using DOTS for AI, but I am not familiar with the latest news there. I agree that DOTS is in principle a good fit for AI training, especially because performance means faster training. Maybe if you make a separate thread, it will attract the relevant person to give you a more informed reply : )
     
    NotaNaN and Sab_Rango like this.
  19. Knil

    Knil

    Joined:
    Nov 13, 2013
    Posts:
    66
    Sounds like the game Gang Beasts?
     
  20. kite3h

    kite3h

    Joined:
    Aug 27, 2012
    Posts:
    197


    It was created in the same way as above. It's been a while since it was made.
    The background and the Terrain collider are DOTS.
    The skinned renderer also uses the Hybrid Renderer.
    Only the animations use Animator.

    The character controller also modified the code from the Physics samples and synchronized it with the Animator. But I think bone GameObjects are unnecessary; I think they could be removed by transforming the clip.

    I envy you. I can port a game to Unity DOTS by myself, but the support team (Unity Korea) in my country tells my boss 'Do not use DOTS.'

    Anyway, your matrix code is wrong. You must multiply by the Bind Pose matrix and the inverse root matrix for root motion control.
     
    Last edited: May 7, 2021
  21. kite3h

    kite3h

    Joined:
    Aug 27, 2012
    Posts:
    197
    It is not a simple problem. Deep learning consists of training and inference.

    Unity has Barracuda for inference. AI for games is inference, not training.
     
  22. MidnightCow

    MidnightCow

    Joined:
    Jun 2, 2017
    Posts:
    30
    TheVague, mbaker, FlavioIT and 9 others like this.
  23. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    The supporting images are still broken.