Search Unity

[See proposed memclear solution] Clear() on large NativeMultiHashMaps is causing performance issues.

Discussion in 'Data Oriented Technology Stack' started by Mr-Mechanical, Feb 12, 2019.

  1. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    Hi,

    I'm developing an algorithm which heavily relies on NativeMultiHashMaps (at least somewhat similar to BoidsSystem use case).

    The main concern is that the HashMap has to be reset every frame which causes a major performance obstacle when using Clear(). This issue only applies to large hashmaps. Perhaps this doesn't have to be the case.

    Any suggestions for a workaround on clearing hashmaps? I've tried disposing and reallocating with Allocator.TempJob and it is understandably less optimal for performance than Clear(). Is Clear() doing something significant with regards to performance? How shall I address this issue? Is there another path to "resetting" the HashMap without affecting performance?

    Thanks, the advice is appreciated!
     
    GliderGuy likes this.
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    Clear requires iterating the entire hashmap so there are definitely performance issues for large maps.
    Allocating a hashmap also calls Clear() so it's going to have the same issues and just adds an extra allocation on top of Clear.

    I'm unaware of any workarounds for large hashmaps.
     
    Mr-Mechanical likes this.
  3. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    @tertle - I never used you event system, but if I understand it right, I could create an entity for each "Key" with a buffer array for the values?

    Q1: would that be any faster?
    Q2: would it introduce 1-frame delay?
     
    Mr-Mechanical likes this.
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    No idea if that would be faster, it's an interesting use I haven't considered.
    They are usually 1 frame delayed, but you could keep it same frame by executing after the EndBarrierSystem
     
    Last edited: Feb 12, 2019
    Mr-Mechanical likes this.
  5. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    If you've ever looked at source, it's pretty obvious what the problem is

    Code (CSharp):
    1.         public static unsafe void Clear(NativeHashMapData* data)
    2.         {
    3.             int* buckets = (int*) data->buckets;
    4.             for (int i = 0; i <= data->bucketCapacityMask; ++i)
    5.                 buckets[i] = -1;
    6.             int* nextPtrs = (int*) data->next;
    7.             for (int i = 0; i < data->keyCapacity; ++i)
    8.                 nextPtrs[i] = -1;
    9.             for (int tls = 0; tls < JobsUtility.MaxJobThreadCount; ++tls)
    10.                 data->firstFreeTLS[tls * NativeHashMapData.IntsPerCacheLine] = -1;
    11.             data->allocatedIndexLength = 0;
    12.         }
    It needs to allocate everything to -1 to mark it not used.
    If you could make 0 the unused state you could just memclear everything and it'd be super fast. I'm not sure what the consequences are but for example you could start the arrays at 1 or offset when indexing by -1.
     
  6. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    I had a quick glance at the source but I don't have the skill & patience to fiddle through it, since I only do this as a hobby. But it seemed like with some effort one could memcpy the values out of the buckets...

    i.e. i store collision candidates (values) in a spacial grid (gird = key) and i need to get this into native arrays per grid to iterate pair by pair. currently i just unroll by iterating through all keys/values in a job.
     
    Mr-Mechanical likes this.
  7. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    @Joachim_Ante This seems like a cool idea, do you think you guys can optimize hashmap Clear()?
     
  8. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    I tried my idea from above, but without using the event system from @tertle

    I basically replaced my nativemultihashmap with the "sngdan entity->buffer dictionary" - it seems to be faster in my use case, although I can not write to the buffer concurrently, which I could to the hashmap.

    Will test this more and if there is something interesting to report post it here
     
    Mr-Mechanical likes this.
  9. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    forgot to mention earlier - besides the speed increase, it is also nice that you can watch the "Dictionary" contents in the entity debugger.
     
    Mr-Mechanical likes this.
  10. jooleanlogic

    jooleanlogic

    Joined:
    Mar 1, 2018
    Posts:
    332
    Do you know just how big it is? Are we talking megabytes? I'm surprised this is so slow when it's just iterating a couple of arrays.

    Would MemSet make any difference or is zero a special case that gets optimised?
    Code (CSharp):
    1. //int* buckets = (int*) data->buckets;
    2. //for (int i = 0; i <= data->bucketCapacityMask; ++i)
    3. //    buckets[i] = -1;
    4. UnsafeUtilityEx.MemSet(data->buckets, 0xFF, data->bucketCapacityMask * sizeof(int));
    Curious if replacing those two loops with MemSet would make any difference.
     
    Mr-Mechanical likes this.
  11. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    @Mr-Mechanical

    Try the approach that I suggested, there is a good chance that you benefit from this as well.
    • I have not checked at what point it allocates on the heap (so I guess, there will be a speed penalty at a certain amount of values stored per "key") - I have to read up on this at one point
    • Clearing the buffer is fast ( m_Buffer->Length = 0; )
    • It's fast to access. I.e. you can just get all values per key as a native array .AsNativeArray() -- no copying of data
    • Considerations: You cannot write from different threads / in parallel and you might have to complete the job that fills the "Dictionary" (I had to, because I parallel process the values per key later)

    edit: I have now 50,000 random AABBs moving, colliding, color change on collision, rendering (with a bare bone custom render system that used MPB for the colors) on my IMac 3.3GHz (4 cores) at 60fps
     
    Mr-Mechanical likes this.
  12. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    I am fascinated with your approach. However I am curious, how would you control the values of each "key"?
     
  13. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    I am not sure, I understand you question. What do you mean control?

    edit: I clear the buffers each frame, before I write to them again - similar to what you would also do with the hashmap

    Code (CSharp):
    1. // MultiHashMap
    2. public struct CollisionInfo: IComponentData
    3. {
    4.     public Entity entity;
    5.     public Box box;
    6. }
    7.  
    8. // in the job
    9.         [WriteOnly]    public NativeMultiHashMap<int, CollisionInfo>.Concurrent entitiesPerGridMultiHashMap;
    10.                         entitiesPerGridMultiHashMap.Add(key, new CollisionInfo {entity = e, box = box}); // key is an int referring to a grid position
    11.  
    12. // "Entity Dictionary" or whatever you want to call it
    13. public struct CollisionInfoBuffer : IBufferElementData
    14. {
    15.     public Entity entity;
    16.     public Box box;
    17. }
    18.  
    19. // in the job
    20.         [WriteOnly] public BufferArray<CollisionInfoBuffer> keyArray;
    21. keyArray[key].Add(new CollisionInfoBuffer{entity = e, box = box}); // key = as above
    22.  
     
    Mr-Mechanical likes this.
  14. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    Why don’t you describe a little better, what exactly you are trying to do and if I have an idea how to approach it, I will let you know.
     
    Mr-Mechanical likes this.
  15. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    Ok, that makes sense. Initially, I thought you were creating many entities with dynamic buffers to represent key/values. I have made sense of your implementation based on your source provided. Only concern is the potential hash collisions but that isn't a problem for use in some algorithms.

    I've tested this replacement for Clear() using memset, however, unfortunately, according to my tests it has not improved the performance:

    Code (CSharp):
    1.  
    2.             UnsafeUtilityEx.MemSet(data->buckets, 0xFF, data->bucketCapacityMask * sizeof(int));
    3.             UnsafeUtilityEx.MemSet(data->next, 0xFF, data->keyCapacity * sizeof(int));
    4.             for (int tls = 0; tls < JobsUtility.MaxJobThreadCount; ++tls)
    5.                 data->firstFreeTLS[tls * NativeHashMapData.IntsPerCacheLine] = -1;
    6.             data->allocatedIndexLength = 0;
    However, very I'm curious if MemClear however improves performance. I attempted to write a MemClear version (see below) however it crashes in builds and in editor testing. @tertle Any tips on how to use MemClear properly? I have never used MemClear before.

    Code (CSharp):
    1.  UnsafeUtility.MemClear(data->buckets, data->bucketCapacityMask * sizeof(int));
    2.             UnsafeUtility.MemClear(data->next, data->keyCapacity * sizeof(int));
    3.             for (int tls = 0; tls < JobsUtility.MaxJobThreadCount; ++tls)
    4.                 data->firstFreeTLS[tls * NativeHashMapData.IntsPerCacheLine] = -1;
    5.             data->allocatedIndexLength = 0;
    This may be a general problem and not specific to this implementation:
    https://bugs.openjdk.java.net/browse/JDK-4343436

    I may consider creating something similar to a HashMap but very focused on efficiently clearing and reuse.

    Thank you all for the suggestions! It means a lot.
     
  16. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    I create many entities with dynamic buffers.
    KeyArray[key] -> the keyarray holds the different entities and I pick the one with index key to store a value
     
    Mr-Mechanical likes this.
  17. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    You can't just use memclear, the nativehashmap expects the values to be -1 and won't work unless they are initialized to -1. You need to redesign the entire hashmap to expect 0 to be default.
     
    GliderGuy and Mr-Mechanical like this.
  18. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    I see. Do you think it would be somehow possible to test a broken hashmap using memclear for performance gains? I want to see if memclear would make the difference before considering an attempt to redesign on my own.



    Currently, I have a test comparing Clear() of a Persistent Hashmap and Dispose of a TempJob NativeArray. Here is the size of each when the performance is the same:

    NativeArray: 40000
    NativeMultiHashMap: 10000

    Could this 4x difference be natural as hashmap Clear() has to do more work? Or could there be some potential for optimization? This discussion so far has been great. Thanks. I will continue to look further into this as well.
     
  19. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    1,666
    Just create a simple test of iterating 3 large sets of data and setting them to -1 and then compare it to 3 memclears of the same length.
     
    Mr-Mechanical likes this.
  20. RecursiveEclipse

    RecursiveEclipse

    Joined:
    Sep 6, 2018
    Posts:
    102
    I'm working on a NativeHashSet based on NativeHashMap and got close enough done today to be able to test this. For now I'm using memclear to 0, then point all array set/get indexers to a function that either adds/subtracts for set/get respectively. Still haven't touched Reallocate() yet, could use memclear there also.

    Clearing then adding 10000 ints per frame, before memclear(.135-.165ms):


    After(.012-.018ms):



    NativeHashMap has 4 arrays while a NativeList only needs 1 so a ~4x hit compared to NativeList makes sense.
     
    Last edited: Feb 14, 2019
    GilCat and Mr-Mechanical like this.
  21. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    Most of the time, I post from mobile like now. I could post an example later.

    @Mr-Mechanical
    Do you have pre-determined keys / hashes or do you need those flexible in (a) number and (b) key value? Is the key value of importance to you later or is is just for grouping?
     
    Mr-Mechanical likes this.
  22. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    Code (CSharp):
    1.  
    2. using Unity.Burst;
    3. using Unity.Collections;
    4. using Unity.Entities;
    5. using Unity.Jobs;
    6. using Unity.Mathematics;
    7.  
    8. namespace EntityHashMapExample
    9. {  
    10.     public struct DummyData: IComponentData
    11.     {
    12.         public int Value;
    13.     }
    14.    
    15.     [InternalBufferCapacity(16)]
    16.     public struct EntityHashMapValues : IBufferElementData
    17.     {
    18.         public Entity entity;
    19.     }
    20.    
    21.     [UpdateBefore(typeof(EntityHashMapJobSystem))]
    22.     public class SetDummyDataSystem: ComponentSystem
    23.     {
    24.         protected override void OnCreateManager()
    25.         {
    26.             for (int i = 0; i < EntityHashMapJobSystem.numberOfValues; i++)
    27.             {
    28.                 EntityManager.CreateEntity(ComponentType.Create<DummyData>());
    29.             }
    30.         }
    31.        
    32.         protected override void OnUpdate()
    33.         {
    34.             ForEach( (ref DummyData data) =>
    35.             {
    36.                 data.Value = UnityEngine.Random.Range(0,EntityHashMapJobSystem.numberOfKeys);
    37.  
    38.             }, GetComponentGroup(typeof(DummyData)));
    39.         }
    40.     }
    41.    
    42.     public class EntityHashMapJobSystem : JobComponentSystem
    43.     {
    44.         public ComponentGroup key_Group;
    45.         public ComponentGroup value_Group;  
    46.         public const int numberOfKeys = 1000;
    47.         public const int numberOfValues = 50000;
    48.        
    49.         [BurstCompile]
    50.         private struct ClearEntityHashMapJob : IJobParallelFor
    51.         {
    52.             [NativeDisableParallelForRestriction][WriteOnly] public BufferArray<EntityHashMapValues> keyEntities;
    53.        
    54.             public void Execute(int i)
    55.             {  
    56.                 keyEntities[i].Clear();
    57.             }
    58.         }
    59.        
    60.         [BurstCompile]
    61.         private struct FillEntityHashMapJob : IJobProcessComponentDataWithEntity<DummyData>
    62.         {
    63.             [WriteOnly] public BufferArray<EntityHashMapValues> keyEntities;
    64.        
    65.             public void Execute(Entity e, int i, [ReadOnly] ref DummyData data)
    66.             {  
    67.                 var key = data.Value;
    68.                 keyEntities[key].Add(new EntityHashMapValues{entity = e});
    69.             }
    70.         }
    71.  
    72.         protected override void OnCreateManager()
    73.         {
    74.             for (int i = 0; i < numberOfKeys; i++)
    75.             {
    76.                 EntityManager.CreateEntity(ComponentType.Create<EntityHashMapValues>());
    77.             }
    78.             key_Group = GetComponentGroup(ComponentType.Create<EntityHashMapValues>());
    79.         }
    80.    
    81.         protected override JobHandle OnUpdate(JobHandle inputDependencies)
    82.         {
    83.             var valuesArray = key_Group.GetBufferArray<EntityHashMapValues>();
    84.             inputDependencies = new ClearEntityHashMapJob
    85.             {
    86.                 keyEntities = valuesArray
    87.             }.Schedule(valuesArray.Length, 8, inputDependencies);
    88.            
    89.             inputDependencies = new FillEntityHashMapJob
    90.             {
    91.                 keyEntities = valuesArray
    92.             }.ScheduleSingle(this, inputDependencies); // one thread only ! (because, I can not gurantee not to write from multiple threads to same key)
    93.            
    94.             inputDependencies.Complete();
    95.            
    96.             // you can open the entity inspector and watch them
    97.             // get all the values stored in "key 0" as a native array ! (note: key is the position in array - my system simplifies the key part, can be done differently)
    98.             var valuesOfKeyAsNativeArray = valuesArray[0].AsNativeArray(); // this is a pointer to the buffer, nothing is copied
    99.  
    100.             return inputDependencies;
    101.         }
    102.     }
    103. }
    104.  
     
    Mr-Mechanical likes this.
  23. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    This is a very impressive (10 fold) performance improvement you have implemented here, it's clear memclear would make the difference in this case. I hope a similar optimization is made in the official MultiNativeHashMap. Thank you for sharing.
     
    GliderGuy likes this.
  24. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    the buffer solution removes the clearing bottleneck all together, you wont be able to fill it fast enough :)
     
  25. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    This is interesting and thank you for the suggestion. Though unfortunately, being able to add to the hashmap from multiple threads is an absolute must for me.
     
    sngdan likes this.
  26. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    Parallel only works, if you can ensure not to write to the same hash from multiple threads (like, the clear job in my example)

    In my collision system, I have tried 4 different approaches and funnily my first, fully parallel, choice was not the fastest (which I thought it would) - the problem is that I only test on my computer (4 cores) and I don’t know how it would behave on different hardware.

    The memclear proposal seems great, optimizing one bottleneck, hopefully there is also one for getting the values out into an array (instead of unrolling sequentially)
     
    Mr-Mechanical likes this.
  27. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    @Mr-Mechanical

    I updated to the new ECS version and as if by magic, it seems parallel write now works with my entity buffer dictionary (must be thread safe now) - i did not see this in the release notes.

    Have to test this thoroughly in the evening - maybe I am just hallucinating...
     
  28. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    @tertle - did you not maintain a diff or so? can you see anything there?


    False alert - it seems like it is still not working.
    - I believe that Unity used to crash or did not compile before
    - I can now schedule a job that was previous scheduled as "single" in "parallel" and it mostly works, although I get console errors from time to time with index out of range (it almost looks like the atomic write to the buffer works but the buffer length/capacity is not adjusted across threads)
    - I guess at one point I have to dig into this or make a simple test case to figure out what is happening
     
    Last edited: Feb 19, 2019
  29. GilCat

    GilCat

    Joined:
    Sep 21, 2013
    Posts:
    417
    I'm having the same problem and what i did was to clear the NativeMultiHashMap inside a job with burst:
    Code (CSharp):
    1.   [BurstCompile]
    2.   public struct ClearNativeHashMap<TKey, TValue> : IJob
    3.     where TKey : struct, IEquatable<TKey>
    4.     where TValue : struct {
    5.     [WriteOnly]
    6.     public NativeMultiHashMap<TKey, TValue> MultiHashMap;
    7.  
    8.     public void Execute() {
    9.       MultiHashMap.Clear();
    10.     }
    11.   }
     
    alexzzzz and Mr-Mechanical like this.
  30. Mr-Mechanical

    Mr-Mechanical

    Joined:
    May 31, 2015
    Posts:
    191
    I figured out how to make my hashmap size smaller so Clear() is less of a dramatic problem now. I am interested in your approach. Are you seeing notable gains just by having a burst compiled job or is the gains from not running on the main thread?
     
  31. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    918
    With the help of @RecursiveEclipse, it is possible to write to the buffers from a parallel job (with interlock waiting in case of collision) - in my test case with similar of slightly better performance than single.

    I have a number of versions now with queues, multihashmap, buffers and varying degrees of parallelism.

    It has been fun but I reached a level of micro optimization, that i believe is now hardware specific (ie dependent on cores, etc)

    My preference and good performance is the buffer solution, mainly because of the ease of access to the stored values (asnative array) and the fast clear
     
    Mr-Mechanical likes this.
  32. Razmot

    Razmot

    Joined:
    Apr 27, 2013
    Posts:
    262
    Could you use two hashmaps, switch the active one, clear the inactive one later, like a double buffering pattern ?