
"Sharding" NativeMultiHashMaps for better write performance

Discussion in 'Entity Component System' started by Abbrew, Jan 14, 2019.

  1. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    I've identified writing to NativeMultiHashMap.Concurrent as the bottleneck in my jobs. To address this I've been considering writing a simple ShardedNativeMultiHashMapN container, where N is the number of partitions and each partition is a NativeMultiHashMap. Any key-value pair about to be written hashes its key and takes the result modulo N; that result selects the NativeMultiHashMap it will be stored in. ShardedNativeMultiHashMap5 would look like this:
    Code (CSharp):
    public struct ShardedNativeMultiHashMap5<K, V> : IDisposable where K : struct, IEquatable<K> where V : struct
    {
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache1;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache2;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache3;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache4;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache5;

        public Concurrent ToConcurrent()
        {
            // Return a struct containing Concurrent versions of each NativeMultiHashMap
        }
    }
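    For illustration, the write path of the Concurrent struct returned by ToConcurrent might look roughly like this. It is only a sketch under the assumptions above (five shards, selection by hash modulo N), not tested code:
    Code (CSharp):
    // Sketch only: route each write to one of the five shards by hashing the key.
    public struct Concurrent
    {
        public NativeMultiHashMap<K, V>.Concurrent cache1;
        public NativeMultiHashMap<K, V>.Concurrent cache2;
        public NativeMultiHashMap<K, V>.Concurrent cache3;
        public NativeMultiHashMap<K, V>.Concurrent cache4;
        public NativeMultiHashMap<K, V>.Concurrent cache5;

        public void Add(K key, V value)
        {
            // Mask off the sign bit so the modulus is never negative.
            int shard = (key.GetHashCode() & 0x7fffffff) % 5;
            switch (shard)
            {
                case 0: cache1.Add(key, value); break;
                case 1: cache2.Add(key, value); break;
                case 2: cache3.Add(key, value); break;
                case 3: cache4.Add(key, value); break;
                default: cache5.Add(key, value); break;
            }
        }
    }
    The outer ToConcurrent would then just return a Concurrent wrapping cacheN.ToConcurrent() for each shard.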
    A potential problem is deciding what capacity to give each partition. We could make it K/N, where K is the intended capacity of the overall NativeMultiHashMap. However, if hashing produces lopsided partitions, one partition could hit its capacity well before the others, and at that point you would have to flush that cache. This nasty side effect is inappropriate for many applications but acceptable for, say, a NativeLRUCache implementation (I could post it if you guys want). Flushing data at unpredictable times is a performance issue, not a correctness one.
    What do you guys think? Is this a good idea, or what improvements would you apply?
     
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    I'm kind of surprised by this. How many objects are you adding to your hash?

    My quick tests show I can add
    10k items in 0.14ms
    100k items in 0.8ms
    to a NativeMultiHashMap over 3 cores on an old CPU.
    That's pretty quick for a hashmap to me.

    I'm not saying it won't bottleneck at some point, and I think the multiple locks are the biggest delay (but what choice do you have); it just seems strange that you're adding 100k+ objects to hashmaps every frame, and/or that this is the biggest source of job delay.
     
    Last edited: Jan 14, 2019
    Abbrew likes this.
  3. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Thank you for running some tests. That is very odd; for me, adding 10k items takes around 40-80ms. I must be doing something wrong. Here is my offending job:

    Code (CSharp):
    [BurstCompile]
    private struct RemoveDuplicatesJob : IJob
    {
        [ReadOnly]
        public NativeArray<Request> requests;
        [WriteOnly]
        public NativeHashMap<Request, int>.Concurrent existingRequests;
        [WriteOnly]
        public NativeList<int> nonDuplicateRequests;

        public void Execute()
        {
            int length = requests.Length;
            for (int index = 0; index < length; index++)
            {
                if (existingRequests.TryAdd(requests[index], 0))
                {
                    nonDuplicateRequests.Add(index);
                }
            }
        }
    }
    This takes 80ms for 10k requests. The other time-consuming code segment is in the "adding result" profiler section:
    Code (CSharp):
    Profiler.BeginSample("raycast");
    int numHits = Physics.RaycastNonAlloc(start, ray, hitsBuffer, ray.magnitude, request.collisionMask);
    Profiler.EndSample();

    if (numHits == 0)
    {
        output.AddResult(request, new GetAllHitsCalculationResult());
        Array.Clear(hitsBuffer, 0, hitsBuffer.Length);
        continue;
    }

    int currentHit = 0;
    while (currentHit < numHits && !hitsBuffer[currentHit].point.Equals(default(Vector3)))
    {
        RaycastHit hit = hitsBuffer[currentHit];
        ComponentDataWrapper<Obstacle> obstacleComponent = hit.transform.GetComponent<ComponentDataWrapper<Obstacle>>();

        if (obstacleComponent != null)
        {
            Obstacle obstacle = obstacleComponent.Value;
            Profiler.BeginSample("adding result");
            output.AddResult(request, new GetAllHitsCalculationResult
            {
                point = hitsBuffer[currentHit].point,
                obstacle = obstacle
            });
            Profiler.EndSample();
        }
        else
        {
            Debug.LogWarning("An entity in the obstacle layer does not have an obstacle script!");
        }

        currentHit++;
    }
    Profiler.EndSample();
    This takes 40ms for 10k raycasts. The output variable is of type ResultCache, which uses a NativeLRUCache. Here is what the Set method of NativeLRUCache (used by AddResult in ResultCache) looks like:
    Code (CSharp):
    public void Set(K k, V v)
    {
        cache.Add(k, v);
        int tailValue = System.Threading.Interlocked.Increment(ref *tail) - 1;
        keysPositions.TryAdd(tailValue, k);
        valuesPositions.TryAdd(tailValue, v);
        cacheLinesPositions.TryAdd(k, tailValue);
        // tail.Increment();
    }
    Overall, I'm expecting just 1000 requests per frame at the most. Do you see anything that might be contributing to the massive latency I'm getting?
     
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    A few things

    Code (CSharp):
    [BurstCompile]
    private struct RemoveDuplicatesJob : IJob
    {
        [ReadOnly]
        public NativeArray<Request> requests;
        [WriteOnly]
        public NativeHashMap<Request, int>.Concurrent existingRequests;
        [WriteOnly]
        public NativeList<int> nonDuplicateRequests;

        public void Execute()
        {
            int length = requests.Length;
            for (int index = 0; index < length; index++)
            {
                if (existingRequests.TryAdd(requests[index], 0))
                {
                    nonDuplicateRequests.Add(index);
                }
            }
        }
    }
    Why are you using Concurrent when you are not running it in parallel?
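    (For a single-threaded IJob the plain container can be written to directly; a minimal sketch, reusing the field name from the job above:)
    Code (CSharp):
    // No .Concurrent needed when only one thread writes: TryAdd works on the map itself.
    public NativeHashMap<Request, int> existingRequests;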

    I notice you're also using NativeHashMap instead of the NativeMultiHashMap your original post mentioned, so I quickly ran the tests again, and it's still only taking 0.14-0.16ms to add 10k elements to the hashmap.

    I then replicated your job on a single thread and benched it; it only took ~0.29ms for 10k entities.

    The only thing I can think of is that your IEquatable on Request is very slow.

    -edit- screenshots

    10k entities (screenshot)

    Time to execute job (screenshot)
     
    Abbrew likes this.
  5. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Thank you. Here's the IEquatable implementation on both MapNodeRequest and GetAllHitsRequest.
    Code (CSharp):
    public struct GetAllHitsRequest : IEquatable<GetAllHitsRequest>
    {
        public float3 start;
        public float3 end;
        public LayerMask collisionMask;

        public bool Equals(GetAllHitsRequest other)
        {
            return end.CloseTo(other.end, 01f)
                && start.CloseTo(other.start, 0.1f)
                && collisionMask.value == other.collisionMask.value;
        }

        //public override int GetHashCode()
        //{
        //    int result = 17;
        //    result = 37 * result + collisionMask.value;
        //    result = 37 * result + start.GetHashCode();
        //    result = 37 * result + end.GetHashCode();
        //    return result;
        //}
    }
    Code (CSharp):
    public struct MapNodeRequest : IEquatable<MapNodeRequest>
    {
        public float2 location;

        public bool Equals(MapNodeRequest other)
        {
            return location.CloseTo(other.location, 0.1f);
        }

        //public override int GetHashCode()
        //{
        //    return location.GetHashCode();
        //}
    }
    Here's the implementation for CloseTo. It's the same for float2 and float3.
    Code (CSharp):
    public static bool CloseTo(this float f1, float other, float marginOfError)
    {
        return Mathf.Abs(f1 - other) < marginOfError;
    }
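    (For reference, the float3 overload would presumably look something like the following; this is an assumption based on "it's the same for float2 and float3", using Unity.Mathematics:)
    Code (CSharp):
    public static bool CloseTo(this float3 f1, float3 other, float marginOfError)
    {
        // Component-wise |f1 - other| < margin; true only if all three components pass.
        return math.all(math.abs(f1 - other) < marginOfError);
    }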
    Sorry I'm making you do all the testing. If there is any testing info you need to better pinpoint the problem, please tell me and I'll do it
     
    Last edited: Jan 14, 2019
  6. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Random note

    return end.CloseTo(other.end, 01f)

    That's meant to be 0.1f.
     
    Abbrew likes this.
  7. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Thanks for catching that. Don't want unique requests to be considered the same! Here's the root of the issue:

    Code (CSharp):
    public struct Concurrent
    {
        [WriteOnly]
        private NativeMultiHashMap<K, V>.Concurrent cache;
        [WriteOnly]
        private NativeHashMap<K, int>.Concurrent cacheLinesPositions;
        [WriteOnly]
        private NativeHashMap<int, K>.Concurrent keysPositions;
        [WriteOnly]
        private NativeHashMap<int, V>.Concurrent valuesPositions;

        [NativeDisableUnsafePtrRestriction]
        private int* tail;

        public Concurrent(
            NativeMultiHashMap<K, V>.Concurrent cache,
            NativeHashMap<int, K>.Concurrent keysPositions,
            NativeHashMap<int, V>.Concurrent valuesPositions,
            NativeHashMap<K, int>.Concurrent cacheLinesPositions,
            int* tail)
        {
            this.cache = cache;
            this.keysPositions = keysPositions;
            this.valuesPositions = valuesPositions;
            this.cacheLinesPositions = cacheLinesPositions;
            this.tail = tail;
        }

        public void Set(K k, V v)
        {
            Profiler.BeginSample("Writing to cache");
            cache.Add(k, v);
            int tailValue = System.Threading.Interlocked.Increment(ref *tail) - 1;
            keysPositions.TryAdd(tailValue, k);
            valuesPositions.TryAdd(tailValue, v);
            cacheLinesPositions.TryAdd(k, tailValue);
            Profiler.EndSample();
        }
    }
    Set(K, V) is taking 47ms for 1000 requests. At first I chalked it up to accessing four different concurrent hashmaps, but similar operations on normal hashmaps are also taking this long.
     
    Last edited: Jan 14, 2019
  8. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Run this

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Jobs;

    public class NativeHashMapSystem : JobComponentSystem
    {
        private NativeHashMap<Entity, int> map;
        private NativeList<Entity> dupe;

        /// <inheritdoc />
        protected override void OnCreateManager()
        {
            const int count = 10000;

            this.map = new NativeHashMap<Entity, int>(count, Allocator.Persistent);
            this.dupe = new NativeList<Entity>(Allocator.Persistent);

            var array = new NativeArray<Entity>(count, Allocator.Temp);
            this.EntityManager.CreateEntity(this.EntityManager.CreateArchetype(typeof(Test)), array);
            array.Dispose();
        }

        /// <inheritdoc />
        protected override void OnDestroyManager()
        {
            this.map.Dispose();
            this.dupe.Dispose();
        }

        /// <inheritdoc />
        protected override JobHandle OnUpdate(JobHandle handle)
        {
            this.map.Clear();
            this.dupe.Clear();

            var job = new AddTest
            {
                HashMap = this.map.ToConcurrent(),
                nonDuplicateRequests = this.dupe,
            };

            return job.ScheduleSingle(this, handle);
        }

        [BurstCompile]
        private struct AddTest : IJobProcessComponentDataWithEntity<Test>
        {
            public NativeHashMap<Entity, int>.Concurrent HashMap;

            [WriteOnly]
            public NativeList<Entity> nonDuplicateRequests;

            /// <inheritdoc />
            public void Execute(Entity entity, int index, [ReadOnly] ref Test data)
            {
                if (this.HashMap.TryAdd(entity, index))
                {
                    this.nonDuplicateRequests.Add(entity);
                }
            }
        }

        public struct Test : IComponentData
        {
        }
    }
    Tell me how long it takes.
     
  9. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    11ms on the first frame, 0.23ms every frame afterwards.
     
  10. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    The 11ms is going to be the list resizing.

    Are you recreating the nonDuplicateRequests list every frame?
     
  11. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    If it helps, I have [BurstCompile] commented out for each job that writes to a NativeHashMap, since Burst doesn't like TryGetValue returning a non-blittable bool.
     
  12. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Yes. In my project the allocation takes 0.11ms.
     
  13. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Oh. If you don't have BurstCompile it's going to be 5-10x slower.

    Just comparing the example above, I go from 0.28ms to 2.37ms
     
    Abbrew likes this.
  14. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Don't do that; just cache it. If you really have to recreate it for some reason, pre-allocate some large capacity, otherwise it's going to be really slow because it has to keep reallocating.
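    (A minimal sketch of what that looks like; the capacity number is arbitrary:)
    Code (CSharp):
    // Allocate once with a generous capacity, then reuse it every frame.
    nonDuplicateRequests = new NativeList<int>(16 * 1024, Allocator.Persistent);

    // Per frame: clear instead of recreating, so nothing reallocates
    // unless the list actually outgrows its capacity.
    nonDuplicateRequests.Clear();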
     
  15. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    @tertle I took a look at the Equals method. When I implement it as
    Code (CSharp):
    return true;
    the system takes 1ms to run. That's still an order of magnitude higher than when Entity is the key, but better than a correct implementation.
    Code (CSharp):
    return location.x == other.location.x &&
        location.y == other.location.y;
    Now the system takes 8.5ms to run.
    Code (CSharp):
    return location.Equals(other.location);
    Now the system takes 10ms to run!
    I think the problem lies elsewhere: even when Equals is implemented as
    Code (CSharp):
    return true;
    the hashmap still runs 10x slower.
     
  16. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Are you able to provide the whole system / update method so I can test it myself?
     
  17. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Sort of solved the issue. The reason the hashmap operations were so slow was the GetHashCode implementation. Let's use GetAllHitsRequest as an example:
    Code (CSharp):
    public struct GetAllHitsRequest : IEquatable<GetAllHitsRequest>
    {
        public float3 start;
        public float3 end;
        public LayerMask collisionMask;

        public bool Equals(GetAllHitsRequest other)
        {
            return end.CloseTo(other.end, 01f)
                && start.CloseTo(other.start, 0.1f)
                && collisionMask.value == other.collisionMask.value;
        }

        //public override int GetHashCode()
        //{
        //    int result = 17;
        //    result = 37 * result + collisionMask.value;
        //    result = 37 * result + start.GetHashCode();
        //    result = 37 * result + end.GetHashCode();
        //    return result;
        //}
    }
    float3.GetHashCode presumably computes the hash values of three floats, and each float hash computation takes an obscene amount of time. By casting the float3 to an int3 first, I was able to reduce the latency 7x.

    Code (CSharp):
    public struct GetAllHitsRequest : IEquatable<GetAllHitsRequest>
    {
        public readonly float3 start;
        public readonly float3 end;
        public readonly LayerMask collisionMask;

        public GetAllHitsRequest(
            float3 start,
            float3 end,
            LayerMask collisionMask)
        {
            this.start = start;
            this.end = end;
            this.collisionMask = collisionMask;
        }

        public bool Equals(GetAllHitsRequest other)
        {
            //return true;
            return end.Equals(other.end)
                && start.Equals(other.start)
                && collisionMask.value == other.collisionMask.value;
            //return end.CloseTo(other.end, 0.1f)
            //&& start.CloseTo(other.start, 0.1f)
            //&& collisionMask.value == other.collisionMask.value;
        }

        public override int GetHashCode()
        {
            int result = 17;
            result = 37 * result + collisionMask.value;
            result = 37 * result + ((int3)start).GetHashCode();
            result = 37 * result + ((int3)end).GetHashCode();
            return result;
        }
    }
    @tertle Here's the link to the GetAllHits system. https://github.com/KontosTwo/DidymosECS/tree/master/Assets/ECS/Environment/Physics/GetAllHits
    Thank you so much for the help along the way
     
  18. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    I still think something is adding latency. Adding 1000 elements now takes 1.5ms instead of something like 0.15ms, which is the range it should be in.
     
  19. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Makes sense that the hash code is the bottleneck.

    Entity.GetHashCode just returns the entity's index:

    Code (CSharp):
    public override int GetHashCode()
    {
        return Index;
    }
    So it's extremely quick by comparison. I don't really have any great advice on optimizing the method (except you should probably wrap it in unchecked, because overflow is fine).

    I guess you could make use of knowledge of your data: if your game area is size-limited you could exploit that, or if your start locations are always the same you could drop them from the hash.
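    (As a rough sketch, wrapping the hash from the earlier post in unchecked would look like this; purely illustrative, not benchmarked:)
    Code (CSharp):
    public override int GetHashCode()
    {
        // unchecked lets the multiplications wrap silently even in a checked
        // build; a hash only needs to mix bits, so overflow is harmless here.
        unchecked
        {
            int result = 17;
            result = 37 * result + collisionMask.value;
            result = 37 * result + ((int3)start).GetHashCode();
            result = 37 * result + ((int3)end).GetHashCode();
            return result;
        }
    }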
     
    Last edited: Jan 15, 2019
    sngdan and Abbrew like this.
  20. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    3,356
    Didn't take time to relate the last bits of code to the original problem. But it seems to me there has to be a way to just avoid creating duplicates to start with. Do the check where you insert and then just don't insert.

    You need to pass a specific layermask to a raycast, so the layermask is a natural partition point there. If it were me I'd just future-proof the thing by actually using RaycastCommand to hold your data. If you care about performance, the next natural step is jobified raycasts anyway.
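    (For reference, a minimal jobified-raycast sketch using RaycastCommand.ScheduleBatch; the request arrays and counts here are hypothetical, and maxHits is left at its default of 1:)
    Code (CSharp):
    // Build one RaycastCommand per request, schedule them as a batch, then read the hits.
    var commands = new NativeArray<RaycastCommand>(requestCount, Allocator.TempJob);
    var results = new NativeArray<RaycastHit>(requestCount, Allocator.TempJob);

    for (int i = 0; i < requestCount; i++)
    {
        // starts/directions/distances/layerMask would come from your request data.
        commands[i] = new RaycastCommand(starts[i], directions[i], distances[i], layerMask);
    }

    JobHandle handle = RaycastCommand.ScheduleBatch(commands, results, 32);
    handle.Complete();

    // results[i] holds the hit for commands[i]; a null collider means no hit.
    // Read the hits you need, then dispose both arrays.
    commands.Dispose();
    results.Dispose();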
     
  21. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    RaycastCommand unfortunately has a bug where it only hits one collider regardless of what you specify for the maxHits parameter. I tried implementing my own RaycastCommandMultipleHits function, but it was always much slower than plain old Physics.Raycast.
     
  22. MintTree117

    MintTree117

    Joined:
    Dec 2, 2018
    Posts:
    340
    Do you really get those numbers? 200k inserts takes 5ms single-threaded and 8ms in parallel for me...
     
  23. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    It's a good question because I find trying to replicate the test the results are not nearly as good than when i posted that all that time back.