
"Sharding" NativeMultiHashMaps for better write performance

Discussion in 'Entity Component System' started by Abbrew, Jan 14, 2019.

  1. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    I've identified writing to NativeMultiHashMap.Concurrent as the bottleneck in my jobs. To address this I've been considering writing a simple ShardedNativeMultiHashMapN container, where N is the number of partitions and each partition is a NativeMultiHashMap. Any key-value pair about to be written hashes its key and takes the result modulo N; that result selects the NativeMultiHashMap it will be stored in. ShardedNativeMultiHashMap5 would look like this:
    Code (CSharp):
    public struct ShardedNativeMultiHashMap5<K, V> : IDisposable where K : struct, IEquatable<K> where V : struct
    {
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache1;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache2;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache3;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache4;
        [ReadOnly]
        public NativeMultiHashMap<K, V> cache5;

        public Concurrent ToConcurrent()
        {
            // Return a struct containing Concurrent versions of each NativeMultiHashMap
        }
    }
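    For illustration, the write path of the Concurrent struct returned by ToConcurrent might look roughly like this. It is only a sketch under the assumptions above (five shards, selection by hash modulo N), not tested code:
    Code (CSharp):
    // Sketch only: route each write to one of the five shards by hashing the key.
    public struct Concurrent
    {
        public NativeMultiHashMap<K, V>.Concurrent cache1;
        public NativeMultiHashMap<K, V>.Concurrent cache2;
        public NativeMultiHashMap<K, V>.Concurrent cache3;
        public NativeMultiHashMap<K, V>.Concurrent cache4;
        public NativeMultiHashMap<K, V>.Concurrent cache5;

        public void Add(K key, V value)
        {
            // Mask off the sign bit so the modulus is never negative.
            int shard = (key.GetHashCode() & 0x7fffffff) % 5;
            switch (shard)
            {
                case 0: cache1.Add(key, value); break;
                case 1: cache2.Add(key, value); break;
                case 2: cache3.Add(key, value); break;
                case 3: cache4.Add(key, value); break;
                default: cache5.Add(key, value); break;
            }
        }
    }
    The outer ToConcurrent would then just return a Concurrent wrapping cacheN.ToConcurrent() for each shard.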
    A potential problem is deciding what capacity to give each partition. We could make it K/N, where K is the intended capacity of the overall NativeMultiHashMap. However, if hashing produces lopsided partitions, one partition could hit its capacity well before the others, and at that point you would have to flush that cache. This nasty side effect is inappropriate for many applications but acceptable for, say, a NativeLRUCache implementation (I could post it if you guys want). Flushing data at unpredictable times is a performance issue, not a correctness one.
    What do you guys think? Is this a good idea, or what improvements would you apply?
     
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    I'm kind of surprised by this. How many objects are you adding to your hash?

    My quick tests show I can add
    10k items in 0.14ms
    100k items in 0.8ms
    to a NativeMultiHashMap over 3 cores on an old CPU.
    That's pretty quick for a hashmap to me.

    I'm not saying it won't bottleneck at some point, and I think the multiple locks are the biggest delay (but what choice do you have); it just seems strange that you're adding 100k+ objects to hashmaps every frame, and/or that this is the biggest source of job delay.
     
    Last edited: Jan 14, 2019
    Abbrew likes this.
  3. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Thank you for running some tests. That is very odd; for me, adding 10k items takes around 40-80ms. I must be doing something wrong. Here is my offending job:

    Code (CSharp):
    [BurstCompile]
    private struct RemoveDuplicatesJob : IJob
    {
        [ReadOnly]
        public NativeArray<Request> requests;
        [WriteOnly]
        public NativeHashMap<Request, int>.Concurrent existingRequests;
        [WriteOnly]
        public NativeList<int> nonDuplicateRequests;

        public void Execute()
        {
            int length = requests.Length;
            for (int index = 0; index < length; index++)
            {
                if (existingRequests.TryAdd(requests[index], 0))
                {
                    nonDuplicateRequests.Add(index);
                }
            }
        }
    }
    This takes 80ms for 10k requests. The other time-consuming code segment is in the "adding result" profiler section:
    Code (CSharp):
    Profiler.BeginSample("raycast");
    int numHits = Physics.RaycastNonAlloc(start, ray, hitsBuffer, ray.magnitude, request.collisionMask);
    Profiler.EndSample();

    if (numHits == 0)
    {
        output.AddResult(request, new GetAllHitsCalculationResult());
        Array.Clear(hitsBuffer, 0, hitsBuffer.Length);
        continue;
    }

    int currentHit = 0;
    while (currentHit < numHits && !hitsBuffer[currentHit].point.Equals(default(Vector3)))
    {
        RaycastHit hit = hitsBuffer[currentHit];
        ComponentDataWrapper<Obstacle> obstacleComponent = hit.transform.GetComponent<ComponentDataWrapper<Obstacle>>();

        if (obstacleComponent != null)
        {
            Obstacle obstacle = obstacleComponent.Value;
            Profiler.BeginSample("adding result");
            output.AddResult(request, new GetAllHitsCalculationResult
            {
                point = hitsBuffer[currentHit].point,
                obstacle = obstacle
            });
            Profiler.EndSample();
        }
        else
        {
            Debug.LogWarning("An entity in the obstacle layer does not have an obstacle script!");
        }

        currentHit++;
    }
    Profiler.EndSample();
    This takes 40ms for 10k raycasts. The output variable is of type ResultCache, which uses a NativeLRUCache. Here is what the Set method of NativeLRUCache (used by AddResult in ResultCache) looks like:
    Code (CSharp):
    public void Set(K k, V v)
    {
        cache.Add(k, v);
        int tailValue = System.Threading.Interlocked.Increment(ref *tail) - 1;
        keysPositions.TryAdd(tailValue, k);
        valuesPositions.TryAdd(tailValue, v);
        cacheLinesPositions.TryAdd(k, tailValue);
        // tail.Increment();
    }
    Overall, I'm expecting just 1000 requests per frame at the most. Do you see anything that might be contributing to the massive latency I'm getting?
     
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    A few things

    Code (CSharp):
    [BurstCompile]
    private struct RemoveDuplicatesJob : IJob
    {
        [ReadOnly]
        public NativeArray<Request> requests;
        [WriteOnly]
        public NativeHashMap<Request, int>.Concurrent existingRequests;
        [WriteOnly]
        public NativeList<int> nonDuplicateRequests;

        public void Execute()
        {
            int length = requests.Length;
            for (int index = 0; index < length; index++)
            {
                if (existingRequests.TryAdd(requests[index], 0))
                {
                    nonDuplicateRequests.Add(index);
                }
            }
        }
    }
    Why are you using Concurrent when you are not running it in parallel?
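    (For a single-threaded IJob the plain container can be written to directly; a minimal sketch, reusing the field name from the job above:)
    Code (CSharp):
    // No .Concurrent needed when only one thread writes: TryAdd works on the map itself.
    public NativeHashMap<Request, int> existingRequests;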

    I notice you're also using NativeHashMap instead of the NativeMultiHashMap your original post mentioned, so I quickly ran the tests again, and it's still only taking 0.14-0.16ms to add 10k elements to the hashmap.

    I then replicated your job on a single thread and benched it; it only took ~0.29ms for 10k entities.

    The only thing I can think of is that your IEquatable on Request is very slow.

    -edit- screenshots

    10k entities (screenshot)

    Time to execute job (screenshot)
     
    Abbrew likes this.
  5. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Thank you. Here's the IEquatable implementation on both MapNodeRequest and GetAllHitsRequest.
    Code (CSharp):
    public struct GetAllHitsRequest : IEquatable<GetAllHitsRequest>
    {
        public float3 start;
        public float3 end;
        public LayerMask collisionMask;

        public bool Equals(GetAllHitsRequest other)
        {
            return end.CloseTo(other.end, 01f)
                && start.CloseTo(other.start, 0.1f)
                && collisionMask.value == other.collisionMask.value;
        }

        //public override int GetHashCode()
        //{
        //    int result = 17;
        //    result = 37 * result + collisionMask.value;
        //    result = 37 * result + start.GetHashCode();
        //    result = 37 * result + end.GetHashCode();
        //    return result;
        //}
    }
    Code (CSharp):
    public struct MapNodeRequest : IEquatable<MapNodeRequest>
    {
        public float2 location;

        public bool Equals(MapNodeRequest other)
        {
            return location.CloseTo(other.location, 0.1f);
        }

        //public override int GetHashCode()
        //{
        //    return location.GetHashCode();
        //}
    }
    Here's the implementation for CloseTo. It's the same for float2 and float3.
    Code (CSharp):
    public static bool CloseTo(this float f1, float other, float marginOfError)
    {
        return Mathf.Abs(f1 - other) < marginOfError;
    }
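    (For reference, the float3 overload would presumably look something like the following; this is an assumption based on "it's the same for float2 and float3", using Unity.Mathematics:)
    Code (CSharp):
    public static bool CloseTo(this float3 f1, float3 other, float marginOfError)
    {
        // Component-wise |f1 - other| < margin; true only if all three components pass.
        return math.all(math.abs(f1 - other) < marginOfError);
    }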
    Sorry I'm making you do all the testing. If there is any testing info you need to better pinpoint the problem, please tell me and I'll do it
     
    Last edited: Jan 14, 2019
  6. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Random note

    return end.CloseTo(other.end, 01f)

    That's meant to be 0.1f.
     
    Abbrew likes this.
  7. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Thanks for catching that. Don't want unique requests to be considered the same! Here's the root of the issue:

    Code (CSharp):
    public struct Concurrent
    {
        [WriteOnly]
        private NativeMultiHashMap<K, V>.Concurrent cache;
        [WriteOnly]
        private NativeHashMap<K, int>.Concurrent cacheLinesPositions;
        [WriteOnly]
        private NativeHashMap<int, K>.Concurrent keysPositions;
        [WriteOnly]
        private NativeHashMap<int, V>.Concurrent valuesPositions;

        [NativeDisableUnsafePtrRestriction]
        private int* tail;

        public Concurrent(
            NativeMultiHashMap<K, V>.Concurrent cache,
            NativeHashMap<int, K>.Concurrent keysPositions,
            NativeHashMap<int, V>.Concurrent valuesPositions,
            NativeHashMap<K, int>.Concurrent cacheLinesPositions,
            int* tail)
        {
            this.cache = cache;
            this.keysPositions = keysPositions;
            this.valuesPositions = valuesPositions;
            this.cacheLinesPositions = cacheLinesPositions;
            this.tail = tail;
        }

        public void Set(K k, V v)
        {
            Profiler.BeginSample("Writing to cache");
            cache.Add(k, v);
            int tailValue = System.Threading.Interlocked.Increment(ref *tail) - 1;
            keysPositions.TryAdd(tailValue, k);
            valuesPositions.TryAdd(tailValue, v);
            cacheLinesPositions.TryAdd(k, tailValue);
            Profiler.EndSample();
        }
    }
    Set(K, V) is taking 47ms for 1000 requests. At first I chalked it up to accessing four different concurrent hashmaps, but similar operations on normal hashmaps are also taking this long.
     
    Last edited: Jan 14, 2019
  8. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Run this

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Entities;
    using Unity.Jobs;

    public class NativeHashMapSystem : JobComponentSystem
    {
        private NativeHashMap<Entity, int> map;
        private NativeList<Entity> dupe;

        /// <inheritdoc />
        protected override void OnCreateManager()
        {
            const int count = 10000;

            this.map = new NativeHashMap<Entity, int>(count, Allocator.Persistent);
            this.dupe = new NativeList<Entity>(Allocator.Persistent);

            var array = new NativeArray<Entity>(count, Allocator.Temp);
            this.EntityManager.CreateEntity(this.EntityManager.CreateArchetype(typeof(Test)), array);
            array.Dispose();
        }

        /// <inheritdoc />
        protected override void OnDestroyManager()
        {
            this.map.Dispose();
            this.dupe.Dispose();
        }

        /// <inheritdoc />
        protected override JobHandle OnUpdate(JobHandle handle)
        {
            this.map.Clear();
            this.dupe.Clear();

            var job = new AddTest
            {
                HashMap = this.map.ToConcurrent(),
                nonDuplicateRequests = this.dupe,
            };

            return job.ScheduleSingle(this, handle);
        }

        [BurstCompile]
        private struct AddTest : IJobProcessComponentDataWithEntity<Test>
        {
            public NativeHashMap<Entity, int>.Concurrent HashMap;

            [WriteOnly]
            public NativeList<Entity> nonDuplicateRequests;

            /// <inheritdoc />
            public void Execute(Entity entity, int index, [ReadOnly] ref Test data)
            {
                if (this.HashMap.TryAdd(entity, index))
                {
                    this.nonDuplicateRequests.Add(entity);
                }
            }
        }

        public struct Test : IComponentData
        {
        }
    }
    Tell me how long it takes.
     
  9. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    11ms on the first frame, 0.23ms every frame afterwards.
     
  10. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    The 11ms is going to be the list resizing.

    Are you recreating the nonDuplicateRequests list every frame?
     
  11. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    If it helps, I have [BurstCompile] commented out for each job that writes to a NativeHashMap, since Burst doesn't like TryGetValue returning a non-blittable bool.
     
  12. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Yes. In my project the allocation takes 0.11ms.
     
  13. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Oh. If you don't have BurstCompile it's going to be 5-10x slower.

    Just comparing the example above, I go from 0.28ms to 2.37ms
     
    Abbrew likes this.
  14. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Don't do that; just cache it. If you really have to recreate it for some reason, pre-allocate some large capacity, otherwise it's going to be really slow because it has to keep reallocating.
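    (A minimal sketch of what that looks like; the capacity number is arbitrary:)
    Code (CSharp):
    // Allocate once with a generous capacity, then reuse it every frame.
    nonDuplicateRequests = new NativeList<int>(16 * 1024, Allocator.Persistent);

    // Per frame: clear instead of recreating, so nothing reallocates
    // unless the list actually outgrows its capacity.
    nonDuplicateRequests.Clear();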
     
  15. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    @tertle I took a look at the Equals method. When I implement it as
    Code (CSharp):
    return true;
    the system takes 1ms to run. That's still an order of magnitude higher than when Entity is the key, but better than a correct implementation.
    Code (CSharp):
    return location.x == other.location.x &&
        location.y == other.location.y;
    Now the system takes 8.5ms to run.
    Code (CSharp):
    return location.Equals(other.location);
    Now the system takes 10ms to run!
    I think the problem lies elsewhere: even when Equals is implemented as
    Code (CSharp):
    return true;
    the hashmap still runs 10x slower.
     
  16. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Are you able to provide the whole system / update method so I can test it myself?
     
  17. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    Sort of solved the issue. The reason the hashmap operations were so slow was the GetHashCode implementation. Let's use GetAllHitsRequest as an example:
    Code (CSharp):
    public struct GetAllHitsRequest : IEquatable<GetAllHitsRequest>
    {
        public float3 start;
        public float3 end;
        public LayerMask collisionMask;

        public bool Equals(GetAllHitsRequest other)
        {
            return end.CloseTo(other.end, 01f)
                && start.CloseTo(other.start, 0.1f)
                && collisionMask.value == other.collisionMask.value;
        }

        //public override int GetHashCode()
        //{
        //    int result = 17;
        //    result = 37 * result + collisionMask.value;
        //    result = 37 * result + start.GetHashCode();
        //    result = 37 * result + end.GetHashCode();
        //    return result;
        //}
    }
    float3.GetHashCode presumably computes the hash values of three floats, and each float hash computation takes an obscene amount of time. By casting the float3 to an int3 first, I was able to reduce the latency 7x.

    Code (CSharp):
    public struct GetAllHitsRequest : IEquatable<GetAllHitsRequest>
    {
        public readonly float3 start;
        public readonly float3 end;
        public readonly LayerMask collisionMask;

        public GetAllHitsRequest(
            float3 start,
            float3 end,
            LayerMask collisionMask)
        {
            this.start = start;
            this.end = end;
            this.collisionMask = collisionMask;
        }

        public bool Equals(GetAllHitsRequest other)
        {
            //return true;
            return end.Equals(other.end)
                && start.Equals(other.start)
                && collisionMask.value == other.collisionMask.value;
            //return end.CloseTo(other.end, 0.1f)
            //&& start.CloseTo(other.start, 0.1f)
            //&& collisionMask.value == other.collisionMask.value;
        }

        public override int GetHashCode()
        {
            int result = 17;
            result = 37 * result + collisionMask.value;
            result = 37 * result + ((int3)start).GetHashCode();
            result = 37 * result + ((int3)end).GetHashCode();
            return result;
        }
    }
    @tertle Here's the link to the GetAllHits system. https://github.com/KontosTwo/DidymosECS/tree/master/Assets/ECS/Environment/Physics/GetAllHits
    Thank you so much for the help along the way
     
  18. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    I still think something is adding latency. Adding 1000 elements now takes 1.5ms instead of something like 0.15ms, which is the range it should be in.
     
  19. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Makes sense that the hash code is the bottleneck.

    Entity.GetHashCode just returns the entity's index:

    Code (CSharp):
    public override int GetHashCode()
    {
        return Index;
    }
    So it's extremely quick by comparison. I don't really have any great advice on optimizing the method (except you should probably wrap it in unchecked, because overflow is fine).

    I guess you could make use of knowledge of your data: if your game area is size-limited you could exploit that, or if your start locations are always the same you could drop them from the hash.
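    (As a rough sketch, wrapping the hash from the earlier post in unchecked would look like this; purely illustrative, not benchmarked:)
    Code (CSharp):
    public override int GetHashCode()
    {
        // unchecked lets the multiplications wrap silently even in a checked
        // build; a hash only needs to mix bits, so overflow is harmless here.
        unchecked
        {
            int result = 17;
            result = 37 * result + collisionMask.value;
            result = 37 * result + ((int3)start).GetHashCode();
            result = 37 * result + ((int3)end).GetHashCode();
            return result;
        }
    }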
     
    Last edited: Jan 15, 2019
    sngdan and Abbrew like this.
  20. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    3,356
    Didn't take time to relate the last bits of code to the original problem. But it seems to me there has to be a way to just avoid creating duplicates to start with. Do the check where you insert and then just don't insert.

    You need to pass a specific layermask to a raycast, so the layermask is a natural partition point there. If it were me I'd just future-proof the thing by actually using RaycastCommand to hold your data. If you care about performance, the next natural step is jobified raycasts anyway.
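    (For reference, a minimal jobified-raycast sketch using RaycastCommand.ScheduleBatch; the request arrays and counts here are hypothetical, and maxHits is left at its default of 1:)
    Code (CSharp):
    // Build one RaycastCommand per request, schedule them as a batch, then read the hits.
    var commands = new NativeArray<RaycastCommand>(requestCount, Allocator.TempJob);
    var results = new NativeArray<RaycastHit>(requestCount, Allocator.TempJob);

    for (int i = 0; i < requestCount; i++)
    {
        // starts/directions/distances/layerMask would come from your request data.
        commands[i] = new RaycastCommand(starts[i], directions[i], distances[i], layerMask);
    }

    JobHandle handle = RaycastCommand.ScheduleBatch(commands, results, 32);
    handle.Complete();

    // results[i] holds the hit for commands[i]; a null collider means no hit.
    // Read the hits you need, then dispose both arrays.
    commands.Dispose();
    results.Dispose();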
     
  21. Abbrew

    Abbrew

    Joined:
    Jan 1, 2018
    Posts:
    417
    RaycastCommand unfortunately has a bug where it only hits one collider regardless of what you specify for the maxHits parameter. I tried implementing my own RaycastCommandMultipleHits function, but it was always much slower than plain old Physics.Raycast.
     
  22. MintTree117

    MintTree117

    Joined:
    Dec 2, 2018
    Posts:
    340
    Do you really get those numbers? 200k inserts takes 5ms single-threaded and 8ms in parallel for me...
     
  23. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    It's a good question because I find trying to replicate the test the results are not nearly as good than when i posted that all that time back.