Search Unity

  1. Megacity Metro Demo now available. Download now.
    Dismiss Notice
  2. Unity support for visionOS is now available. Learn more in our blog post.
    Dismiss Notice

Bug NativeHashMap.ParallelWriter TryAdd is NOT Thread Safe (with repo)

Discussion in 'Entity Component System' started by tertle, May 9, 2020.

  1. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    Follow up to https://forum.unity.com/threads/infinite-loop-in-nativehashmap-tryadd.885730/
    Sorry for making another post, but this is kind of a serious issue has been plaguing our project for 6 months (I've had concerns about this for a while due to some weird behavior randomly appearing elsewhere) but I finally got a simple repo.

    Tested in Unity 2019.3.9f1

    Add these to your package manifest
    "com.unity.burst": "1.3.0-preview.12",
    "com.unity.entities": "0.9.1-preview.15",
    this brings in collections 0.7.1-preview.3

    (not using entities 0.10 due to another bug but this still exists in the latest version)

    Add this script

    -edit- I have a more consistent repo here:
    https://forum.unity.com/threads/nat...ot-thread-safe-with-repo.886318/#post-5827030

    Code (CSharp):
    1. using Unity.Burst;
    2. using Unity.Collections;
    3. using Unity.Entities;
    4. using UnityEngine;
    5.  
    6. public class HashMapSystem : SystemBase
    7. {
    8.     private EntityQuery query;
    9.  
    10.     protected override void OnCreate()
    11.     {
    12.         query = GetEntityQuery(ComponentType.ReadOnly<CommonComponent>());
    13.  
    14.         var archetype0 = EntityManager.CreateArchetype(typeof(CommonComponent));
    15.         EntityManager.CreateEntity(archetype0, 4000, Allocator.Temp);
    16.  
    17.         var archetype1 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1));
    18.         EntityManager.CreateEntity(archetype1, 4000, Allocator.Temp);
    19.  
    20.         var archetype2 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2));
    21.         EntityManager.CreateEntity(archetype2, 4000, Allocator.Temp);
    22.  
    23.         var archetype3 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2), typeof(Component3));
    24.         EntityManager.CreateEntity(archetype3, 4000, Allocator.Temp);
    25.     }
    26.  
    27.     protected override void OnUpdate()
    28.     {
    29.         var hashMap = new NativeHashMap<EntityArchetype, ushort>(query.CalculateChunkCount(), Allocator.TempJob);
    30.  
    31.         Dependency = new WriteJob
    32.             {
    33.                 HashMap = hashMap.AsParallelWriter(),
    34.             }
    35.             .Schedule(query, Dependency);
    36.  
    37.         Dependency = new ReadJob
    38.             {
    39.                 HashMap = hashMap,
    40.             }
    41.             .Schedule(query, Dependency);
    42.  
    43.         hashMap.Dispose(Dependency);
    44.     }
    45.  
    46.     [BurstCompile]
    47.     private struct WriteJob : IJobChunk
    48.     {
    49.         public NativeHashMap<EntityArchetype, ushort>.ParallelWriter HashMap;
    50.  
    51.         public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
    52.         {
    53.             DoJob(chunk.Archetype);
    54.         }
    55.  
    56.         private void DoJob(EntityArchetype archetype)
    57.         {
    58.             HashMap.TryAdd(archetype, 1);
    59.         }
    60.     }
    61.  
    62.     [BurstCompile]
    63.     private struct ReadJob : IJobChunk
    64.     {
    65.         [ReadOnly] public NativeHashMap<EntityArchetype, ushort> HashMap;
    66.  
    67.         public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
    68.         {
    69.             if (!HashMap.ContainsKey(chunk.Archetype))
    70.             {
    71.                 Debug.LogError("FAILED STATE");
    72.             }
    73.         }
    74.     }
    75.  
    76.     private struct CommonComponent : IComponentData{}
    77.  
    78.     private struct Component1 : IComponentData
    79.     {
    80.         public int Value1;
    81.     }
    82.  
    83.     private struct Component2 : IComponentData
    84.     {
    85.         public int Value1;
    86.         public int Value2;
    87.     }
    88.  
    89.     private struct Component3 : IComponentData
    90.     {
    91.         public int Value1;
    92.         public int Value2;
    93.         public int Value3;
    94.  
    95.     }
    96. }
    97.  
    Hit play and let it run, it'll fail eventually.
    On my machine with a 3900X (lots of cores) it only takes a few seconds to fail, but it might take longer on a machine with less cores.

    upload_2020-5-9_11-43-22.png

    I'm going to log a bug report shortly.

    Case 1245648
     
    Last edited: May 9, 2020
  2. iamarugin

    iamarugin

    Joined:
    Dec 17, 2014
    Posts:
    880
  3. GilCat

    GilCat

    Joined:
    Sep 21, 2013
    Posts:
    676
    I tried that and did not get any problem. I have a 4 core CPU (i5-4670K). It was running for almost an hour.
    Tested using Unity 2020.0a10 with both Entities 0.9.1-preview.15 and Entities 0.10.0-preview.6
     
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    Interesting, I was originally going to write this test with a bunch of jobs that changed the archetypes every frame to simulate our project better but it failed very quickly without changing anything for me.

    I just had 1 test that wouldn't fail in 5 minutes, so I added a job to mix up the chunks constantly and I can consistently fail in a few seconds.

    19.3.9f1

    upload_2020-5-10_7-51-0.png

    upload_2020-5-10_7-51-53.png

    19.3.13f1

    upload_2020-5-10_7-53-11.png

    20.2.0a10

    upload_2020-5-10_7-55-24.png

    New repo if you want to try it

    Code (CSharp):
    1. using Unity.Burst;
    2. using Unity.Collections;
    3. using Unity.Entities;
    4. using UnityEngine;
    5. using Random = Unity.Mathematics.Random;
    6.  
    7. public class HashMapSystem : SystemBase
    8. {
    9.     private EntityQuery query;
    10.     private EndSimulationEntityCommandBufferSystem bufferSystem;
    11.  
    12.     protected override void OnCreate()
    13.     {
    14.         Debug.Log("System Start");
    15.  
    16.         query = GetEntityQuery(ComponentType.ReadOnly<CommonComponent>());
    17.  
    18.         var archetype0 = EntityManager.CreateArchetype(typeof(CommonComponent));
    19.         EntityManager.CreateEntity(archetype0, 4000, Allocator.Temp);
    20.  
    21.         var archetype1 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1));
    22.         EntityManager.CreateEntity(archetype1, 4000, Allocator.Temp);
    23.  
    24.         var archetype2 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2));
    25.         EntityManager.CreateEntity(archetype2, 4000, Allocator.Temp);
    26.  
    27.         var archetype3 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2), typeof(Component3));
    28.         EntityManager.CreateEntity(archetype3, 4000, Allocator.Temp);
    29.  
    30.         bufferSystem = World.GetOrCreateSystem<EndSimulationEntityCommandBufferSystem>();
    31.     }
    32.  
    33.     protected override void OnUpdate()
    34.     {
    35.         var hashMap = new NativeHashMap<EntityArchetype, ushort>(query.CalculateChunkCount(), Allocator.TempJob);
    36.  
    37.         Dependency = new WriteJob
    38.             {
    39.                 HashMap = hashMap.AsParallelWriter(),
    40.             }
    41.             .Schedule(query, Dependency);
    42.  
    43.         Dependency = new ReadJob
    44.             {
    45.                 HashMap = hashMap,
    46.             }
    47.             .Schedule(query, Dependency);
    48.  
    49.         hashMap.Dispose(Dependency);
    50.  
    51.         var random = new Random((uint)UnityEngine.Random.Range(1, int.MaxValue));
    52.  
    53.         var commandBuffer = bufferSystem.CreateCommandBuffer().ToConcurrent();
    54.  
    55.         Entities.WithAll<CommonComponent>().ForEach((Entity entity, int entityInQueryIndex) =>
    56.         {
    57.             random.InitState((uint)(random.state + entityInQueryIndex));
    58.  
    59.             switch (random.NextInt(0, 6))
    60.             {
    61.                 case 0:
    62.                     commandBuffer.AddComponent<Component1>(entityInQueryIndex, entity);
    63.                     return;
    64.                 case 1:
    65.                     commandBuffer.RemoveComponent<Component1>(entityInQueryIndex, entity);
    66.                     return;
    67.                 case 2:
    68.                     commandBuffer.AddComponent<Component2>(entityInQueryIndex, entity);
    69.                     return;
    70.                 case 3:
    71.                     commandBuffer.RemoveComponent<Component2>(entityInQueryIndex, entity);
    72.                     return;
    73.                 case 4:
    74.                     commandBuffer.AddComponent<Component3>(entityInQueryIndex, entity);
    75.                     return;
    76.                 case 5:
    77.                     commandBuffer.RemoveComponent<Component3>(entityInQueryIndex, entity);
    78.                     return;
    79.             }
    80.  
    81.         }).ScheduleParallel();
    82.  
    83.         bufferSystem.AddJobHandleForProducer(Dependency);
    84.     }
    85.  
    86.     [BurstCompile]
    87.     private struct WriteJob : IJobChunk
    88.     {
    89.         public NativeHashMap<EntityArchetype, ushort>.ParallelWriter HashMap;
    90.  
    91.         public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
    92.         {
    93.             DoJob(chunk.Archetype);
    94.         }
    95.  
    96.         private void DoJob(EntityArchetype archetype)
    97.         {
    98.             HashMap.TryAdd(archetype, 1);
    99.         }
    100.     }
    101.  
    102.     [BurstCompile]
    103.     private struct ReadJob : IJobChunk
    104.     {
    105.         [ReadOnly] public NativeHashMap<EntityArchetype, ushort> HashMap;
    106.  
    107.         public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
    108.         {
    109.             if (!HashMap.ContainsKey(chunk.Archetype))
    110.             {
    111.                 Debug.LogError("FAILED STATE");
    112.             }
    113.         }
    114.     }
    115.  
    116.     private struct CommonComponent : IComponentData{}
    117.  
    118.     private struct Component1 : IComponentData
    119.     {
    120.         public int Value1;
    121.     }
    122.  
    123.     private struct Component2 : IComponentData
    124.     {
    125.         public int Value1;
    126.         public int Value2;
    127.     }
    128.  
    129.     private struct Component3 : IComponentData
    130.     {
    131.         public int Value1;
    132.         public int Value2;
    133.         public int Value3;
    134.     }
    135. }
    136.  
     
  5. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    Should add, even setting my

    JobsUtility.JobWorkerCount = 4;

    still triggers pretty fast

    upload_2020-5-10_8-4-50.png
     
  6. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,683
    Well I'm on 3900X too. WriteJob ReadJob wihtout code below (mixing components) works without any problems, but with components mixing I see that problem, and after that error and couple of seconds my editor stuck like it dropped in to infinite loop.I guess problem here might be in chunk fragmentation itself which can cause Archetype pointer corruption somehow...
    EDIT: Ah without chunk mixing it also dropped to Failed State after couple of minutes.

    I guess it maybe somehow related to fact that EntityArchetype contains Archetype pointer and IEquateble use Archetype pointer for comparison. But can't see how. As chunks should be stable in jobs execution path...
     
  7. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    That is the original issue I reported here: https://forum.unity.com/threads/infinite-loop-in-nativehashmap-tryadd.885730/

    The TryAdd corrupts the hashmap and the nextptr points to itself and just infinite loops.

    That's the other thing. I have a bunch of evidence of actual chunks being corrupted at the moment but I don't have a repo so I haven't posted it.

    I've had to add this to our project in ChunkDataUtility

    Code (CSharp):
    1. public static byte* GetComponentDataWithTypeRW(Chunk* chunk, int index, int typeIndex, uint globalSystemVersion)
    2.         {
    3.             if (chunk == null)
    4.             {
    5.                 TypeManager.TypeInfo t = TypeManager.GetTypeInfo(typeIndex);
    6.                 Debug.LogError($"Whoops you just crashed but I saved you. In return can you please write TypeInfo {t.TypeIndex} in the console and pass it to Tim and mention null.");
    7.                 return null;
    8.             }
    9.  
    10.             var indexInTypeArray = GetIndexInTypeArray(chunk->Archetype, typeIndex);
    11.  
    12.             if (indexInTypeArray < 0) {
    13.                 TypeManager.TypeInfo t = TypeManager.GetTypeInfo(typeIndex);
    14.                 Debug.LogError($"Whoops you just crashed but I saved you. In return can you please write TypeInfo {t.TypeIndex} in the console and pass it to Tim and mention -1.");
    15.                 return null;
    16.             }
    17.  
    18.             var offset = chunk->Archetype->Offsets[indexInTypeArray];
    19.             var sizeOf = chunk->Archetype->SizeOfs[indexInTypeArray];
    20.  
    21.             // Write Component to Chunk. ChangeVersion:Yes OrderVersion:No
    22.             chunk->SetChangeVersion(indexInTypeArray, globalSystemVersion);
    23.  
    24.             return chunk->Buffer + (offset + sizeOf * index);
    25.         }
    because once about every 40 minutes on average, somehow EntityCommandBuffer.AddComponent fails during the SetComponent part because the typeIndex is not found on the chunk even though the component should have just been added... how is that possible?

    I had been working on assumption that something in our project was somehow breaking it but after spending the last week debugging it and discovering the particular issue in this thread, I'm starting to think otherwise.

    -edit-

    I should add, this seems to happen with or without burst.
     
    Last edited: May 9, 2020
    elcionap, GilCat and eizenhorn like this.
  8. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,683
    Yep I see, I pointed Branimir to this thread, hope He will look at this thread after weekend, it would be very kind of him
     
    nicolasgramlich likes this.
  9. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    > The TryAdd corrupts the hashmap and the nextptr points to itself and just infinite loops.

    Does TryAdd ever return false?

    In code:
    var hashMap = new NativeHashMap<EntityArchetype, ushort>(query.CalculateChunkCount(), Allocator.TempJob);

    1. Does it happen if you double the capacity?
    2. Does it happen if you increase capacity by 1?
     
  10. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    The infinite loop is somewhat of a rare case for me, I've only seen it a couple of times, mostly in a build which is when I hooked up a profiler and stepped through and first noticed it.

    From this repo, TryAdd returns false about 3/4 of the time.

    +1 still enters failed state
    x2 still enters failed state
     
  11. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,683
    Note: Infinite loop happens 100% on my machine with code above
     
  12. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    This implies it might be race condition somewhere. At first I thought it's issue with tight capacity.

    I plopped your code in one of our Hello Cube examples, but I don't get the issue at all. 2020.1.0b8 + latest packages (but nothing really changed in *HashMap.TryAdd implementation for a while).
     
  13. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    Ok I reproduced it, it takes a while in example I'm using...
     
    tertle and eizenhorn like this.
  14. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    Are you running the original or the revised repo? The revised repo here:
    https://forum.unity.com/threads/nat...ot-thread-safe-with-repo.886318/#post-5827030
    seems to reproduces it much faster in general.

    -edit-

    As for nothing changing in a while, this has been an issue for months but we've been working around it. This is a comment from another developer from our code relating to the workaround that was written ~6 months ago.

    // We are in an error state! I'm not sure why this sometimes happens, seems to go back to an internal unity error in NativeHashMap.GetKeyArray


    not claiming GetKeyArray is the issue, that was just his guess at the time

    I'm just in the process of trying to get a stable build running for QA/publishers as we approach content lock so I'm actually investigating problems in detail now which is why I spent a bunch of time trying to get an easy repo for you guys.
     
    Last edited: May 12, 2020
  15. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    I was running original code.

    Can you try to go to UnsafeHashMap.cs and repleace TryAddAtomic with this:

    Code (CSharp):
    1.         internal static unsafe bool TryAddAtomic(UnsafeHashMapData* data, TKey key, TValue item, int threadIndex)
    2.         {
    3.             TValue tempItem;
    4.             NativeMultiHashMapIterator<TKey> tempIt;
    5.             if (TryGetFirstValueAtomic(data, key, out tempItem, out tempIt))
    6.             {
    7.                 return false;
    8.             }
    9.  
    10.             // Allocate an entry from the free list
    11.             int idx = AllocEntry(data, threadIndex);
    12.  
    13.             int bucket = key.GetHashCode() & data->bucketCapacityMask;
    14.             // Add the index to the hash-map
    15.             int* buckets = (int*)data->buckets;
    16.  
    17.             if (Interlocked.CompareExchange(ref buckets[bucket], idx, -1) != -1)
    18.             {
    19.                 int* nextPtrs = (int*)data->next;
    20.  
    21.                 do
    22.                 {
    23.                     nextPtrs[idx] = buckets[bucket];
    24.  
    25.                     if (TryGetFirstValueAtomic(data, key, out tempItem, out tempIt))
    26.                     {
    27.                         // Put back the entry in the free list if someone else added it while trying to add
    28.                         do
    29.                         {
    30.                             nextPtrs[idx] = data->firstFreeTLS[threadIndex * UnsafeHashMapData.IntsPerCacheLine];
    31.                         }
    32.                         while (Interlocked.CompareExchange(
    33.                             ref data->firstFreeTLS[threadIndex * UnsafeHashMapData.IntsPerCacheLine]
    34.                             , idx
    35.                             , nextPtrs[idx]
    36.                                ) != nextPtrs[idx]
    37.                         );
    38.  
    39.                         return false;
    40.                     }
    41.                 }
    42.                 while (Interlocked.CompareExchange(ref buckets[bucket], idx, nextPtrs[idx]) != nextPtrs[idx]);
    43.             }
    44.  
    45.             // Write the new value to the entry
    46.             UnsafeUtility.WriteArrayElement(data->keys, idx, key);
    47.             UnsafeUtility.WriteArrayElement(data->values, idx, item);
    48.  
    49.             return true;
    50.         }
    This is not solution, just a test...

    I'll try revisited example now.
     
  16. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    Running either seems to have similar failure rate. But once I apply this change to TryAddAtomic it doesn't happen anymore.
     
  17. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,753
    Still failed for me using the TryAddAtomic you posted.

    upload_2020-5-13_10-23-18.png

    What I've realized is it fails a lot faster for some reason with Entities 0.9.1 and Collections 0.7.1 than Entities 0.10 and Collections 0.8 (but it does still fail eventually with 0.10/0.8)

    Entities 0.10/Collections 0.8 took 90s to 4 minutes for me over 3 tests
    vs
    Entities 0.9.1/Collections 0.7 took 5-20 seconds to fail over a bunch of tests.

    This is a pretty consistent difference.
     
  18. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    Update: Figured out repro, and exact issue, but still don't have proper fix for it.
     
    tertle, digitaliliad and eizenhorn like this.
  19. bkaradzic-unity3d

    bkaradzic-unity3d

    Unity Technologies

    Joined:
    Jan 17, 2020
    Posts:
    17
    This is fixed, and it will be available in the next release. Thanks for concise repro case!