DOTS spatial partitioning performance (please help I am stuck)...

Discussion in 'Data Oriented Technology Stack' started by martygrof3708, Apr 26, 2020.

  1. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Hello. I have had help from many on this forum, and I have tried your suggestions. They all have their pros and cons, but I can't quite get the performance I need.

    I have been working on many types of 2D spatial partitioning, but I always fall just a little short of my goal.

    Context:
    Map in real units is 4-5 km squared.
    I have 200k entities (RTS units/soldiers).
    These units move in large formations together, but can also sometimes detach, so the map is either empty, sparse, or very dense.

    My Goal: To have a system (or set of systems) able to partition units on a 2D map based on their position. Then, use this information to determine circle collisions between units, and also to find the nearest enemy and store a reference to that enemy for later use. Building the partition and using the data can happen on separate frames, if the grid/map/whatever is able to store a reference to the entities. The map is large, the number of entities is large, and they are dynamic. Memory usage is OK up to around 600-700 MB.

    Unit colliders are circles only, and come in only two radii (men and horses). I do not need a full physics sim, just simple collision separation, plus mass for when units charge. Accuracy of collision can be sacrificed, as long as units do separate. For example, if the player tells all units to go to one spot and they all bunch up, it's OK as long as each unit is pushing away from at least one other. But when units aren't bunched (sparse), they should always detect collisions.

    I have tried:

    NativeMultiHashMap
    Pros: very light on memory, 100% parallel, easy to use.
    Problems: after around 50-60k entities it really starts slowing down.

    2D Grid (Array)
    The grids are definitely faster, but use a lot of memory. Smaller cell sizes mean fewer checks, but much more memory. I have tried many different ways.

    #1. Buckets in different ways.

    For example, each grid cell can hold 4 entities. So frame 1 partitions the units; frame 2 runs an IJobChunk on all entities checking their cell, where the cell index comes from the unit's position. This gives me the most accurate simulation, and allows me to use the partition data for other things. However, this is slowed by all the random access to ComponentDataFromEntity via the Entity stored in the grid cell.

    Or, I stored indices into the grid, and these indices referenced "copy arrays" of component data, similar to how the Unity Boids example does it.

    These approaches save me memory, as each cell is a maximum of 128 bytes (4 entities or ints), but they are slowed down because of it.
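In sketch form, the bucketed layout looks something like this (Python for brevity; the cell size, grid dimensions, and 4-slot capacity here are illustrative toy values, not the actual ones used):

```python
# Flat-array grid: each cell reserves CAPACITY slots for entity indices,
# with -1 marking an empty slot, plus a per-cell count kept alongside.
CELL_SIZE = 1.0
GRID_W = GRID_H = 8      # toy dimensions; the real map is far larger
CAPACITY = 4             # at most 4 entities tracked per cell

def cell_index(x, y):
    """Map a 2D position to a flat cell index."""
    return int(y / CELL_SIZE) * GRID_W + int(x / CELL_SIZE)

def build_grid(positions):
    grid = [-1] * (GRID_W * GRID_H * CAPACITY)
    counts = [0] * (GRID_W * GRID_H)
    for entity, (x, y) in enumerate(positions):
        cell = cell_index(x, y)
        if counts[cell] < CAPACITY:   # overflow entities are simply dropped
            grid[cell * CAPACITY + counts[cell]] = entity
            counts[cell] += 1
    return grid, counts
```

Reading a cell is then a straight indexed lookup into `grid`, with the random access only happening when the stored entity is dereferenced.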

    #2. Sacrificing a little accuracy for speed. Each grid cell holds only one entity, and the grid cell size is halved (the cell size is now the radius of an entity, so it's still pretty accurate).

    I tried this storing entities, and it is a little faster, but not by much, because I am still using the Entity for random access.

    I tried this storing the actual positions and radii of the units instead of their index. This method achieved the speed I wanted, because there is no more random access to component data arrays. However, it means I cannot reuse the same partition for unit-enemy detection, since no entity is stored. The grid at this point stores 3 values (x, y, radius) and takes up nearly 0.75 GB, so building another grid of entities for unit-enemy detection would use a lot more memory. With this method I also need to build and read the grid every frame, as I have no entity reference. This method is still the fastest, but it does not allow for the other functionality I need.

    At this point, I am trying to balance memory and performance, and I just can't quite seem to get it.

    #3. Representing the grid as a BitArray. This was the absolute fastest, and saved tons of memory. Each grid cell is one bit. The grid cell size was 0.25f, where a unit is 0.5f. I simply checked the 4 "big cells" (2x2, so 16 bit checks) a unit was touching, and if any bit was on, I approximated a collision, saying the center of the colliding entity is the center of the cell. Since I check the cell the entity is in plus the 3 nearest cells, and each cell is really subdivided into 4 smaller cells (bits), I got a decently accurate result; vectorized, I am only doing 4 checks, not 16. Again, like grid method #2, I cannot use this for finding nearest enemies.
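A rough sketch of the bit-grid idea (Python; a single big integer stands in for the BitArray, the 0.25 cell size matches the description, and the grid width is a made-up toy value):

```python
CELL = 0.25
GRID_W = 64              # toy width in cells

def bit_index(x, y):
    return int(y / CELL) * GRID_W + int(x / CELL)

def mark(bits, x, y):
    """Set the occupancy bit for the cell containing (x, y)."""
    return bits | (1 << bit_index(x, y))

def occupied_nearby(bits, x, y):
    """Check the unit's cell plus 3 neighbouring cells (a 2x2 block)."""
    cx, cy = int(x / CELL), int(y / CELL)
    return any(bits & (1 << ((cy + dy) * GRID_W + (cx + dx)))
               for dy in (0, 1) for dx in (0, 1))
```

A set bit only says "something is here", which is why this grid cannot answer "which enemy is nearest".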

    #4. Representing the grid as DynamicBuffers, but this was slower than a NativeArray.

    SCD

    Separating units based on SCD (shared component data), but this didn't turn out well.

    More Context

    With the grids, the write is single-threaded but just as fast or even faster than the parallel NMHM write. This means on my other threads I can have another array keeping track of the number of entities per bucket. Of course, with my grid method without buckets, I don't need this counter, which was another speedup.
    This also means that for my bucketed approach, I can have 2 grids and write to one while clearing the old one on another thread, alternating their use. Again, not necessary for grid approach #2.

    For the NMHM I write one frame and read the next; I have multiple maps, use a new one each time, and every 8 frames clear them all.

    Overall, writing to the partition isn't the bottleneck; it's reading from it. I can't get rid of random access while also preserving the two functionalities I need.

    So basically I feel stuck, and if anyone has some insight into this, it's much appreciated.

    I can post code or benchmarks if you want.
     
    Last edited: Apr 26, 2020
    bobbaluba likes this.
  2. Radu392

    Radu392

    Joined:
    Jan 6, 2016
    Posts:
    210
    I'm interested in your benchmarks. I'm not sure what your goals are, since you're not providing a baseline of how many ms each method takes or what your precise target is.

    My personal implementation of a parallelized pure-math RVO and pathfinding system takes about 2ms for 10k units, while computing nearby neighbors takes about 2-10ms depending on the accuracy I want, for a 256x256 grid inside the editor. It's not great, but it's the best I came up with myself. This benchmark wasn't done in a vacuum, so it's not entirely accurate. I'm not sure how it would do for 200k units, though I expect it to be over 30fps.

    Like you, I'm looking into optimizing this part of my code. When it comes to RVO, I believe that job is as optimized as it can get. However, the neighbors-computation job takes a long time. I'm using a 2D grid with an NMHM based on this video, with a few optimizations.



    Which part of your code slows down after 50k units in your NMHM implementation?

    I'm not sure what your target FPS is, but if you're aiming for 60fps and 200k units all performing RVO... then there's only one solution I can think of: divide your chunk reading (and maybe writing) across frames. Limit yourself to computing only X chunks per frame. It sounds like your bottleneck is also finding neighbors, so this should improve it. You will lose a bit of accuracy, but I just don't think there's any other way to achieve that level of performance for 200k units. How can you do this? One quick way I can think of is to create a few new empty component datas and partition your entities with those. For example, entities 0-50k get component data 1, 50k-100k get component data 2, etc. You then schedule frame 1 to run on the first chunk, frame 2 on the second chunk, and so on. I haven't tried this myself, since I have yet to reach that point, but I think it might work. If you're stuck, might as well give it a try.

    Otherwise, you might want to look into particle behavior/flowfield pathing. I tried flowfield pathing myself and the performance actually seemed higher, but it had its own issues, the main one being that the units' behavior was unnatural in this context.
     
  3. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    At 200k entities

    AMD A10, 4-core laptop

    Hashmap:

    - parallel write, 4-6ms.
    - parallel read, 8-20ms (depends on how dense units are)
    Also, you cannot vectorize reads from a hashmap

    Bucketed Array, Storing Entity/QueryIndex
    - single thread write, 4-5ms
    - parallel read, 7-8 ms
    Flexible
    Slowed by indexing of cell, random access

    Bucketed Array, Storing Position Data
    - single thread write, 3-5ms
    - parallel read, 3-4ms
    - RAM: 768 MB
    Must write and read in the same frame, so
    - per frame, ~5-8ms
    (fastest, most accurate, but highest memory)
    Cannot be used for closest enemy, but a separate system can be made at the cost of more memory
    Slowed by indexing of cell

    Single Entity Grid, (each cell stores one entity)
    - single thread write, 2-3ms
    - parallel read, 7-8ms
    - RAM: 256 MB
    Can use for nearby enemy
    No indexing as each cell is only one unit, so vectorizes well, slowed by random access

    Single Data Grid , (each cell stores position/radii data of one entity)
    - single thread write, 3-5ms,
    - parallel read, 2-3 ms,
    - RAM: 768 MB
    Must write and read same frame

    Bit Array, (each cell is a bit, and very small cell sizes)
    - single thread write, 2-4 ms,
    - single thread read, 2-4 ms,
    - RAM 8mb
    Fastest but least accurate.
    No indexing as each cell is only one unit, so it vectorizes well:
    I can check 16 cells in 4 computations.
    No random access.
    Cannot be used for the other system (nearest enemy);
    that system will cost more than this one, so the performance gain is not as strong.
     
    Last edited: Jul 16, 2020
  4. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Code Monkey was my first tutorial! But it's not good for 200k entities. Also, he checks all 8 neighboring cells when you only need to check 4. The hashmap read and write slows down after 50k entities, because past a threshold the size of your map starts bottlenecking you: hash collisions become very frequent, and a lot of memory jumping goes on. Also, just to access a cell you have to perform a hash function, while an array is a straight index. So: hundreds of thousands of hash functions.

    I considered dividing the collision work between frames, but couldn't figure out how to make it unnoticeable to the player. How often does one need to perform a collision check?

    As for particle sim, I looked into it, but that stuff is on the GPU and I don't have the skill to write something like that in a reasonable time, lol. Flowfield is great when your maps are small; my map is large.

    I go with A* pathfinding for the formation (80 units), and the units perform a more refined boids/flocking sort of thing to follow the formation. The formations are rectangles, circles, etc., and each unit has a place relative to the center, so they move towards the center of the formation + offset. So pathfinding isn't an issue for me right now.

    Is it bad to use 1 GB of RAM for my script, if I am aiming at PC gaming only?
     
  5. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    The more I think about it, the more appealing your idea to split chunks is. I can do this:

    Frame 1: Write
    Frames 2 & 3: Perform collision on chunks
    Frame 4: Find closest enemy for 1st half
    Frames 5 & 6: Collision
    Frame 7: Find closest enemy for 2nd half
    Frames 8 & 9: Collision

    Repeat

    This way entities will perform collision every 4th frame.
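Read as a frame-counter dispatch, the schedule above might look something like this (a sketch; the job names are placeholders, and splitting each collision pass into two entity halves is an assumption about what "perform collision on chunks" means):

```python
# The repeating 9-frame schedule, one job name per frame.
SCHEDULE = [
    "write_partition",       # frame 1
    "collide_half_1",        # frame 2
    "collide_half_2",        # frame 3
    "nearest_enemy_half_1",  # frame 4
    "collide_half_1",        # frame 5
    "collide_half_2",        # frame 6
    "nearest_enemy_half_2",  # frame 7
    "collide_half_1",        # frame 8
    "collide_half_2",        # frame 9
]

def job_for_frame(frame):
    """frame is 1-based; the schedule repeats forever."""
    return SCHEDULE[(frame - 1) % len(SCHEDULE)]
```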
     
    Last edited: Apr 26, 2020
  6. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,084
    Have you checked out the BVH implementation in Unity.Physics? It's a very well-optimised BVH tree, optimized for cache access and SIMD. It allows for crazy-fast incremental & full rebuilds, and has very fast queries for closest / distance / bounding volume check / raycast, etc.

    The BVH structure itself deals with bounding volumes & indices. So it should be possible to simply reference Unity.Physics.BoundingVolumeHierarchy and build your game code's data in parallel data structures that drive the updating of the BoundingVolumeHierarchy.

    I am still seriously impressed by this code. It is really tight.


    I am very curious how it fares against your other implementations for this particular use case.


    One extra trick... if you have a lot of soldiers, those usually move slowly. So it can make a lot of sense for your bounding volume hierarchy to have nodes that are a couple of meters larger than the actual bounds. This way you can skip rebuilding the BVH for most of the soldiers every frame: you basically just do an efficient check of "are we still inside the bounds", or, if you know your data well, it could even be a timeout value based on knowing how fast a unit can maximally travel.
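The "fat bounds" check can be sketched like so (Python; the 2-metre margin is an example value, and a 2D circle against an inflated AABB stands in for the real bounds test):

```python
MARGIN = 2.0  # slack added around the tight bounds, in metres

def inflate(aabb):
    """Grow a ((minx, miny), (maxx, maxy)) box by MARGIN on every side."""
    (minx, miny), (maxx, maxy) = aabb
    return ((minx - MARGIN, miny - MARGIN), (maxx + MARGIN, maxy + MARGIN))

def needs_refit(fat_aabb, x, y, radius):
    """True once the unit's circle pokes outside its fat box,
    which is the only time its BVH node must be rebuilt."""
    (minx, miny), (maxx, maxy) = fat_aabb
    return (x - radius < minx or x + radius > maxx or
            y - radius < miny or y + radius > maxy)
```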
     
  7. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,084
    I do agree that a flow field, generally speaking, is a very good choice for the scale you are talking about.
     
    laurentlavigne and martygrof3708 like this.
  8. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Thanks for responding! Well, I am not familiar with BVH trees, so I'll go ahead and learn about them. Is the Unity BVH its own standalone structure? One concern I have (again, I don't know much about BVH) is that if the bounding volume is large, there will be many checks occurring (it can be up to 10-20 per unit in my game), as units are often densely packed.
     
  9. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Well, in my case units are not free; they belong to a formation, and the formation is ordered both in shape and ranks. So I need each unit to always move to its specific position in the formation. I have accomplished this with group pathfinding + an offset for each unit. It's very fast.
     
  10. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    759
    Hey, have you tried Unity.Physics yet?
     
  11. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Hi, not too much. I figured out how to lay out my memory using a grid; I was looking into Unity.Physics, but once I figured out my solution I had no need. I did, however, just test 200,000 box colliders using Unity.Physics, and it did not perform well. I understand that if I just used the BVH tree for simple point distance it might be faster, but my grid solution works pretty well. I made it so that the grid stores an entity-in-query index, and while writing to the grid on one thread, I write to a NativeArray of structs of data from the entities.

    The grid is set up in a pattern where the first index is a unit count, and the next 4 indices are the entity indices of the units in the cell. So it is like a 2D list with a max length of 4. (A sacrifice I made: if there are more than 4 units in a cell, some collisions might be missed, but the cells are so small this would rarely happen, and if it did it would not even be noticeable, at the benefit of saving memory.)

    Then I loop over the grid and native array of struct data in an IJobParallelFor, and resolve collisions. Then in a parallel IJobChunk I write this array data back to entities.
     
    Last edited: May 11, 2020
  12. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    759
    May I ask for the measurement results? I am doing a similar thing for my networking solution, trying to calculate AOI (area of interest) for 200k entities and 1k clients. NativeMultiHashMap writing is costly; the only way to optimize it is to avoid constant writing. Hashing itself is fast with Burst.
     
  13. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    For 200k 2D circle collisions with mass, it takes about 7-8 ms total. The single-thread write to the array is 2-3 ms, the collision resolution about 3-4, and the write back to entities about 1-2. Be aware it takes 320,000 mb of RAM. It works in my use case. The hashmap is not a good choice for this number of entities. The only other way is to use the GPU or sacrifice RAM.
     
    wobes likes this.
  14. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Are you writing to the map every frame?
     
  15. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    759
    Yeah. I am thinking about how not to do that every frame, and how not to clear the map every frame either.
     
  16. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    There are 3 approaches.

    One is to have multiple hashmaps, write to a new one each frame, and then every 10 frames skip collision checks and clear all the maps in parallel.

    Another is to write entities to the hashmap on one frame, then on the next frame resolve collisions and clear it.

    A third is to not clear the hashmap at all, but instead check single-threaded which entities moved outside their cells and update their map position. It's actually faster to do this single-threaded than in parallel.
     
  17. aganm

    aganm

    Joined:
    Sep 25, 2019
    Posts:
    15
    Which technique is it where you are getting 7-8 ms for 200k circle collisions?
    Also, how much RAM does it take? You said 320,000 mb (320gb); did you mean 320 mb?
    I'm trying to do the same thing myself but can barely find any info about this.
     
  18. SenseEater

    SenseEater

    Joined:
    Nov 28, 2014
    Posts:
    71
    Are you sure about those memory numbers?
     
  19. Nyanpas

    Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    259
    Quadtree?
     
  20. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    153
    Maybe he's targeting $3000/month EC2 instances for his game.
     
  21. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Sorry, I meant 3.2 GB, lol. I ended up just using a 2D grid to partition the entities, and sacrificed minor accuracy by making each grid cell only recognize up to 4 entities at a time.
     
  22. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    No, I mean 3.2 GB at max, lol.
     
    SenseEater likes this.
  23. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
  24. aganm

    aganm

    Joined:
    Sep 25, 2019
    Posts:
    15
    3.2 GB sounds like a lot to me.
    If every single one of your entities took 1 KB of memory, that would be only 200 MB for 200k entities.
    How big is your grid? Is it something like 10000x10000?
     
  25. aganm

    aganm

    Joined:
    Sep 25, 2019
    Posts:
    15
  26. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Yeah, my map is huge, 10km by 10km, but I ended up down-scaling it to about 5x5. I have updated the Git repo with comments; if it still doesn't make sense, let me know and I'll write up a text doc or something. I'm not sure how much you know about jobs. If you are new, I will be happy to explain more!
     
    burningmime likes this.
  27. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    Also, the key thing about my system is how I structured the array representing the grid. For example, here the first 10 indices of the grid represent the first two spatial cells: 3, 22, 888, 87, -1, 2, 1009, 993, -1, -1. Cell #1 contains 3 entities, whose indices in the unit query are 22, 888, and 87 respectively. Cell #2 contains 2 entities, whose indices are 1009 and 993.

    The key thing is how the data is structured. When I loop over the CopyPositions array, I find the cell the current position is in and the nearest cells around it. Using those cells as indices into the spatial grid, I retrieve all neighbour indices in each cell, then use those indices into CopyPositions, which gives me a neighbour's position. I check whether that position + some radius intersects the current position + some radius, and modify the current position if it collided.

    Then, I can feed CopyPositions back into the entities.
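That count-prefixed layout can be sketched directly (Python; the 5-int stride of 1 count plus up to 4 entity-in-query indices, with -1 for empty slots, follows the example numbers above):

```python
STRIDE = 5  # 1 count + 4 index slots per cell

# The example data from above: cell #1 holds entities 22, 888, 87;
# cell #2 holds entities 1009 and 993.
grid = [3, 22, 888, 87, -1,
        2, 1009, 993, -1, -1]

def entities_in_cell(grid, cell):
    """Return the entity-in-query indices stored in one cell:
    the count and all its indices sit in one contiguous read."""
    base = cell * STRIDE
    return grid[base + 1 : base + 1 + grid[base]]
```

Each returned index then goes straight into the CopyPositions array for the actual circle test.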
     
  28. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    153
    Did you experiment with swizzling to get the memory layout optimal, or is it grid-based?
    https://demonstrations.wolfram.com/ComparingXYCurvesAndZOrderCurvesForTextureCoding/
    https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
     
  29. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    The map is a grid. The map array and positions array are not memory-aligned; the positions array is aligned with the entity array. In this type of system there has to be at least one loop with random access, as far as I've figured out. I have found it fastest to copy the positions of the entities into an array, and randomly access the map array while looping over the copied positions.

    I save a lot of cache misses with the way the grid array is structured; I can check the number of entities in a cell and all the entity indices in one go. Right now, CopyPositions fits in L2 cache, so there is only one cache miss per entity (checking the grid). Checking the position of the nearby entity is free, because CopyPositions is all in cache and I have the index of the neighbour from the map grid.

    I don't know much about swizzling yet, but I don't think I need it here. Using IJobChunk allows the copy array to be laid out automatically.
     
  30. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    153
    Swizzling isn't too complex; it's just this:



    This would change the layout of accesses to the grid (the "hash" you're calculating would be the tile it's in, so nearby units hash to similar numbers). The grid is the part that doesn't fit in cache, right? So you'd reduce the number of different places in the grid you'd need to access. The rest wouldn't change.

    Anyways, I don't think it would be a big deal unless this is still a performance bottleneck.
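For reference, the 2D Z-order (Morton) index is just bit interleaving — a minimal sketch (this is the standard 16-bit bit-spreading trick, not code from the links above):

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit follows each bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton(x, y):
    """Interleave cell coordinates so nearby cells get nearby indices."""
    return part1by1(x) | (part1by1(y) << 1)
```

Using `morton(cx, cy)` instead of `cy * GRID_W + cx` as the flat index keeps a 2x2 neighbourhood of cells close together in memory.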
     
    martygrof3708 likes this.
  31. martygrof3708

    martygrof3708

    Joined:
    Dec 2, 2018
    Posts:
    200
    That's excellent information, thank you. I will get around to doing this; I think it could be a noticeable speed-up. I guess I didn't get what swizzling was, lol. Lack of attention on my part!
     