Search Unity

Search

Heavily optimizing a* pathfinding algorithm inside a JobSystem.

Discussion in 'Entity Component System' started by MintTree117, Mar 3, 2020.

Page 1 of 2

MintTree117

Joined:

Dec 2, 2018

Posts:

340

Hello all.

I have just finished implementing an algorithm using a Burst-Enabled parallel job. Its quite fast, but not fast enough. Currently 50 units can traverse an empty 100X100 grid in about 16-21 ms, in one frame. I do distribute these requests along multiple frames.

However, there are some key parts of my algorithm which I know are bottlenecks.

1) Inside my job, I currently do a deep copy of a 1-d NativeArray representing the map so I can work with a copy and write to it, with each index being a node with pos, costs, indexes of parent nodes, etc.

I don't how how to represent the nodes without doing the deep copy. It seems to me I would have to either create a bunch of arrays for all the fields of each node, or some other expensive operation anyway.
Basically, I am asking here how I can very cheaply get a representation of my grid/nodes to my jobs, which all run in parallel, so they cannot write to the same representation.

2) I am still using NativeLists for my open/closed sets.

I want to implement a priority queue, however I split my grid into 10x10 sections. I don't think the pq will fare better than my lists with such a small number of nodes. Please correct me if I am wrong.

I am more than happy to post a link to the source code, but its rather long, so let me know.

Also, any other tips/suggestions regarding anything I can do to increase the performance of my job, or in general.

https://github.com/Tree37/A-Pathfinding-Jobs

Thanks.

Last edited: Mar 3, 2020

MintTree117, Mar 3, 2020

#1
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

1) Only do a deep copy of the data you actually need to mutate.
2) Hard to tell without seeing code.

When trying to optimize a job like this, it can be useful to break up the job into a chain of small jobs and profile each piece. But more importantly, smaller jobs make the Burst inspector a fair bit more readable, so you can pick a job and really drill down performance. If you need help optimizing a particular job using the Burst inspector, post code on GitHub and I can dig into it.

Some things I've learned:
1) Iterating by columns is terrible.
2) Branches are terrible. Avoid && and || and try to express such statements differently to reduce the number of branches Burst generates.
3) Some type of shuffling instruction ever other instruction usually means something is wrong.
4) If you are copying a struct to a new location which is not its final location, you can probably copy an index to the struct's original array instead.

DreamingImLatios, Mar 3, 2020

#2

MintTree117 likes this.
Antypodish

Joined:

Apr 29, 2014

Posts:

10,779

How do you copy array?
Do you really need copy whole array?
What is the purpose of copying, to be able work on it.
Multiple jobs can read from same array at the same time.

Antypodish, Mar 3, 2020

#3

MintTree117 likes this.
palex-nx

Joined:

Jul 23, 2018

Posts:

1,748

Maybe can modify shared copy of the map from multuiple threads using atomic operations https://docs.microsoft.com/en-us/dotnet/api/system.threading.interlocked?view=netframework-4.8

While it adds complexity to the code, the benefit is the absence of locking and copying data

palex-nx, Mar 3, 2020

#4
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

palex-nx said: ↑

Maybe can modify shared copy of the map from multuiple threads using atomic operations https://docs.microsoft.com/en-us/dotnet/api/system.threading.interlocked?view=netframework-4.8

While it adds complexity to the code, the benefit is the absence of locking and copying data
Click to expand...

I cringe every time I see someone mention atomics as being non-locking. Atomics are not that much faster than mutexes, primarily because mutexes are usually just a pair of atomic instructions. Atomics do absolutely horrible things to cache lines and instruction execution pipes and should be used sparsely if not avoided altogether.

DreamingImLatios, Mar 3, 2020

#5
palex-nx

Joined:

Jul 23, 2018

Posts:

1,748

DreamingImLatios said: ↑

I cringe every time I see someone mention atomics as being non-locking. Atomics are not that much faster than mutexes, primarily because mutexes are usually just a pair of atomic instructions. Atomics do absolutely horrible things to cache lines and instruction execution pipes and should be used sparsely if not avoided altogether.
Click to expand...

Well, what would you suggest then?

palex-nx, Mar 3, 2020

#6
MintTree117

Joined:

Dec 2, 2018

Posts:

340

I have posted my code.

https://github.com/Tree37/A-Pathfinding-Jobs

MintTree117, Mar 3, 2020

#7
MintTree117

Joined:

Dec 2, 2018

Posts:

340
I copy my array like this:

Code (CSharp):

NativeArray<PathNode> pathNodeArrayCopy = new NativeArray<PathNode>( pathNodeArray , Allocator.Temp );

I don't need to copy the whole array, but if I didn't I would need to make a bunch of others arrays to copy data anyway.

The purpose of copying it is so I have data of the grid positions of nodes, their h,g,f costs, and data wether a node is walkable or not.
MintTree117, Mar 3, 2020

#8
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

1) Only do a deep copy of the data you actually need to mutate.
2) Hard to tell without seeing code.

When trying to optimize a job like this, it can be useful to break up the job into a chain of small jobs and profile each piece. But more importantly, smaller jobs make the Burst inspector a fair bit more readable, so you can pick a job and really drill down performance. If you need help optimizing a particular job using the Burst inspector, post code on GitHub and I can dig into it.

Some things I've learned:
1) Iterating by columns is terrible.
2) Branches are terrible. Avoid && and || and try to express such statements differently to reduce the number of branches Burst generates.
3) Some type of shuffling instruction ever other instruction usually means something is wrong.
4) If you are copying a struct to a new location which is not its final location, you can probably copy an index to the struct's original array instead.
Click to expand...

1) What do you mean by columns?
2) I do have branches but they are part of the a* algorithm.
3) What do you mean by shuffling?
4) How can I do this?

MintTree117, Mar 3, 2020

#9
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

palex-nx said: ↑

Well, what would you suggest then?
Click to expand...

Rework your code so that two threads don't have to write to the same data address.

martygrof3708 said: ↑

I don't need to copy the whole array, but if I didn't I would need to make a bunch of others arrays to copy data anyway.
Click to expand...

So you have 34 bytes per node, and only 16 of those bytes you actually have to modify. So that's a solid 50% of your memcpy performance right there, possibly even higher. In addition, I don't think you even need to copy the data you write to as much as you could just allocate new empty arrays for them and populate them with data. SoA could help you out a lot here.

martygrof3708 said: ↑

1) What do you mean by columns?
2) I do have branches but they are part of the a* algorithm.
3) What do you mean by shuffling?
4) How can I do this?
Click to expand...

These were general tips with Burst not specific to any particular algorithm. Though I will go through them anyway:
1) If you have a 2D array represented as a 1D array with the index accessed as col * rowWidth + row, and you wanted to iterate using two for loops, you want to have the outer for loop increment columns and the inner loop increment rows. I didn't see any obvious issue with this in your code, but I might have missed a spot.
2) LineOfSight could definitely use some improvement in this regard.
3) Shufflings are simd instructions that reorder the elements in a simd register. I don't think any of your code is using simd at the moment but I would have to dump it into Burst.
4) You are already kinda doing this with your open and closed lists.

Another thing you may want to try is instead of using two lists for your open and closed list, just use a bool array specifying whether something is open or closed. The reason for this is you are doing lots of Contains() calls that you wouldn't need to do if it was a bool check at the index instead. There is a tradeoff of you having to sift out the closed list nodes in GetLowestCostFNodeIndex, but I suspect that tradeoff will be a win.

Try those first and then if you still need more performance we can dive into the asm, although I will need to know your target platform by that point.

DreamingImLatios, Mar 4, 2020

#10

Enzi and MintTree117 like this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

Rework your code so that two threads don't have to write to the same data address.

So you have 34 bytes per node, and only 16 of those bytes you actually have to modify. So that's a solid 50% of your memcpy performance right there, possibly even higher. In addition, I don't think you even need to copy the data you write to as much as you could just allocate new empty arrays for them and populate them with data. SoA could help you out a lot here.

These were general tips with Burst not specific to any particular algorithm. Though I will go through them anyway:
1) If you have a 2D array represented as a 1D array with the index accessed as col * rowWidth + row, and you wanted to iterate using two for loops, you want to have the outer for loop increment columns and the inner loop increment rows. I didn't see any obvious issue with this in your code, but I might have missed a spot.
2) LineOfSight could definitely use some improvement in this regard.
3) Shufflings are simd instructions that reorder the elements in a simd register. I don't think any of your code is using simd at the moment but I would have to dump it into Burst.
4) You are already kinda doing this with your open and closed lists.

Another thing you may want to try is instead of using two lists for your open and closed list, just use a bool array specifying whether something is open or closed. The reason for this is you are doing lots of Contains() calls that you wouldn't need to do if it was a bool check at the index instead. There is a tradeoff of you having to sift out the closed list nodes in GetLowestCostFNodeIndex, but I suspect that tradeoff will be a win.

Try those first and then if you still need more performance we can dive into the asm, although I will need to know your target platform by that point.
Click to expand...

Thank you very much for your help. I am implementing now.

MintTree117, Mar 4, 2020

#11
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

Rework your code so that two threads don't have to write to the same data address.

So you have 34 bytes per node, and only 16 of those bytes you actually have to modify. So that's a solid 50% of your memcpy performance right there, possibly even higher. In addition, I don't think you even need to copy the data you write to as much as you could just allocate new empty arrays for them and populate them with data. SoA could help you out a lot here.

These were general tips with Burst not specific to any particular algorithm. Though I will go through them anyway:
1) If you have a 2D array represented as a 1D array with the index accessed as col * rowWidth + row, and you wanted to iterate using two for loops, you want to have the outer for loop increment columns and the inner loop increment rows. I didn't see any obvious issue with this in your code, but I might have missed a spot.
2) LineOfSight could definitely use some improvement in this regard.
3) Shufflings are simd instructions that reorder the elements in a simd register. I don't think any of your code is using simd at the moment but I would have to dump it into Burst.
4) You are already kinda doing this with your open and closed lists.

Another thing you may want to try is instead of using two lists for your open and closed list, just use a bool array specifying whether something is open or closed. The reason for this is you are doing lots of Contains() calls that you wouldn't need to do if it was a bool check at the index instead. There is a tradeoff of you having to sift out the closed list nodes in GetLowestCostFNodeIndex, but I suspect that tradeoff will be a win.

Try those first and then if you still need more performance we can dive into the asm, although I will need to know your target platform by that point.
Click to expand...

Hi Latios. So I implemented your changes, and updated the Git

No more closed list. Only data which I really need to mutate is copied. I moved the NativeLists that were in the update loop to OnStartRunning(), and I just Clear them every update.

Right now, without the path smoothing:

On a 10x10: 1 unit = 0.6ms , 20 units = 2-4 ms
On a 50x50: 1 unit = ~1ms , 20 units = 3-5ms

I know that with a modern CPU these time would be cut way more than half, which is pretty good. But, I really want to try to get it even lower, lol.

I am going to watch the Unite talk about SIMD, and try to use it here if possible at all. I know what it is, but I have zero experience with it.

If possible I am trying to get it to 0.1ms time.

I am targeting Windows platform right now. These timings are all in editor. AMD A10 8700, 8 gb DDR3. Job debugger on.

Last edited: Mar 4, 2020

MintTree117, Mar 4, 2020

#12
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Antypodish said: ↑

How do you copy array?
Do you really need copy whole array?
What is the purpose of copying, to be able work on it.
Multiple jobs can read from same array at the same time.
Click to expand...

I have eliminated most of my copies and it helped, thank you.

MintTree117, Mar 4, 2020

#13

Antypodish likes this.
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

martygrof3708 said: ↑

On a 10x10: 1 unit = 0.6ms , 20 units = 2-4 ms
On a 50x50: 1 unit = ~1ms , 20 units = 3-5ms
Click to expand...

These numbers mean nothing if you don't know how fast the original was with the same grid size and unit count.

Besides measuring the existing improvements so we can tell how bandwidth-bound vs CPU bound we are, the next thing worth doing is commenting out code from the end of the job and running the job. The job results become meaningless, but the profiling time differences with or without the code commented out can give us an idea which blocks are the most expensive. Pick an expensive block and I can walk you through the X86 assembly for that part.

DreamingImLatios, Mar 4, 2020

#14

Enzi likes this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Ok. Here are the timings starting with the first region, then adding each of the other 3.
Note the gridNodeArray is always 100x100, its just for the tests I clicked within 10x10 and 50x50

DataSetup:
- 1 unit 10x10 is 0.3-0.7ms
- 1 unit 50x50 is same
- 20 units 10x10 is ~2.6ms
- 20 units 50x50 is same

DataSetup + FindPath
- 0.3 - 0.7ms
- 0.3 - 0.7ms
- 2 - 5ms
- 18 - 22ms

DataSetup + FindPath + TracePath
- 0.4 - 0.7ms
- 3 - 3.4 ms
- 20-30ms
- 20-30ms

Also, is SoA service-oriented architecture?

MintTree117, Mar 4, 2020

#15
eizenhorn

Joined:

Oct 17, 2016

Posts:

2,685

martygrof3708 said: ↑

Also, is SoA service-oriented architecture?
Click to expand...

Structure Of Arrays

eizenhorn, Mar 4, 2020

#16

MintTree117 likes this.
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

Ok. It seems there are a couple of things you are struggling with. Its going to be a really slow process for me to help you if I can't run the benchmarks myself. If you can set things up such that your single system in OnUpdate instantiates the entities and grid, runs the benchmark for a couple different combinations, and then tears things down again, I can drop that into my test suite and really dig in.

After all, optimization is really a game of educated guess and check. If you aren't doing the checking, you aren't optimizing.

DreamingImLatios, Mar 4, 2020

#17

andrew-lukasik, MintTree117 and Enzi like this.
andrew-lukasik

Joined:

Jan 31, 2013

Posts:

249

Hi. Small tip from me here - optimization is not all about IJobs and Burst. In my limited experience it's important to visualize your input&output data and simply measure how many nodes your given implementation tests while solving for path.
For example, compare results of these two implementations:
A https://i.imgur.com/np3guuA.gif
B https://i.imgur.com/vW5bVeQ.gif
I was stuck there writing memory optimizations (/diminishing returns) but time investment put in visualizing all this this helped me spot what must be really wrong in A. Thanks to that B is 10 times faster - just by fixing heuristic function, tweaking it's arguments and also changing input data (under reported factor imo) a bit (less slopes). All this thanks to humble visualization window (src).

Last edited: Mar 4, 2020

andrew-lukasik, Mar 4, 2020

#18

YakShaver_dc, Enzi, DreamingImLatios and 2 others like this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

Ok. It seems there are a couple of things you are struggling with. Its going to be a really slow process for me to help you if I can't run the benchmarks myself. If you can set things up such that your single system in OnUpdate instantiates the entities and grid, runs the benchmark for a couple different combinations, and then tears things down again, I can drop that into my test suite and really dig in.

After all, optimization is really a game of educated guess and check. If you aren't doing the checking, you aren't optimizing.
Click to expand...

Ok, I've uploaded a bench-marking class. Just import the regiment component and grid class as well (they are small).

You can play with the Constants at the top of the Benchmarking class to change how many units to spawn, how many should find a path, and the grid size. As well, you can change the pathfinding desination. To run the benchmark in playmode, just press "Space".

https://github.com/Tree37/A-Pathfinding-Jobs

To test parts of the job, you can just comment out the different #regions in the job. The only thing you have to do to make it work is always return a NativeList, even an empty one. So just return a NativeList at the end.

Last edited: Mar 4, 2020

MintTree117, Mar 4, 2020

#19
MintTree117

Joined:

Dec 2, 2018

Posts:

340

andrew-lukasik said: ↑

Hi. Small tip from me here - optimization is not all about IJobs and Burst. In my limited experience it's important to visualize your input&output data and simply measure how many nodes your given implementation tests while solving for path.
For example, compare results of these two implementations:
A https://i.imgur.com/np3guuA.gif
B https://i.imgur.com/vW5bVeQ.gif
I was stuck there writing memory optimizations (/diminishing returns) but time investment put in visualizing all this this helped me spot what must be really wrong in A. Thanks to that B is 10 times faster - just by fixing heuristic function, tweaking it's arguments and also changing input data (under reported factor imo) a bit (less slopes). All this thanks to humble visualization window (src).
Click to expand...

What changes did you make to your heuristic? What did you change about your input data?

MintTree117, Mar 4, 2020

#20
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

Ok. It seems there are a couple of things you are struggling with. Its going to be a really slow process for me to help you if I can't run the benchmarks myself. If you can set things up such that your single system in OnUpdate instantiates the entities and grid, runs the benchmark for a couple different combinations, and then tears things down again, I can drop that into my test suite and really dig in.

After all, optimization is really a game of educated guess and check. If you aren't doing the checking, you aren't optimizing.
Click to expand...

Ok, wow!

I simply removed a Calculate_HCost function from inside the neighbour check that I missed. It wasn't supposed to be there. Now 20 units only take 0.5ms to traverse the whole grid!

I also edited my heuristic thanks to
andrew-lukasik

This is a massive boost. I don't know if I should even go through the trouble of writing a native container for a priority qeue anymore, but if you think it will make it run even faster I'll try.

I was thinking about pre-computing the heuristic for every cell to every other cell and storing it, so that I don't have to do this every time I find a path. This will remove Calculate_HCost n-times where n is the number of nodes.

However, I don't know if its a good idea to store a NativeArray inside a struct? When I pass in the pathnode array to the job, am I passing a pointer or a copy?

Last edited: Mar 5, 2020

MintTree117, Mar 5, 2020

#21
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

martygrof3708 said: ↑

I simply removed a Calculate_HCost function from inside the neighbour check that I missed. It wasn't supposed to be there. Now 20 units only take 0.5ms to traverse the whole grid!

I also edited my heuristic thanks to
andrew-lukasik

This is a massive boost. I don't know if I should even go through the trouble of writing a native container for a priority qeue anymore, but if you think it will make it run even faster I'll try.
Click to expand...

Not doing work that you don't need to do is always a good thing! I'm not a pathfinding expert, so I am really glad @andrew-lukasik was able to pitch in. I wouldn't bother with the priority queue for now. There's probably a bunch of other lower-hanging fruits to tackle first.

martygrof3708 said: ↑

I was thinking about pre-computing the heuristic for every cell to every other cell and storing it, so that I don't have to do this every time I find a path. This will remove Calculate_HCost n-times where n is the number of nodes.

However, I don't know if its a good idea to store a NativeArray inside a struct? When I pass in the pathnode array to the job, am I passing a pointer or a copy?
Click to expand...

This is absolutely an optimization worth doing! It is perfectly fine to pack several arrays into a struct. I do it all the time for my acceleration structures. That's what leads to SoA layouts, which I still think is going to help you a lot.

martygrof3708 said: ↑

Currently 50 units can traverse an empty 100X100 grid in about 16-21 ms, in one frame.
Click to expand...

martygrof3708 said: ↑

If possible I am trying to get it to 0.1ms time.
Click to expand...

Am I correlating these numbers correctly? It might just be me being optimistic, but I actually think this is attainable. Anyways, if you decide you want to keep pushing this algorithm further, keep pushing to that repo and I'll get a test harness going that will tabulate all the optimizations and help track future ones.

DreamingImLatios, Mar 5, 2020

#22
Dale_Kim

Unity Technologies

Joined:

May 8, 2019

Posts:

56

Your code suffers from very serious algorithmic scaling issues. https://github.com/Tree37/A-Pathfin...6f045dea464/ParrallelPathfinding.cs#L164-L178 and https://github.com/Tree37/A-Pathfin...6f045dea464/ParrallelPathfinding.cs#L405-L417 both perform linear searches which forces your algorithm to run in quadratic time. See https://en.wikipedia.org/wiki/Dijkstra's_algorithm#Running_time

You need to use a priority queue to speed up the minimum F node extraction and you shouldn't search a list to determine if a node is open or closed. Use flags and toggle flags instead.

Last edited: Mar 5, 2020

Dale_Kim, Mar 5, 2020

#23

YakShaver_dc, andrew-lukasik, JoNax97 and 1 other person like this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

Not doing work that you don't need to do is always a good thing! I'm not a pathfinding expert, so I am really glad @andrew-lukasik was able to pitch in. I wouldn't bother with the priority queue for now. There's probably a bunch of other lower-hanging fruits to tackle first.

This is absolutely an optimization worth doing! It is perfectly fine to pack several arrays into a struct. I do it all the time for my acceleration structures. That's what leads to SoA layouts, which I still think is going to help you a lot.

Am I correlating these numbers correctly? It might just be me being optimistic, but I actually think this is attainable. Anyways, if you decide you want to keep pushing this algorithm further, keep pushing to that repo and I'll get a test harness going that will tabulate all the optimizations and help track future ones.
Click to expand...

Yes, but I was being overly optimistic a little. Even if they can do that in 1 ms thats great, because then I can have potentially hundreds doing this every second.

I have noticed earlier I did a benchmark just after I started my laptop, and the timings we an order of magnitude lower than my previous tests, but my cpu runs hot very quick and the performance goes down so I can't really get realistic results.

However, the same thing seems to happen when I only run the benchmarker, without any rendering or any other systems. This confuses me.

When I try to put a NativeArray into my GridPathNode struct, I get this error:

GridPathNode used in NativeArray<GridPathNode> must be unmanaged (contain no managed types) and cannot itself be a native container type.

Also, I am not sure where to start with SoA, as I'm unfamiliar.

In addition, I have removed the membership-test loops for the open set by have each PathNode contain both a bool field for isOpen and an int field for its index in the openset. This yielded some small performance gains, but again, its not really accurate on my laptop.

I still have to optimize the path smoothing.

I will push to the repo within 20 or so min.

Last edited: Mar 5, 2020

MintTree117, Mar 5, 2020

#24
Dale_Kim

Unity Technologies

Joined:

May 8, 2019

Posts:

56

I have a private sample from a while ago that I worked on which can pathfind ~50 units on a ~100x100 grid in about 0.5 ms (total CPU time) on my 2018 MacBook Pro with 6 cores @ 2.9 ghz.

Scaling it up to 500x500 with around 50 units is about 15 ms total!

There are much faster methods to pathfinding that doesn't require searching a graph which is *even faster* by huge factors (a coworker of mine implemented it but I don't think it's public).

Dale_Kim, Mar 5, 2020

#25

pal_trefall, andrew-lukasik and MintTree117 like this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Dale_Kim said: ↑

I have a private sample from a while ago that I worked on which can pathfind ~50 units on a ~100x100 grid in about 0.5 ms (total CPU time) on my 2018 MacBook Pro with 6 cores @ 2.9 ghz.

View attachment 571849

Scaling it up to 500x500 with around 50 units is about 15 ms total!

View attachment 571852

There are much faster methods to pathfinding that doesn't require searching a graph which is *even faster* by huge factors (a coworker of mine implemented it but I don't think it's public).
Click to expand...

Very cool! Thanks for sharing.

MintTree117, Mar 5, 2020

#26
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Dale_Kim said: ↑

Your code suffers from very serious algorithmic scaling issues. https://github.com/Tree37/A-Pathfin...6f045dea464/ParrallelPathfinding.cs#L164-L178 and https://github.com/Tree37/A-Pathfin...6f045dea464/ParrallelPathfinding.cs#L405-L417 both perform linear searches which forces your algorithm to run in quadratic time. See https://en.wikipedia.org/wiki/Dijkstra's_algorithm#Running_time

You need to use a priority queue to speed up the minimum F node extraction and you should search a list to determine if a node is open or closed. Use flags and toggle flags instead.
Click to expand...

Oh I just saw this, sorry. I actually just removed the membership tests with the flags as you said. The only thing I didn't do yet it the priority queue. The git is updated.

The benchmarker gets around the same results as you posted. But when I add in the rendering and other systems the time goes up 1 order of magnitude, and I don't understand why.

Could it be because my entities in my normal program have alot more data attached to them vs in the benchmarker? The find path jobs are exactly the same though.

Last edited: Mar 5, 2020

MintTree117, Mar 5, 2020

#27
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

martygrof3708 said: ↑

The benchmarker gets around the same results as you posted. But when I add in the rendering and other systems the time goes up 1 order of magnitude, and I don't understand why.

Could it be because my entities in my normal program have alot more data attached to them vs in the benchmarker? The find path jobs are exactly the same though.
Click to expand...

Best way to find out is to look at the Profiler Timeline view and compare in game vs benchmark test. Share screenshots if you feel inclined.

DreamingImLatios, Mar 5, 2020

#28
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

Best way to find out is to look at the Profiler Timeline view and compare in game vs benchmark test. Share screenshots if you feel inclined.
Click to expand...

Real:

Benchmark:

I have no idea why this is happening. The only difference between the two are that in the normal version, there is a large cube for a ground, and each unity has 16 sub-units each that are being rendered. The pathfinding does not run on the sub-units.

MintTree117, Mar 5, 2020

#29
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

I always forget that you can actually send a profile capture to someone and they can open it even without the project. You may want to try that. I need to see the worker threads.

DreamingImLatios, Mar 5, 2020

#30
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

I always forget that you can actually send a profile capture to someone and they can open it even without the project. You may want to try that. I need to see the worker threads.
Click to expand...

The files are too large to post here, I pushed them to the repo.

MintTree117, Mar 5, 2020

#31
Sarkahn

Joined:

Jan 9, 2013

Posts:

440

Hey, if you're interested I've been working on a priority queue for the past few days.
https://github.com/sarkahn/nativecontainers/tree/master/Assets/NativeContainers

It's based off this one I found but I modified it to be job-safe, generic, and to use an UnsafeList internally so it will automatically increase capacity as you enqueue nodes. If you're not familiar with writing job-safe native containers it can be annoying so feel free to use it or build off it if you want.

Thanks for sharing your pathfinding journey!

Sarkahn, Mar 5, 2020

#32

YakShaver_dc, Krajca and MintTree117 like this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Sarkahn said: ↑

Hey, if you're interested I've been working on a priority queue for the past few days.
https://github.com/sarkahn/nativecontainers/tree/master/Assets/NativeContainers

It's based off this one I found but I modified it to be job-safe, generic, and to use an UnsafeList internally so it will automatically increase capacity as you enqueue nodes. If you're not familiar with writing job-safe native containers it can be annoying so feel free to use it or build off it if you want.

Thanks for sharing your pathfinding journey!
Click to expand...

Hey thanks alot! I am trying to learn them, and I will use yours to learn as well. If I wanted to use a fixed array instead of a list, is the type void* or int*? I am not familiar with c#/unity pointers yet.

It looks like I knew less than I thought I did, lol.

Last edited: Mar 5, 2020

MintTree117, Mar 5, 2020

#33
Sarkahn

Joined:

Jan 9, 2013

Posts:

440

Either should work I think, you can use UnsafeUtility to perform all your array operations on your pointer.

Jackson Dunstan has a really nice repo and a detailed set of articles about custom nativecontainers, highly recommended.

https://jacksondunstan.com/articles/4734

Sarkahn, Mar 5, 2020

#34

MintTree117 likes this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Dale_Kim said: ↑

Your code suffers from very serious algorithmic scaling issues. https://github.com/Tree37/A-Pathfin...6f045dea464/ParrallelPathfinding.cs#L164-L178 and https://github.com/Tree37/A-Pathfin...6f045dea464/ParrallelPathfinding.cs#L405-L417 both perform linear searches which forces your algorithm to run in quadratic time. See https://en.wikipedia.org/wiki/Dijkstra's_algorithm#Running_time

You need to use a priority queue to speed up the minimum F node extraction and you should search a list to determine if a node is open or closed. Use flags and toggle flags instead.
Click to expand...

If I use a priority queue, extraction is faster, but insertion is slower.

MintTree117, Mar 5, 2020

#35
Micz84

Joined:

Jul 21, 2012

Posts:

451

Some time ago I was implementing a flow field in DOTS and big optimisation I got from precalculating all valid neighbours. I have done it like this:
I have created a MultiHashmap that has int as a key and int as values. A key represents all possible combinations of walkable neighbours as shown in the image below (black means unwalkable node):

I values I store an index to the direction vector.
Direction vector array has 8 int2 vectors [(-1,-1),(0,-1),(1,-1)...]
So this is how the hash map looks like [key]<values>:
[0]<0,1,2,3,4,5,6,7>
[1]<1,2,3,4,5,6,7>
...
[73]<1,2,4,5,7>
...
Each map node has its code calculated during the generation phase. This way I do not have to check for walkability and borders of the map. Maybe this will give you some performance too.
In a situation like this:

Node gets code 14 instead of 10. 10 is not needed in my hashmap because there will be no node with this code.

Micz84, Mar 5, 2020

#36

YakShaver_dc, EduardoLauer and MintTree117 like this.
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Micz84 said: ↑

Some time ago I was implementing a flow field in DOTS and big optimisation I got from precalculating all valid neighbours. I have done it like this:
I have created a MultiHashmap that has int as a key and int as values. A key represents all possible combinations of walkable neighbours as shown in the image below (black means unwalkable node):
View attachment 572122
I values I store an index to the direction vector.
Direction vector array has 8 int2 vectors [(-1,-1),(0,-1),(1,-1)...]
So this is how the hash map looks like [key]<values>:
[0]<0,1,2,3,4,5,6,7>
[1]<1,2,3,4,5,6,7>
...
[73]<1,2,4,5,7>
...
Each map node has its code calculated during the generation phase. This way I do not have to check for walkability and borders of the map. Maybe this will give you some performance too.
In a situation like this:
View attachment 572119
Node gets code 14 instead of 10. 10 is not needed in my hashmap because there will be no node with this code.
Click to expand...

This is a nice optimization, didn't think of this. Thanks for sharing.

I want to implement a flow field as well. Do you have any references I could learn from?

MintTree117, Mar 5, 2020

#37
Micz84

Joined:

Jul 21, 2012

Posts:

451
martygrof3708 said: ↑

This is a nice optimization, didn't think of this. Thanks for sharing.
Click to expand...

Thanks . When I have started to look for optimisations I wanted to get rid of branching. Checking for walls and out of bounds area were most common.

martygrof3708 said: ↑

I want to implement a flow field as well. Do you have any references I could learn from?
Click to expand...

I have watched a talk about porting to jobs. I can't find it now maybe someone else will have a link to it.
During the talk, the factory example from this repository was taken as base. It was jobified. I used this jobified version as my base and move it to DOTS. Additionally, I have tried to optimise it as much I could. Probably something more can be done. For map 200x200 and about 30% are walls it takes 2-3 ms to generate flowfield. Unfortunately, the cost field can't be done in parallel, and it is the most expensive part.
Here is my cost field job:

Code (CSharp):

private struct GenerateCostField : IJob

{

[ReadOnly] public MapData map;

public NativeArray<float> stepField;

public NativeList<int2> openSet;

public NativeList<int2> nextSet;

[ReadOnly] public NativeMultiHashMap<int, int> directions;

[ReadOnly] public NativeArray<int2> moveDirs;

[ReadOnly] public NativeArray<float> moveDirsLength;

[ReadOnly] public int mapWidth;

public void Execute()

{

while (openSet.Length > 0)

{

// repeat until out of tiles

for (var j = 0; j < openSet.Length; j++)

{

var tile = openSet[j];

var x = tile.x;

var y = tile.y;

var index = x + y * mapWidth;

var existingCost = stepField[index];

var dirs = directions.GetValuesForKey(map.tiles[index]

.availableDirectionsCode);

while (dirs.MoveNext())

{

var dirIndex = dirs.Current;

var moveDir = moveDirs[dirIndex];

var newTile = new int2(x + moveDir.x, y + moveDir.y);

var mapTile = map[newTile];

var moveCost = mapTile.moveCost + mapTile.agents * 0.1f;

var newIndex = newTile.x + newTile.y * mapWidth;

var cost = moveCost * moveDirsLength[dirIndex] + existingCost;

if (!(cost < stepField[newIndex])) continue;

stepField[newIndex] = cost;

nextSet.Add(newTile);

}

}

var temp = openSet;

openSet = nextSet;

nextSet = temp;

nextSet.Clear();

}

}

}

Probably I could use index to tiles array insted of int2 coordinate, and then in MultiHashMap instead of index to int2[] array of direction store an offset of an index. It is known during generation of this has map. Maybe I will try this approach. In my implementaion number of agents effects movement cost for a tile.
Micz84, Mar 5, 2020

#38
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

martygrof3708 said: ↑

The files are too large to post here, I pushed them to the repo.
Click to expand...

I'm not seeing massive spikes like what you showed. Any spikes I do see, the jobs are running without Burst enabled. Remember that without SynchronousCompilation, the first few times the job runs it will run without Burst while the Burst Compiler compiles it.

DreamingImLatios, Mar 5, 2020

#39
MintTree117

Joined:

Dec 2, 2018

Posts:

340

DreamingImLatios said: ↑

I'm not seeing massive spikes like what you showed. Any spikes I do see, the jobs are running without Burst enabled. Remember that without SynchronousCompilation, the first few times the job runs it will run without Burst while the Burst Compiler compiles it.
Click to expand...

Well... The slower one wasn't BurstCompiled.

I couldn't stop laughing. Everything is less than 1 ms now, I think we have achieved the goal. Thank you again for your help! I'm still going to implement a priority queue and do a few other small optimizations and I'll post the final results whenever I finish if anyone is interested.

MintTree117, Mar 5, 2020

#40
Dale_Kim

Unity Technologies

Joined:

May 8, 2019

Posts:

56

martygrof3708 said: ↑

If I use a priority queue, extraction is faster, but insertion is slower.
Click to expand...

Insertion into a priority queue will always be slower compared to appending to the back of list, but insertion is not the critical operation in A*, extracting the minimum is. I would expect that using flags for the closed set and a priority queue for open set to speed up your search dramatically. If it isn't, it's likely an implementation specific issue where you're copying buffers excessively or your heuristic is too conservative and doesn't head toward the destination correctly (which causes the search frontier, aka the open set, to be much larger than it should be).

Dale_Kim, Mar 5, 2020

#41
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Dale_Kim said: ↑

Insertion into a priority queue will always be slower compared to appending to the back of list, but insertion is not the critical operation in A*, extracting the minimum is. I would expect that using flags for the closed set and a priority queue for open set to speed up your search dramatically. If it isn't, it's likely an implementation specific issue where you're copying buffers excessively or your heuristic is too conservative and doesn't head toward the destination correctly (which causes the search frontier, aka the open set, to be much larger than it should be).
Click to expand...

Yea I was mistaken. I am adding in the priority queue and already have flags.

MintTree117, Mar 5, 2020

#42
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Sarkahn said: ↑

Hey, if you're interested I've been working on a priority queue for the past few days.
https://github.com/sarkahn/nativecontainers/tree/master/Assets/NativeContainers

It's based off this one I found but I modified it to be job-safe, generic, and to use an UnsafeList internally so it will automatically increase capacity as you enqueue nodes. If you're not familiar with writing job-safe native containers it can be annoying so feel free to use it or build off it if you want.

Thanks for sharing your pathfinding journey!
Click to expand...

How do I pass data into the queue? For example, if I just wanted the queue to store a cost and an index, and return me the index when I dequeue?

MintTree117, Mar 5, 2020

#43
thebanjomatic

Joined:

Nov 13, 2016

Posts:

36

Sarkahn said: ↑

Hey, if you're interested I've been working on a priority queue for the past few days.
https://github.com/sarkahn/nativecontainers/tree/master/Assets/NativeContainers

It's based off this one I found but I modified it to be job-safe, generic, and to use an UnsafeList internally so it will automatically increase capacity as you enqueue nodes. If you're not familiar with writing job-safe native containers it can be annoying so feel free to use it or build off it if you want.

Thanks for sharing your pathfinding journey!
Click to expand...

I just tried dropping this into my A* test project. I was curious because in my own test cases, the linear array search still outperforms my priority-queue based solution. I was hoping yours might have been faster than mine, but the results I'm seeing aren't adding up to my expectations though.

1000x1000 grid, no obstacles. 50 path searches over 8 cores
Search goes from 0,0 to 999,999 and the maximum size of the open list over the course of the search hits 395 items.

Linear array search: 1.5ms
My MinHeap with index map to accelerate updating priorites: 2.35ms
NativeContainers.NativePriorityQueue: 45ms

I'm still not sure why either approach should be slower than the linear case, but who knows.

thebanjomatic, Mar 5, 2020

#44
Sarkahn

Joined:

Jan 9, 2013

Posts:

440

martygrof3708 said: ↑

How do I pass data into the queue? For example, if I just wanted the queue to store a cost and an index, and return me the index when I dequeue?
Click to expand...

It's a generic container so just create a struct with whatever you want to store and use that or just a plain unmanaged value as the generic parameter. The node you get back when you dequeue stores your data in Node.value.

thebanjomatic said: ↑

I just tried dropping this into my A* test project. I was curious because in my own test cases, the linear array search still outperforms my priority-queue based solution. I was hoping yours might have been faster than mine, but the results I'm seeing aren't adding up to my expectations though.

1000x1000 grid, no obstacles. 50 path searches over 8 cores
Search goes from 0,0 to 999,999 and the maximum size of the open list over the course of the search hits 395 items.

Linear array search: 1.5ms
My MinHeap with index map to accelerate updating priorites: 2.35ms
NativeContainers.NativePriorityQueue: 45ms

I'm still not sure why either approach should be slower than the linear case, but who knows.
Click to expand...

I'm not sure tbh, like I said it's based off a C# version I found so Im not too familiar with it and I haven't had a chance to test it in an algorithm and profile it yet. Only thing I could suggest is to make sure you set the initial capacity high enough so it doesnt need to resize while working. And are you sure you ran it in burst? That seems unusually slow

Sarkahn, Mar 5, 2020

#45
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

thebanjomatic said: ↑

I'm still not sure why either approach should be slower than the linear case, but who knows.
Click to expand...

Try a min-heap that instead of using a b-tree uses a larger number of leaves per node.

DreamingImLatios, Mar 6, 2020

#46
Dale_Kim

Unity Technologies

Joined:

May 8, 2019

Posts:

56

thebanjomatic said: ↑

I just tried dropping this into my A* test project. I was curious because in my own test cases, the linear array search still outperforms my priority-queue based solution. I was hoping yours might have been faster than mine, but the results I'm seeing aren't adding up to my expectations though.

1000x1000 grid, no obstacles. 50 path searches over 8 cores
Search goes from 0,0 to 999,999 and the maximum size of the open list over the course of the search hits 395 items.

Linear array search: 1.5ms
My MinHeap with index map to accelerate updating priorites: 2.35ms
NativeContainers.NativePriorityQueue: 45ms

I'm still not sure why either approach should be slower than the linear case, but who knows.
Click to expand...

There is no way to know for sure without your code. Can you share it?

Dale_Kim, Mar 6, 2020

#47
MintTree117

Joined:

Dec 2, 2018

Posts:

340

Dale_Kim said: ↑

There is no way to know for sure without your code. Can you share it?
Click to expand...

I just implemented this priority queue:

https://github.com/sarkahn/nativecontainers/tree/master/Assets/NativeContainers

And my timings went up by more than double?

MintTree117, Mar 6, 2020

#48
DreamingImLatios

Joined:

Jun 3, 2017

Posts:

4,271

martygrof3708 said: ↑

I just implemented this priority queue:

https://github.com/sarkahn/nativecontainers/tree/master/Assets/NativeContainers

And my timings went up by more than double?
Click to expand...

It is not a very fast priority queue. In addition, there's always some threshold where below which a linear algorithm is going to beat a fancier data structure. Slower implementations of the fancier data structure will have a much higher threshold. You are well-below that threshold right now.

Unless you are not satisfied with the performance yet, it may be better worth your time to take a break and focus on the rest of your game. It'll be a good problem to revisit with a little more DOTS practice and more aggressive requirements to meet. I've been crunching pretty hard to meet a personal deadline this month. But after that, I would love to take a real shot at this algorithm, though someone may need to remind me!

DreamingImLatios, Mar 7, 2020

#49

Krajca and MintTree117 like this.
thebanjomatic

Joined:

Nov 13, 2016

Posts:

36

Dale_Kim said: ↑

There is no way to know for sure without your code. Can you share it?
Click to expand...

@Dale_Kim
I had to refactor things to upload when I got home from work. This project should demonstrate the difference:
https://github.com/thebanjomatic/TestPathfinding

I changed things around so that I now do a search from (0,0) to (999, 0) with a single wall in the middle from (500, 0) to (500, 998).

ArrayOpenList 4.52ms
MinHeapOpenList: 5.68ms

I haven't done an exhaustive analysis of the minheap implementation, but it appears to be working correctly.

thebanjomatic, Mar 7, 2020

#50

(You must log in or sign up to reply here.)

Page 1 of 2