2D Array access optimisation and vectorisation

threedots1 · Apr 28, 2021

Just looking into optimising reading and processing my map/navigation graphs.

These are currently 1D arrays of bit packed integers accessed by index calculated by the standard methods:

Code (CSharp):

int index = y * width + x;

And converting the index back to a cell is done by:

Code (CSharp):

int2 cell = new int2(index % width, index / width);

In my current naive implementation the flood fill and pathfinding algorithms process a single neighbour at a time. I'm currently writing the code to process 4 neighbours at a time, to both eliminate branching and take advantage of SIMD. There's a lot of continues; when checking inbounds etc etc that I can probably eliminate.

I was thinking about other issues with the actual layout of the 1D array. The algos working on the graphs retrieve neighbours both along the x axis and y axis every step, and this would lead to cache misses when reading +- on the y axis. I understand there's techniques to mitigate this such as storing the graphs in Z-Order/Morton order. Has anyone had experience with implementing this? Have they found the more linear memory access offsets the extra cost of calculating the indices? Anyone have any further thoughts on this?

Another question, is it possible to indicate to the Burst compiler than an integer passed into a function is guaranteed to be a power of two, as with the dimensions of my graphs? Or is it worth manually optimising the function to perform bit shifts? Is there any optimisations that can be done to the % to take advantage of power of two widths?

DreamingImLatios · Apr 28, 2021

CPUs have multiple prefetch streams, so while iterating through a column is a bad idea, blasting through rows three rows at a time isn't that big of a deal.

Burst has an Assume intrinsic which you can use to provide it expressions it should know to always be true.

If you can afford to cache two more values along with the width, you can cache the number of bits to shift, which will be faster than any optimization Burst can do. You can also cache a bitmask and use bitwise AND (& operator) to get the least significant bits which represent the modulus.

threedots1 · Apr 28, 2021

DreamingImLatios said: ↑

CPUs have multiple prefetch streams, so while iterating through a column is a bad idea, blasting through rows three rows at a time isn't that big of a deal.

Burst has an Assume intrinsic which you can use to provide it expressions it should know to always be true.

If you can afford to cache two more values along with the width, you can cache the number of bits to shift, which will be faster than any optimization Burst can do. You can also cache a bitmask and use bitwise AND (& operator) to get the least significant bits which represent the modulus.
Click to expand...

Yeah I was considering using the Assume intrinsic over explicitly vectorising my code, but I have a feeling vectorising manually will be faster.

Those optimisations for div/mod look great, thanks.

Pathfinding/floodfill algos can sort of travel straight up/down the y, and my path smoothing iterates along whatever direction the path goes, which obviously can be the worst case of travelling the y axis. I guess the only way to figure out what graph structure is better is to try it and see. Anybody have any personal results using z-order indexing?

DreamingImLatios · Apr 28, 2021

For path smoothing, use bitmasks for your grid directly, representing a node with a single bit (traversable or not).

For floodfill, there's a row-prioritized algorithm that can be vectorized and only requires 4 circular buffers and no recursion. It isn't for the faint of heart though and I don't know a good publicly available implementation off the top of my head.

It has been a while since I did grid-based pathfinding where a naïve solution wasn't good enough. There's lots of tricks, but it really depends on your traversal tendencies.

threedots1 · Apr 28, 2021

Yeah my nodes have passability, connections and costs etc specified/read from a single int. I can't quite use a single bit to respresent the navgraph while smoothing as I'm limiting the smoothing to tiles with the same cost so I don't smooth out the agent navigating through favourable terrain.

Row prioritisation of flood fill seems like an interesting idea. I guess you'd size the circular buffers for map width, and try and traverse as far as possible along x before ticking up y. I can only see a need for 2 buffers here but I haven't thought much about it yet... I might have a play with that.

Oh the current stuff I have here is quite fast and handles a lot of entities quite well, but there's always room for optimisation. Especially since reading and traversing all the graphs is such a fundamental part of the program. That and optimising is fun, and sometimes it's good to take a break from the heavier stuff..

DreamingImLatios · Apr 28, 2021

threedots1 said: ↑

Yeah my nodes have passability, connections and costs etc specified/read from a single int. I can't quite use a single bit to respresent the navgraph while smoothing as I'm limiting the smoothing to tiles with the same cost so I don't smooth out the agent navigating through favourable terrain.
Click to expand...

You can precompute bitmasks for each cost. It's an easy win if you have multiple entities on the same grid or if your smoothing algorithm is greater than O(n) complexity for a decent-length path.

threedots1 · Apr 28, 2021

DreamingImLatios said: ↑

You can precompute bitmasks for each cost. It's an easy win if you have multiple entities on the same grid or if your smoothing algorithm is greater than O(n) complexity for a decent-length path.
Click to expand...

Great idea. Memory footprint of the current tile cost would be 1/32 of the whole map and the checks would be cheap too. Thanks again.

After reading a handful of papers on Morton indexing I don't think I'm going to bother with it. Looks like row major ordering is faster in most cases, even for y traversal. All the math to calculate the index is too expensive, plus the CPU is probably doing some magic prefetching behind the scenes. I'll just stick to vectorising the index calculations and inbounds checks etc, and removing branching.

Razmot · Apr 28, 2021

instead of p.z*w*h + p.y * w + p.x;
I use math.csuml(p * IdxMul ) where IdxMul is a static float3 like float3(16,8,16) for example.

csum is an intrinsic with burst but I don't think it changes much in terms of perf - but it's convenient and helps avoiding some dumb typo bugs.

Search Unity

Unity ID

Useful Searches

2D Array access optimisation and vectorisation