Why does Burst generate so many instructions for collapsing a bool4?

DreamingImLatios · Apr 4, 2019

I'm investigating different options for optimizing a hot path in my code. I managed to get Burst to vectorize my comparison, which turned out to be a win, but it is not as big a win as it could be because Burst is doing these crazy gymnastics to get me a single bool from a bool4. Why is it not using vtestps?

xoofx · Apr 5, 2019

DreamingImLatios said:

Why is it not using vtestps?

Likely an oversight on our AVX codegen paths (as it is only available there). We will have a look.

Although note that while AVX is supported in the inspector, it is currently not supported in the player mode and the JIT as we have to correctly handle AVX/SSE transitions.

5argon · Apr 5, 2019

At first I thought it was because:
1. `vtestps` puts the comparison result into the flags register (which can't be read out directly?) for use with a branch instruction, but math.any/all can also be used to get an assignable result. So this sequence is there just in case you want to store the result.
2. Burst is not aware that you are going to use the result of any/all exclusively in an `if` clause without storing it anywhere (in which case vtestps looks to be the best solution).

Great to know that AVX codegen could detect the usage to that extent!

DreamingImLatios · Apr 5, 2019

xoofx said:

Although note that while AVX is supported in the inspector, it is currently not supported in the player mode and the JIT as we have to correctly handle AVX/SSE transitions.

Wait, now I am confused. I am using [PerformanceTest] in the editor to get a rough idea of how different generated assembly performs on my CPU (5820K). I assumed that what Burst showed for the "auto" option was what my computer was running.

Am I looking at the wrong assembly? If so, what assembly should I be looking at?

Also, all of the SSE versions are doing similar gymnastics. movmskps can do a similar shortcut. So it seems like it is a missing optimization across the board.

xoofx · Apr 5, 2019

DreamingImLatios said:

Am I looking at the wrong assembly? If so, what assembly should I be looking at?

Currently it is X64_SSE4 if you are running on x64 and your CPU has at minimum SSE4.2 support; otherwise it will fall back to X64_SSE2.

DreamingImLatios · Jul 5, 2019

3 months later, and this issue has been quietly fixed, sort of...

TL;DR: math.bitmask is awesome and I am seeing awesome speedups! It is superior to math.any/all.

So I use the performance test framework to optimize my algorithms. In one algorithm, part of my inner loop tests 4 "lesser" floats against 4 "greater" floats and branches based on whether or not the "lessers" are truly lesser than the "greaters". In my particular test, data alignment sucks and any SIMD optimization requires that each float be manually packed into the xmm register one by one.

I have three variations of this algorithm all using the same input data.

Naive - I have four comparisons in C# ORed together. Burst produces 4 jump instructions, which thrash the branch predictor.

Better - I make two float4s and initialize them to the lessers and greaters. This becomes 2 moves and 6 inserts into the xmm registers. I then do a single comparison between the float4s and do a math.any/all (I tried a few different combinations and they all lead to the same instruction count as in my first post).

New - I did the same thing as "Better" but instead of the math.any/all I checked if the bitmask of the bool4 was equal to 0.

As of posting this, this requires the 1.1.0 previews of Burst and Mathematics.

Unfortunately math.any/all don't seem to use these intrinsics yet, which is why I said the issue was only "sort of" solved. Do I care? Not really. I'm just hyped to have this performance!
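A rough sketch of the idea behind these variants, written in C with SSE intrinsics rather than Burst C#. This is an illustration under assumptions, not the poster's code: the function names are made up, the exact comparison and branch condition from the real algorithm are not shown in the thread, and `_mm_cmplt_ps` followed by `_mm_movemask_ps` only approximates what a packed float4 compare plus a bitmask collapse can compile to.

```c
#include <stdbool.h>
#include <xmmintrin.h> /* SSE intrinsics; x86/x64 only */

/* "Naive" style: four scalar comparisons ORed together.
   Short-circuit || may compile to several conditional jumps. */
static bool any_less_naive(const float lesser[4], const float greater[4])
{
    return lesser[0] < greater[0] || lesser[1] < greater[1] ||
           lesser[2] < greater[2] || lesser[3] < greater[3];
}

/* Packed style: build two vectors lane by lane (like filling two float4s),
   do one packed compare, then collapse the four lane results into the low
   4 bits of an int. Bit i is set when lesser[i] < greater[i]. */
static int less_mask(const float lesser[4], const float greater[4])
{
    /* _mm_set_ps takes its arguments as (lane3, lane2, lane1, lane0). */
    __m128 a = _mm_set_ps(lesser[3], lesser[2], lesser[1], lesser[0]);
    __m128 b = _mm_set_ps(greater[3], greater[2], greater[1], greater[0]);
    __m128 lt = _mm_cmplt_ps(a, b); /* cmpltps: all-ones lane where a < b */
    return _mm_movemask_ps(lt);     /* movmskps: one sign bit per lane */
}

/* "any" and "all" become a single integer compare on the mask. */
static bool any_less(const float lesser[4], const float greater[4])
{
    return less_mask(lesser, greater) != 0;
}

static bool all_less(const float lesser[4], const float greater[4])
{
    return less_mask(lesser, greater) == 0xF;
}
```

Testing the mask against 0 or 0xF leaves one compare and one branch, instead of a chain of per-lane jumps.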

sheredom · Nov 7, 2019

Just a heads up that I looked at the codegen for math.any/math.all in the 1.2.0-preview.9 release, and made it more optimal.

DreamingImLatios · Nov 8, 2019

sheredom said:

Just a heads up that I looked at the codegen for math.any/math.all in the 1.2.0-preview.9 release, and made it more optimal.

Cool! It does seem to be slightly more optimal compared to what it used to be. It got rid of the pand right after the compare.

However, bitmask still is a fair bit faster.

And this is what happens when my data alignment doesn't suck (from the real algorithm and not the test case).

I'm still not sure what the movups are about, as they are both loads from a NativeSlice<float4> and I would expect them to be movaps. But it is in tight loops like these where optimized operations on bool4 can make a pretty big difference performance-wise.

sheredom · Nov 11, 2019

This was still annoying me so I looked further. It turns out one of the reasons the codegen wasn't perfect was that we were using an ordered not-equal compare (e.g. special handling for NaNs), whereas we should have been using an unordered not-equal compare (which is what C# actually does).

So I'm fixing this in a future release of Burst - thanks for nitpicking on this performance issue because it unveiled an underlying bug between Burst and C#!
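To make the ordered versus unordered distinction concrete, here is a small sketch in C with SSE intrinsics. It is not Burst's codegen: the helper names are hypothetical, and building the ordered variant from `cmpneqps` plus `cmpordps` is just one assumed way to express it in plain SSE.

```c
#include <math.h>      /* NAN */
#include <xmmintrin.h> /* SSE intrinsics; x86/x64 only */

/* Unordered not-equal: a lane is true when the values differ OR either
   one is NaN. This matches the != operator on floats, and it is a single
   cmpneqps. Bit i of the result corresponds to lane i. */
static int neq_unordered_mask(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    return _mm_movemask_ps(_mm_cmpneq_ps(va, vb));
}

/* Ordered not-equal: a lane is true only when the values differ AND
   neither one is NaN. Here it is built from two compares and an AND. */
static int neq_ordered_mask(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 neq = _mm_cmpneq_ps(va, vb); /* differ or NaN */
    __m128 ord = _mm_cmpord_ps(va, vb); /* neither is NaN */
    return _mm_movemask_ps(_mm_and_ps(neq, ord));
}
```

The two masks differ only in lanes that contain a NaN, which is where the ordered form disagrees with `!=`.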

sngdan · Nov 11, 2019

Great, just seeing this now. I ran extensive tests on math.any/all for a broad phase and they were not any faster than plain if statements.

I admit this was ages ago. Seems like a good time to revisit.

DreamingImLatios · Nov 12, 2019

Yup! That is how I discovered the issue too. Just out of curiosity, what broadphase did you end up using? IIRC you were doing something with voxels?

Anyways, the code I tend to write often exposes missing compiler optimizations, so I will be sure to point out more of them as time goes on.

DreamingImLatios · Feb 15, 2020

This appears to be fixed in 1.3!
