Search Unity

Why does Burst generate so many instructions for collapsing a bool4?

Discussion in 'Burst' started by DreamingImLatios, Apr 4, 2019.

  1. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    I'm investigating different options for optimizing a hot path in my code. I managed to get Burst to vectorize my comparison which turned out to be a win, but it is not as big of a win as it could be because Burst is doing this crazy gymnastics to get me a single bool from a bool4. Why is it not using vtestps?

    upload_2019-4-4_0-53-8.png
     
    5argon and yc960 like this.
  2. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Likely an oversight on our AVX codegen paths (as it is only available there). We will have a look.

    Although note that while AVX is supported in the inspector, it is currently not supported in the player mode and the JIT as we have to correctly handle AVX/SSE transitions.
     
  3. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    At first I thought it is because
    1. `vtestps` is to put the comparison result to flag registers (which can't be taken out?) for use with branch command, but math.any/all could also be used to get assignable result. So this sequence is to ensure just in case you want to store the result.
    2. Burst is not aware that you are going to use the result of any/all exclusively with `if` clause without storing it anywhere (in which case vtestps looks to be the best solution)

    Great to know that AVX code gen count detect the usage up to that extent!
     
    Last edited: Apr 5, 2019
  4. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Wait. Now I am confused. I am using [PerformanceTest] in the editor to try and get a rough idea how different generated assembly performs on my CPU (5820K). I assumed that what Burst showed for the "auto" option was what my computer was running.

    Am I looking at the wrong assembly? If so, what assembly should I be looking at?

    Also, all of the SSE versions are doing similar gymnastics. movmskps can do a similar shortcut. So it seems like it is a missing optimization across the board.
     
  5. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Currently it is X64_SSE4 if you are running on x64 and your CPU has at minimum SSE4.2 support, otherwise it will fallback to X64_SSE2
     
  6. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    3 months later, and this issue has been quietly fixed, sortof...

    TL;DR: math.bitmask is awesome and I am seeing awesome speedups! It is superior to math.any/all.

    So I use the performance test framework to optimize my algorithms. And in one algorithm, part of my inner loop tests 4 "lesser" floats against 4 "greater" floats and branches based on whether or not the "lessers" are truly lesser than the "greaters". In my particular test, data alignment sucks and any simd optimizations requires that each float be manually packed into the xmm register one by one.

    I have three variations of this algorithm all using the same input data.
    • Naive - I have four comparisons in C# ORed together. Burst produces 4 jump commands which thrashes the branch predictor.
    • Better - I make two float4s and initializing them to the lessers and greaters. This becomes 2 moves and 6 inserts into the xmm registers. I then do a single comparison between the float4s and do a math.any/all (I tried a few different combinations and they all lead to the same instruction count as my first post).
    • New - I did the same thing as "Better" but instead of the math.any/all I checked if the bitmask of the bool4 was equal to 0.
    upload_2019-7-5_16-17-47.png

    As of posting this, this requires the 1.1.0 previews of Burst and Mathematics.

    Unfortunately math.any/all don't seem to use these intrinsics yet, which is why I said the issue was only "sortof" solved. Do I care? Not really. I'm just hyped to have this performance!
     
  7. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    Just a heads up that I looked at the codegen for math.any/math.all in the 1.2.0-preview.9 release, and made it more optimal.
     
    MNNoxMortem, 5argon, JesOb and 2 others like this.
  8. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Cool! It does seem to be slightly more optimal compared to what it used to be. It got rid of the pand right after the compare.
    upload_2019-11-7_23-47-1.png

    However, bitmask still is a fair bit faster.
    upload_2019-11-7_23-49-57.png

    And this is what happens when my data alignment doesn't suck (from the real algorithm and not the test case).
    upload_2019-11-8_0-6-20.png
    I'm still not sure what the movups are about as they are both loads from a NativeSllice<float4> and I would expect them to be movaps. But it is in tight loops like these where optimized operations on bool4 can make a pretty big difference performance-wise.
     
  9. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    This was still annoying me so I looked further - turns out one of the reasons the codegen wasn't perfect was we were using an ordered not-equal compare (EG. special handling for NaNs), whereas we should have been using an unordered not-equal compare (which is what C# actually does).

    So I'm fixing this in a future release of Burst - thanks for nitpicking on this performance issue because it unveiled an underlying bug between Burst and C#! :D
     
    funkyCoty, ippdev, Peter77 and 10 others like this.
  10. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,154
    Great - just see this now - I had extensive tests on math.any / .all for a broad phase and they were not any faster than if.

    i admit this was ages ago - seems like a good time to revisit
     
  11. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Yup! That is how I discovered the issue too. Just out of curiosity, what broadphase did you end up using? IIRC you were doing something with voxels?

    Anyways, the code I tend to write often exposes missing compiler optimizations, so I will be sure to point out more of them as time goes on.
     
  12. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    This appears to be fixed in 1.3!

    upload_2020-2-15_16-34-20.png
     
    chadfranklin47, sheredom and Megafunk like this.