My Burst Feature Wish List

Discussion in 'Burst' started by Mortuus17, Jan 7, 2022.

  1. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    Disclaimer: I claim to be a power user of Burst. I use every single feature on a regular basis to optimize small and large functions alike, which lets me abstract them away for reuse while knowing that the code gen will be fantastic. Burst has come a long way, particularly with Intrinsics and, more recently, the
    Constant.IsConstantExpression<T>
    compiler features. YOU ARE DOING GREAT WORK!
    But there are a few things that I - as a power user - would still like to have, which I have compiled in the following list.


    1:
    TResult ForceCompileTimeEvaluation<TParams, TResult>(Func<TParams, TResult> code, TParams parameterObj)

    This comes from a bug report I submitted: a pure function with constant arguments that contained 4 if statements was constant evaluated, but no longer once I added a fifth if statement, which was very frustrating. This issue has not been fixed yet, although I was told that it should be fixed with Burst 1.7. I can live with that - maybe it will be fixed with the next one :)
    But why did it have 4 and then 5 branches? Because it was originally a loop, which I tried to force to be evaluated at compile time. I know that this particular loop can only run for, let's say, up to 32 iterations, whereas a compiler cannot necessarily know that (the Halting Problem). Please give us a way to pass code and parameters into a Burst specific function that evaluates that code before passing it to LLVM. This would be both a huge performance and productivity gain! This is by far my most requested feature - as the newer C++ versions have it built into the language.
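
    A minimal sketch of how such a call site might look. Everything here is hypothetical: no such Burst API exists, the managed stand-in below simply invokes the delegate at run time, and `FloorLog2` is only an illustrative loop with a known upper bound of 32 iterations.

```csharp
using System;

public static class CompileTimeSketch
{
    // Hypothetical: mirrors the signature wished for above. This stand-in
    // does no compile-time work; it just calls the delegate.
    public static TResult ForceCompileTimeEvaluation<TParams, TResult>(
        Func<TParams, TResult> code, TParams parameterObj)
    {
        return code(parameterObj);
    }

    // A pure function whose loop runs at most 32 times for any uint input,
    // a bound the author knows but a compiler may not exploit.
    public static int FloorLog2(uint value)
    {
        int result = 0;
        for (int i = 0; i < 32 && (value >> 1) != 0; i++)
        {
            value >>= 1;
            result++;
        }
        return result;
    }

    // Intended use: a constant argument that should fold to a constant result.
    public static int Log2Of1024()
    {
        return ForceCompileTimeEvaluation<uint, int>(FloorLog2, 1024u);
    }
}
```

    With a real implementation, the hope is that `Log2Of1024` would compile down to returning a constant instead of running the loop.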

    2: Compile time access to
    FloatMode
    and
    FloatPrecision

    With Intrinsics we have static readonly properties like
    IsAvxSupported
    which always return false in C# land. Can we also have similar properties that return the current
    FloatMode
    and
    FloatPrecision
    , please?
    Why? Let's look at some code:
    x * math.rsqrt(y)
    with
    FloatMode.Fast
    compiles to what you would expect on x86: an SSE rsqrt instruction followed by some Newton-Raphson and a multiply. At
    FloatMode.Strict
    , it is... strict and performs an SSE sqrt FOLLOWED BY A DIVISION AND A MULTIPLICATION, whereas my intent in that case was simply to divide x by the square root of y.
    There is no use in abstracting away anything that calls
    math.rsqrt
    or
    math.rcp
    in its own function, when the code has to be rewritten for each job, depending on the
    FloatMode
    . Maintenance also becomes a nightmare if one changes the
    FloatMode
    and cares about even minimal changes in performance.
    As a little teaser: It only gets worse with stuff like
    float fourthroot(float x) => rsqrt(rsqrt(x));
    vs
    sqrt(sqrt(x))

    And
    FloatPrecision
    is less important but can probably be done "while you're at it" (same goes for
    OptimizeFor
    ). One example that comes to mind is a recently published fast algorithm for calculating the inverse cube root, which comes with code for varying levels of precision. Again: writing it only once would be nice, with a fake switch statement in the function itself.
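
    The "write it once" idea could be sketched roughly as follows. This is plain C#, not Burst code: `Mode` is a local stand-in for Burst's `FloatMode`, and `CurrentFloatMode` is a hypothetical constant representing the compile-time property requested above.

```csharp
using System;

public static class FloatModeSketch
{
    // Local stand-in for Burst's FloatMode so the sketch compiles on its own.
    public enum Mode { Strict, Fast }

    // Hypothetical: what a compile-time value exposed by Burst might look like.
    public const Mode CurrentFloatMode = Mode.Strict;

    public static float DivideBySqrt(float x, float y, Mode mode)
    {
        if (mode == Mode.Fast)
        {
            // Fast: a reciprocal square root and a multiply.
            return x * (1.0f / MathF.Sqrt(y));
        }
        // Strict: a single square root and a single division.
        return x / MathF.Sqrt(y);
    }

    // Callers would not pass a mode; the branch would be picked per job.
    public static float DivideBySqrt(float x, float y)
    {
        return DivideBySqrt(x, y, CurrentFloatMode);
    }
}
```

    If the mode were a true compile-time constant inside Burst, the untaken branch could be removed entirely, so one function would serve every job.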

    3: Support for generics -
    typeof(T1)
    ==
    and
    !=
    typeof(T2)

    Another C++ compile time feature:
    decltype
    . It would be nice to be able to write generic (job) structs with the ability to compile time evaluate the type of the generic argument. Even in C# land
    typeof(T)
    is compile time evaluated ;)
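
    A sketch of the kind of generic code this would allow, written in plain C# outside of Burst. `LanesPer128Bits` is a hypothetical helper used only to show the `typeof` comparisons.

```csharp
using System;

public static class GenericDispatchSketch
{
    // Hypothetical helper: how many elements of T fit in a 128-bit register.
    public static int LanesPer128Bits<T>() where T : unmanaged
    {
        if (typeof(T) == typeof(float) || typeof(T) == typeof(int)) return 4;
        if (typeof(T) == typeof(double) || typeof(T) == typeof(long)) return 2;
        if (typeof(T) != typeof(byte)) throw new NotSupportedException(typeof(T).Name);
        return 16;
    }
}
```

    For example, `GenericDispatchSketch.LanesPer128Bits<float>()` returns 4, and each instantiation only ever takes one of the branches.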

    4:
    [FieldOffset(0)]
    for unmanaged generic types in structs
    Consider the following simple union as an example:
    Code (CSharp):
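    using System.Runtime.InteropServices;

    // FieldOffset requires an explicit struct layout.
    [StructLayout(LayoutKind.Explicit)]
    struct Union<T1, T2> where T1 : unmanaged where T2 : unmanaged
    {
       [FieldOffset(0)] T1 Item1; // both fields start at byte 0 and overlap
       [FieldOffset(0)] T2 Item2;
    }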
    Currently this fails in Burst, where it says that an explicit
    FieldOffset
    is not allowed for generic types. I don't understand why. Can we make an exception for
    FieldOffset(0)
    specifically? (I see a pattern here -> ) It would be nice to write the Union type only once using generics.

    5: Questions out of interest:
    5.1: Are there any plans on exposing more intrinsics? You have
    umul128
    but
    imul128
    would also be nice... And there are other nice... X86 instructions like let's say...
    bit test
    ,
    rdrand
    /
    rdseed
    , but also LLVM intrinsics that read from the flags register (notoriously difficult to force otherwise)... And of course the big one: AVX512 (at least AVX512 foundation)!!! :) AMD Zen4 is now confirmed to support AVX512.
    5.2: Any chance of inline assembly (as a string passed into a Burst function)? :cool:
    5.3: Is there going to be a software implementation of ARM Neon intrinsics?
    5.4: Are there any plans on intrinsics support for IL2CPP builds? - Meaning e.g.
    IsAvxSupported
    would appropriately evaluate to true or false during an IL2CPP build? After all, at the very least we know that every single X86-64 CPU supports SSE2.

    6: Finally... Two rather small Questions regarding generated X86 code (you might or might not have any influence on)
    6.1: Division of two non compile time constant Unity.Mathematics (u)int vectors is performed in a scalar fashion. Isn't int -cvt-> double -divide-> -cvt-> int 100% safe? At the very least it's faster, even if you have to convert 2 int4s to 4 double2s, for example.
    6.2: In SIMD we have boolean masks being generated from comparisons - they're either all ones or all zeros. To convert such a mask to a C# boolX vector for storing it in memory, the code I've seen uses a mask of 0x01010101... which gets AND-ed with the SIMD mask. Isn't the "abs" instruction much, much better here? (guaranteed no cache miss, and only 1 clock cycle latency)
    abs(0xFF)
    is
    0x01
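
    Both questions in item 6 can be checked with scalar stand-ins in plain C#. This is only a sketch with hypothetical names, not the code Burst generates. For 6.1, every 32-bit integer is exactly representable as a double, and the rounding error of the double quotient is far smaller than the gap of 1/|b| to the next integer, so truncating the result should match integer division whenever the divisor is non-zero and the quotient fits in an int. For 6.2, a 0xFF byte lane reinterpreted as signed is -1, whose absolute value is 1; on x86 the byte-wise form would presumably be SSSE3's pabsb.

```csharp
using System;

public static class Item6Sketch
{
    // 6.1: signed 32-bit division routed through double precision.
    // Assumes b != 0 and not (a == int.MinValue && b == -1); integer
    // division itself throws for both of those inputs in C#.
    public static int DivideViaDouble(int a, int b)
    {
        return (int)((double)a / (double)b);
    }

    // Compares DivideViaDouble against native division on boundary values.
    public static bool MatchesIntegerDivision()
    {
        int[] samples = { int.MinValue, -7, -1, 1, 2, 3, 7, int.MaxValue - 1, int.MaxValue };
        foreach (int a in samples)
        {
            foreach (int b in samples)
            {
                if (a == int.MinValue && b == -1) continue; // quotient does not fit in int
                if (DivideViaDouble(a, b) != a / b) return false;
            }
        }
        return true;
    }

    // 6.2: one byte lane of a comparison mask is 0xFF (true) or 0x00 (false).
    public static byte LaneToBoolViaAnd(byte lane) => (byte)(lane & 0x01);

    // Reinterpret the lane as signed (0xFF is -1) and take the absolute value.
    public static byte LaneToBoolViaAbs(byte lane) => (byte)Math.Abs(unchecked((sbyte)lane));
}
```

    The abs variant only gives the right answer for lanes that are exactly all ones or all zeros, which is what a SIMD comparison produces.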



    This is a little too long but I hope that I didn't overwhelm anyone who read this. As a final note I can yet again only say that the Burst team, imo, does great work and is the very foundation DOTS is built upon. Thank you!
     
    Last edited: Jan 7, 2022
  2. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Something I've always wondered but haven't gotten around to testing is whether it is possible to override these for a specific function. So if I have a static class function library which needs strict floats to work properly, but want to allow callers to use weaker constraints outside of the functions, is that possible?
     
  3. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    I tried...:
    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using static Unity.Mathematics.math;

    [BurstCompile(FloatMode = FloatMode.Fast)]
    public struct Test : IJob
    {
        public NativeArray<float> output;

        // Attempted per-method override of the job-level FloatMode.
        [BurstCompile(FloatMode = FloatMode.Strict)]
        float DoIt(float x)
        {
            return rsqrt(x);
        }

        public void Execute()
        {
            output[0] = DoIt(output[0]);
        }
    }
    ... in hopes that a (legal, after all)
    BurstCompileAttribute
    on an IJob method would override that of its parent IJob, but even that is not the case. Looking at the assembly output, it returns the
    FloatMode.Fast
    version (You could compare it with the same job without the attribute above the
    DoIt
    method).

    The only ways I see for it to work are to either make that use of the attribute work for methods (even outside of jobs) or to have that related wish of mine fulfilled PLUS the static readonly property not being... readonly - you being able to set (and reset) the
    FloatMode
    and/or
    FloatPrecision
    in your methods.
    I would prefer the attribute working the way I thought it should, though.
     
    DreamingImLatios likes this.
  4. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    You cannot override them per function - while for some of these we could make that work with LLVM, for things like float precision that is a job-wide override (so we couldn't easily have low precision cos in one function, and high precision cos in another).

    This is a great list btw, as always with feedback from our users we've all read it, filed tasks for anything new you've proposed, and linked this post with any previous asks for similar features.

    We're hard at work on compile time improvements since 1.5 released (1.6, 1.7, and our next version are all primarily about getting your code compiled faster), but we hope we'll be able to make some time after the next version is released to address exactly these kinds of requests!
     
    colinleet, Mortuus17 and R2-RT like this.
  5. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    Thank you @sheredom, as long as the motivation behind these wishes is understood and agreed upon and they are thus put into your backlog, I couldn't be happier (almost :p).

    And yes, tackling Burst's compilation performance first serves more users than these "advanced" features and should be done before introducing additional complexity. Again - great work!
     
    sheredom likes this.