Disclaimer: I consider myself a power user of Burst, regularly using every single feature that helps optimize small and large functions alike. This lets me abstract those functions away and reuse them while knowing that the code gen will be fantastic. Burst has come a long way, particularly with intrinsics and, more recently, the Constant.IsConstantExpression<T> compiler feature. YOU ARE DOING GREAT WORK! But there are a few things that I, as a power user, would still like to have, which I have compiled in the following list.

1: TResult ForceCompileTimeEvaluation<TParams, TResult>(Func<TParams, TResult> code, TParams parameterObj)

This comes from a bug report I submitted: a pure function with constant arguments that contained 4 if statements was constant-evaluated, but no longer was once I added a fifth if statement, which was very frustrating. This issue has not been fixed yet, although I was told it should be fixed with Burst 1.7. I can live with that; maybe it will be fixed with the next release. But why did the function have 4 and then 5 branches? Because it was originally a loop, which I tried to force to be evaluated at compile time: I know that that particular loop can only run for, let's say, up to 32 iterations, whereas a compiler cannot necessarily prove that (the halting problem). Please give us a way to pass code and parameters into a Burst-specific function that evaluates that code before handing it to LLVM. This would be both a huge performance and productivity gain! It is by far my most requested feature, as newer C++ versions have it built into the language.

2: Compile time access to FloatMode and FloatPrecision

With intrinsics we have static readonly properties like IsAvxSupported, which always return false in C# land. Can we also have similar properties that return the current FloatMode and FloatPrecision, please? Why?
Let's look at some code: x * math.rsqrt(y) with FloatMode.Fast compiles to what you would expect on X86: an SSE rsqrt instruction followed by some Newton-Raphson and a multiply. With FloatMode.Strict, it is... strict, and performs an SSE sqrt FOLLOWED BY A DIVISION AND A MULTIPLICATION, whereas my intent was that in the strict case it would only divide x by the square root of y. There is no use in abstracting anything that calls math.rsqrt or math.rcp into its own function when the code has to be rewritten for each job depending on the FloatMode. And maintenance is a nightmare in case one changes the FloatMode and cares about even minimal changes in performance. As a little teaser: it only gets worse with stuff like float fourthroot(float x) => rsqrt(rsqrt(x)); vs sqrt(sqrt(x)).

FloatPrecision is less important but can probably be done "while you're at it" (same goes for OptimizeFor). One example that comes to mind is a recently published fast algorithm for calculating the inverse cube root, which comes with code for varying levels of precision. Again: writing it only once would be nice, with a fake switch statement in the function itself.

3: Support for generics - typeof(T1) == and != typeof(T2)

Another C++ compile time feature: decltype. It would be nice to be able to write generic (job) structs with the ability to evaluate the type of the generic argument at compile time. Even in C# land, typeof(T) is compile time evaluated.

4: [FieldOffset(0)] for unmanaged generic types in structs

Consider the following simple union as an example:

Code (CSharp):
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
struct Union<T1, T2>
    where T1 : unmanaged
    where T2 : unmanaged
{
    [FieldOffset(0)] T1 Item1;
    [FieldOffset(0)] T2 Item2;
}

Currently this fails in Burst, which says that an explicit FieldOffset is not allowed for generic types. I don't get why - can we make an exception for FieldOffset(0) specifically? (I see a pattern here -> ) It would be nice to only write the Union type once using generics.
5: Questions out of interest:

5.1: Are there any plans on exposing more intrinsics? You have umul128, but imul128 would also be nice... And there are other nice X86 instructions like, let's say, bit test and rdrand/rdseed, but also LLVM intrinsics that read from the flags register (notoriously difficult to force otherwise)... And of course the big one: AVX-512 (at least AVX-512 Foundation)!!! AMD Zen 4 is now confirmed to support AVX-512.

5.2: Any chance of inline assembly (as a string passed into a Burst function)?

5.3: Is there going to be a software implementation of ARM Neon intrinsics?

5.4: Are there any plans for intrinsics support in IL2CPP builds? Meaning e.g. IsAvxSupported would appropriately evaluate to true or false during an IL2CPP build? After all, at the very least we know that every single X86-64 CPU supports SSE2.

6: Finally... two rather small questions regarding generated X86 code (which you might or might not have any influence on):

6.1: Division of two non-compile-time-constant Unity.Mathematics (u)int vectors is performed in a scalar fashion. Isn't int -cvt-> double -divide-> -cvt-> int 100% safe? At the very least it's faster, even if you have to convert 2 int4s to 4 double2s, for example.

6.2: In SIMD we have boolean masks being generated from comparisons - they're either all ones or all zeros. To convert such a mask to a C# boolX vector for storing it in memory, the code I've seen uses a mask of 0x01010101..., which gets AND-ed with the SIMD mask. Isn't the "abs" instruction much, much better here? (100% no cache miss, also 1 clock cycle latency.) abs(0xFF) is 0x01.

This is a little too long, but I hope that I didn't overwhelm anyone who read this. As a final note I can yet again only say that the Burst team, imo, does great work and is the very foundation DOTS is built upon. Thank you!
Something I've always wondered but haven't gotten around to testing is whether it is possible to override these for a specific function. If I have a static class function library which needs strict floats to work properly, but I want to allow callers to use weaker constraints outside of those functions, is that possible?
I tried...:

Code (CSharp):
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using static Unity.Mathematics.math;

[BurstCompile(FloatMode = FloatMode.Fast)]
public struct Test : IJob
{
    public NativeArray<float> output;

    [BurstCompile(FloatMode = FloatMode.Strict)]
    float DoIt(float x)
    {
        return rsqrt(x);
    }

    public void Execute()
    {
        output[0] = DoIt(output[0]);
    }
}

... in hopes that a (legal, after all) BurstCompileAttribute on an IJob method would override that of its parent IJob, but even that is not the case. Looking at the assembly output, it returns the FloatMode.Fast version (you can compare it with the same job without the attribute above the DoIt method). The only ways I see for it to work are to either make that use of the attribute work for methods (even outside of jobs), or to have that related wish of mine fulfilled PLUS the static readonly property not being... readonly, i.e. being able to set (and reset) the FloatMode and/or FloatPrecision in your methods. I would prefer the attribute working as I thought it should, though.
You cannot override them per function. While for some of these we could make that work with LLVM, things like float precision are a job-wide override (so we couldn't easily have a low-precision cos in one function and a high-precision cos in another). This is a great list, btw; as always with feedback from our users, we've all read it, filed tasks for anything new you've proposed, and linked this post with any previous asks for similar features. We've been hard at work on compile time improvements since 1.5 released (1.6, 1.7, and our next version are all primarily about getting your code compiled faster), but we hope to make some time after the next version is released to address exactly these kinds of requests!
Thank you @sheredom, as long as the motivation behind these wishes is understood and agreed upon, and they are thus put into your backlog, I couldn't be happier (almost). And yes, tackling Burst's compilation performance first serves more users than these "advanced" features and should happen before introducing additional complexity. Again - great work!