HPC# vs .NET Core 3 Benchmark

Discussion in 'Entity Component System' started by davreev, Oct 18, 2019.

  1. davreev

    Joined: Oct 10, 2013
    Posts: 5
    After watching one of the talks on Burst compilation from Unite Copenhagen, I became a bit suspicious of the performance gains being presented, so I decided to run some simple benchmarks of my own as a sanity check (see details below).

    While I did find a significant performance difference between Burst-compiled C# and standard C# in Unity (as showcased at Unite), the difference between Burst-compiled C# and standard C# compiled with .NET Core 3.0 was negligible - in fact, Burst-compiled C# was slower until safety checks were disabled.

    Needless to say, these findings were pretty disappointing and have left me feeling somewhat misled by the folks at Unity. While I appreciate the coordinated push towards data-oriented design, it would appear that "performance by default" is something that the language already offers out of the box - just not within the development environment provided by Unity.

    In any case, I thought I'd share the results here in case my benchmarks are deeply flawed in some way that I've completely overlooked. Given that I've spent the last few weeks working on Burst implementations of performance sensitive algorithms, I sincerely hope they are.

    Details

    The benchmark performs 1 million axpy operations on single precision 3-dimensional vectors.
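    For context, the Burst side of the test is essentially a job along these lines (a minimal sketch of the kind of job involved, not the exact benchmark code; the type and field names AxpyJob, A, X, Y, Result are illustrative):

    Code (CSharp):
        using Unity.Burst;
        using Unity.Collections;
        using Unity.Jobs;
        using Unity.Mathematics;

        // One axpy per element: Result[i] = A[i] * X[i] + Y[i],
        // where A, Y, and Result hold float3 vectors and X holds scalars.
        [BurstCompile]
        public struct AxpyJob : IJobParallelFor
        {
            [ReadOnly] public NativeArray<float3> A;
            [ReadOnly] public NativeArray<float> X;
            [ReadOnly] public NativeArray<float3> Y;
            [WriteOnly] public NativeArray<float3> Result;

            public void Execute(int i)
            {
                Result[i] = A[i] * X[i] + Y[i];
            }
        }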

    [Attached image: axpy-bench-summary.png]
    Potentially(?) relevant Unity settings:
    • Project Settings
      • Burst AOT settings
        • Optimizations: enabled
        • Safety checks: disabled
        • Burst compilation: enabled
      • Player settings
        • Scripting runtime version: .NET 4.x Equivalent
        • Scripting backend: IL2CPP
        • C++ compiler configuration: Release
    • Preferences
      • External Tools
        • Editor Attaching: disabled
    * Edit: Add table
    * Edit: Update benchmarks with "Editor Attaching" disabled
     
    Last edited: Oct 19, 2019
    Mark-Currie and burningmime like this.
  2. tertle

    Joined: Jan 25, 2011
    Posts: 3,759
  3. davreev

    Joined: Oct 10, 2013
    Posts: 5
    Thanks, should've known someone had already done a deep dive on this. Looks like a much more useful/complete comparison.

    That said, these don't seem to capture the massive performance difference between Burst-compiled C# and standard C# when writing code inside Unity.

    What concerns me here is that this difference is being presented as a feature. Given the performance of the language outside of Unity, I see it more as a bug: there appears to be a significant performance penalty for writing non-Burst C# in Unity.
     
  4. sschoener

    Joined: Aug 18, 2014
    Posts: 73
    The original Unity implementation is the MonoJit one in the table, so it does show the performance difference. The Mono version used by Unity is ancient (based on version 2.7 or something like that, IIRC) and the quality of the actual CPU instructions produced by that version's JIT compiler is somewhere between abysmal and laughable. So yes, that's definitely slower than what you'd get when running C# on the most recent versions of Mono or RyuJIT :)
     
    davreev likes this.
  5. Joachim_Ante

    Unity Technologies
    Joined: Mar 16, 2005
    Posts: 5,203
    Additionally, while nxrighthere's benchmark suite covers real-world use cases that do exist, it does not cover the ones most important for performance in games. The benchmark is not what Burst is currently optimized towards. Burst is optimized for writing SIMD code, making that process simpler & more performant.

    None of the examples in the benchmark showcase that. It is difficult to compare to other languages because it requires a combination of math library & compiler...

    I am not trying to take a piss on the benchmark. It is definitely real. There is plenty of code that is scalar & relies on auto-vectorization to get faster. And Burst is generally on par with or better than C++ at it, and significantly better than RyuJIT at it, but it's not been the primary focus.

    What we primarily want to enable with Burst is writing high-performance SIMD code. Auto-vectorization is nice to have, but it is not the primary goal of Burst, primarily because we want to make sure that programmers can control performance & not get it accidentally.

    This is difficult to benchmark because there is no apples-to-apples comparison here. Really, it comes down to making it easy to write high-performance code with a good math library supported by a compiler. This is something that doesn't exist anywhere else.



    Do note that publishing to the new DOTS Runtime, which we will ship in preview in the next couple of months, supports .NET Core as the deployment target. Additionally, if you compare CPU performance, we generally put all our optimization effort for normal OO code into IL2CPP, not into Mono. The expectation is that for final builds our users should use IL2CPP if performance is a concern in any shape or form...
     
    Last edited: Oct 18, 2019
    shamsfk, MNNoxMortem, Orimay and 7 others like this.
  6. starikcetin

    Joined: Dec 7, 2017
    Posts: 340
    Dude... You are supposed to turn them off during a benchmark. They only exist for development-time purposes.
     
    Orimay and RaL like this.
  7. runner78

    Joined: Mar 14, 2015
    Posts: 792
    In the benchmark, sometimes RyuJIT is faster than or equal to IL2CPP. But IL2CPP is not well suited for games with mod support.
     
    forestrf, dadude123 and JesOb like this.
  8. davreev

    Joined: Oct 10, 2013
    Posts: 5
    Ah, thanks - that's very clear.

    Doesn't this make some of the live Burst demonstrations feel even more misleading though? It'd be like if I chose to demonstrate how fast my car was by racing it against a go-kart...
     
  9. davreev

    Joined: Oct 10, 2013
    Posts: 5
    AFAICT, safety checks in this particular benchmark amount to indexer bounds checks on NativeArray*. These are always performed on managed arrays in C# but can be optimized away by the compiler in some circumstances. Since this is an attempt to compare compilers, one might argue that it's fairer to keep the safety checks on. In any case, I gave Burst the benefit of the doubt and turned them off in the end, so I'm not sure what the issue is.

    * Please correct me if I'm wrong here
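    For illustration, the usual rule of thumb (a hand-wavy sketch, not code from the benchmark): RyuJIT can typically elide the bounds check when the loop is bounded directly by the array's own Length, but not when the bound comes from somewhere else:

    Code (CSharp):
        // Bounds check typically eliminated: the JIT can prove i < a.Length.
        static float SumDirect(float[] a)
        {
            float sum = 0f;
            for (int i = 0; i < a.Length; i++)
                sum += a[i];
            return sum;
        }

        // Bounds check typically kept: 'count' is not provably <= a.Length.
        static float SumCounted(float[] a, int count)
        {
            float sum = 0f;
            for (int i = 0; i < count; i++)
                sum += a[i];
            return sum;
        }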

    ** Edit: Add link
     
  10. sschoener

    Joined: Aug 18, 2014
    Posts: 73
    You're welcome :) I guess in the end it depends on who you are talking to: If you have been working with Unity for a few years now, these speed-ups are very real and very relevant. I agree that this perspective is probably rightfully different for someone coming from a different background. I'd say that some context is implied when Burst is mentioned, namely that Burst gives you much better performance while still using Unity. On that ground, that's still very exciting :)

    I also agree with Joachim here that the story is more nuanced, really: It's not just this performance gain, it's also cross-platform SIMD, having a compiler that knows about Unity, can do aliasing analysis, owning the whole toolchain etc.
    Unfortunately, I only know a handful of people that get as excited about cross-platform SIMD :D whereas more performance for Unity NOW is something most people immediately get and is probably a better marketing move ;)

    I've taken a quick look at your benchmark (thanks for posting! :)) and there are some caveats -- maybe you already caught those since I think they are also mentioned in the benchmarking thread?
    • consider using synchronous compilation for Burst (you can specify that via an argument to the BurstCompile attribute) since you otherwise might also benchmark compilation
    • in the serial example, consider using Run() instead of Schedule().Complete() since the latter will also go through the job system (unnecessarily so)
    I'm not 100% sure whether that will change your benchmarks since you said that you were merely eyeballing them from the profiler, so maybe those pitfalls don't apply. A quick sketch of both changes is below.
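    Roughly (reusing the AxpyJob shape from the sketch in the first post; illustrative only, not the actual benchmark code):

    Code (CSharp):
        using Unity.Burst;
        using Unity.Collections;
        using Unity.Jobs;
        using Unity.Mathematics;

        // CompileSynchronously ensures the job is Burst-compiled before its
        // first execution, so the measurement doesn't silently include
        // compilation time or a managed fallback run.
        [BurstCompile(CompileSynchronously = true)]
        public struct AxpyJob : IJobParallelFor
        {
            [ReadOnly] public NativeArray<float3> A;
            [ReadOnly] public NativeArray<float> X;
            [ReadOnly] public NativeArray<float3> Y;
            [WriteOnly] public NativeArray<float3> Result;

            public void Execute(int i) => Result[i] = A[i] * X[i] + Y[i];
        }

        // Serial case: Run() executes on the calling thread and skips the
        // job-system scheduling that Schedule(...).Complete() goes through.
        //     var job = new AxpyJob { A = a, X = x, Y = y, Result = result };
        //     job.Run(count);                      // serial
        //     job.Schedule(count, 64).Complete();  // parallel, via job system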
     
    Last edited: Oct 19, 2019
    davreev likes this.
  11. starikcetin

    Joined: Dec 7, 2017
    Posts: 340
    As far as I know, all safety checks are stripped away if you turn them off.
     
  12. nxrighthere

    Joined: Mar 2, 2014
    Posts: 567
    Sounds great. How limited will it be out of the box in terms of extensibility? Will we be able to use the full set of C# features, managed objects, and 3rd-party libraries?
     
  13. Joachim_Ante

    Unity Technologies
    Joined: Mar 16, 2005
    Posts: 5,203
    One of the use cases we want to enable is that you might want to have a Kestrel web server but drive world creation & loading from there. So you can be in charge of the main loop from your own C# code, which opens up DOTS simulation code to a whole category of different uses on servers etc.
     
  14. davreev

    Joined: Oct 10, 2013
    Posts: 5
    Yeah, I can appreciate this. I've been using Unity for some time now, but I suppose that my current context is somewhat unusual, i.e. developing external libraries for use across a few different platforms (one of which is Unity). It seems like my options here are limited to (1) including Burst-compatible implementations of certain algorithms and data structures or (2) accepting the performance regression that comes with using external libraries in Unity due to Mono JIT.

    Thanks for taking a look. Making these changes appears to have *slightly* improved the serial implementation but it's hard to say for sure without taking more rigorous measurements. In any case, these sorts of details are good to know about so thanks for the feedback.

    An additional note for the Mono JIT benchmark: disabling "Editor Attaching" (as suggested in this blog post) makes a pretty significant difference so I've updated the summary to reflect this.
     
    Last edited: Oct 19, 2019
  15. sschoener

    Joined: Aug 18, 2014
    Posts: 73
    Mhm, I don't think there really are other options if you want to stick to C#. If you are aiming for high performance, you should also keep in mind that the performance characteristics of Unity's Mono version can be quite different from the latest .NET Core runtimes; there's a different GC in place and Unity still lacks support for Span<T> etc., AFAIK.

    This is probably not all that relevant to you anymore, but I've taken another close look at your benchmark and wanted to share some findings :) First, I've changed your code to use 4-wide vectors (nicer alignments). Then I put the benchmarking code into SharpLab online to see what the latest RyuJIT would do to it (select JIT Asm and Release on the right-hand side). The relevant code for the parallel version is the very last function, AxpyBenchmark+<>c__DisplayClass11_0.<AxpyParallel>b__0(System.Tuple`2<Int32,Int32>).
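    The code in question is roughly of this shape (a sketch, not the exact source; Vec4 stands in for a plain hand-rolled 4-float struct, which is consistent with the component-wise scalar loads in the output below):

    Code (CSharp):
        using System;
        using System.Collections.Concurrent;
        using System.Threading.Tasks;

        // Plain 4-float struct (not System.Numerics.Vector4), so RyuJIT gets
        // no SIMD hints from the type itself.
        public struct Vec4
        {
            public float X, Y, Z, W;

            public static Vec4 operator *(Vec4 v, float s) =>
                new Vec4 { X = v.X * s, Y = v.Y * s, Z = v.Z * s, W = v.W * s };

            public static Vec4 operator +(Vec4 a, Vec4 b) =>
                new Vec4 { X = a.X + b.X, Y = a.Y + b.Y, Z = a.Z + b.Z, W = a.W + b.W };
        }

        public static class AxpyBenchmark
        {
            // The Tuple<int,int> lambda below is the procedure that shows up
            // last in the SharpLab JIT Asm output.
            public static void AxpyParallel(Vec4[] a, float[] x, Vec4[] y, Vec4[] result)
            {
                Parallel.ForEach(Partitioner.Create(0, a.Length), range =>
                {
                    for (int i = range.Item1; i < range.Item2; i++)
                        result[i] = a[i] * x[i] + y[i];
                });
            }
        }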
    I notice the following things:
    • There is a redundant vzeroupper in the parallel implementation (note how all of the later SSE instructions use the VEX prefix). The story here is that XMM registers extend into YMM registers and the old SSE instructions aren't aware of that, so newer processors with AVX support (~85% of users on Steam, see Other Settings at the bottom of this page) pay a penalty for intermixing old SSE instructions and new YMM instructions, since the processor needs to ensure that the upper bits of the register are unaffected by the legacy SSE instructions. vzeroupper gets rid of that, but so does using the new VEX-prefixed instructions (like vmovss instead of movss). See this post for someone much smarter than me explaining the issue. Now, executing this vzeroupper instruction is dirt cheap as far as I know, but I'm bringing it up because I want to get back to it later.
    • The output lacks any form of vectorization, but has detected that AVX is available (since it's using SSE instructions with VEX prefix). For example, this is the load of A and the multiplication with X:
      Code (CSharp):
      L0032: vmovss xmm0, dword [edi]
      L0036: vmovss xmm1, dword [edi+0x4]
      L003b: vmovss xmm2, dword [edi+0x8]
      L0040: vmovss xmm3, dword [edi+0xc]
      L0045: mov edi, [ecx+0xc]
      L0048: cmp edx, [edi+0x4]
      L004b: jae L0105
      L0051: vmovss xmm4, dword [edi+edx*4+0x8]
      L0057: vmulss xmm0, xmm0, xmm4
      L005b: vmulss xmm1, xmm1, xmm4
      L005f: vmulss xmm2, xmm2, xmm4
      L0063: vmulss xmm3, xmm3, xmm4
    • The compiler did not eliminate the bounds checks on the array accesses, not even in the serial version. This is easy to spot for this compiler because there is a call plus trap-to-debugger interrupt at the very end of the compiled procedure; the cmp edx, [edi+0x4] followed by jae L0105 in the piece above is exactly one such bounds check plus a jump to the exceptional path.
    So there is room for improvement here :)

    Let's look at the Burst compiled version. I have again modified it to use float4 instead of float3. I'm using Burst 1.2 preview-6 though I get the same result for Burst 1.1.2. You can look at the code yourself by using the Burst inspector from the Jobs/Burst toolbar item. I have disabled safety checks and fast math (the latter setting does not affect code-generation in this specific example but is generally helpful because it allows the compiler to reorder floating point operations more freely). The instruction set is set to auto, since this is also what the AOT code generation will use. This seems to use SSE4.x instructions and have no support for AVX (which makes sense when you build for x64 Windows since that is guaranteed to have some SSE support).
    • The core loop boils down to this:
      Code (CSharp):
      .LBB0_5:
              movups  xmm0, xmmword ptr [rdx + 4*rcx]
              movss   xmm1, dword ptr [rbx + rcx]
              shufps  xmm1, xmm1, 0
              mulps   xmm1, xmm0
              movups  xmm0, xmmword ptr [rbp + 4*rcx]
              addps   xmm0, xmm1
              movups  xmmword ptr [rsi + 4*rcx], xmm0
              add     rcx, 4
              dec     rax
              jne     .LBB0_5
      This is quite a bit shorter than the core part from the other compiler, though brevity does of course not always equal speed. This procedure is using vector instructions to do the actual computation: note the mulps, addps instructions (ps = packed single, ss = scalar single). The single scalar instruction (movss) is the load of X; the shufps right after that is broadcasting the value into the full XMM register.
    • Irritatingly, Burst (or rather, LLVM) decided to emit movups instructions instead of movaps, since the system should be able to ensure that those float4s are 16-byte aligned.
    • All instructions are pure SSE without the VEX prefix. This of course depends on the environment, but on my machine I might be paying a penalty if there is any, say, driver code that felt clever about using my AVX2-enabled processor. I'm not sure what the right solution to this is (since vzeroupper can only be used when you already know that you have AVX), or whether this is even a problem (I'd be curious to see whether a call to vzeroupper in there would improve things ever so slightly?)
    Of course, the proof of the pudding is in the eating, and ultimately it comes down to execution times, not a literary review of the generated assembly. So why isn't Burst faster in this case? Well, I've only had a cursory look at it using VTune, and it claims to be memory bound on my machine. At least on my machine, I can substitute A with math.sqrt(A) for free, with no additional runtime cost. Similarly, I get quite a dramatic speedup when I write back into A instead of Result, even though the only difference in the assembly is that movups xmmword ptr [rsi + 4*rcx], xmm0 uses rdx instead of rsi.
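    To make the "memory bound" point concrete, the substitution is literally just this (sketched against the Execute body from the earlier job example; math.sqrt is the component-wise Unity.Mathematics function):

    Code (CSharp):
        // Extra ALU work added per element. On a memory-bound loop this is
        // effectively free: the CPU is stalled on loads/stores, not on math.
        public void Execute(int i)
        {
            Result[i] = math.sqrt(A[i]) * X[i] + Y[i];
        }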

    So that was a lot of fun :) I guess the lesson here is that sometimes benchmarks don't measure what you expect them to measure. This benchmark probably can only distinguish very bad code generation from somewhat decent code generation, but beyond that it won't help: the difference in theoretical latency of the generated instructions will be dwarfed by memory access times.

    Edit: I have repeated the measurements using float3 and the same conclusions apply. It's not enough computation per read/write.
     
    Last edited: Oct 20, 2019
  16. Arowx

    Joined: Nov 12, 2009
    Posts: 8,194
    So in theory our Burst code could waste compute cycles and bandwidth by not doing enough processing on the data?

    Does this mean there will be a sweet spot between data size and code size that will maximise throughput for Burst code on different hardware, and that without using a tool like VTune we have no way of knowing how efficiently our code is using the CPU and the available memory bandwidth?

    I had assumed that with a DOTS-based approach, once you have written a small library of atomic systems* that covers most games, it would just be a matter of interlinking those systems and adding some meta-game code, and you would be able to write any game.

    It sounds like a flaw in going for small, atomic systems will be the available memory bandwidth; are there any ways to detect this issue in the Unity profiler?

    *Admittedly, I thought that since there are only about 20-50 operations that can be performed on a CPU, if you write a system for +, -, /, *, &, |, !, vector ops, etc., then you could make any super-fast program using a range of 20-50 atomic DOTS systems (meta-programming).
     
    Last edited: Oct 20, 2019
  17. burningmime

    Joined: Jan 25, 2014
    Posts: 845
    Just bumping this since I looked it up for another thread.

    I feel like the real comparison shouldn't be vs Burst, but rather in speeding up the main thread, which is likely the bottleneck even in many DOTS-enabled games. A lot of the core logic (and especially load-time logic) lives in the main thread. Jobifying things takes a lot of effort (engineering time/dollars), and older code/asset store stuff is likely not jobified.

    An order-of-magnitude faster scripting runtime/GC on the main thread would help out everybody, and even in games with heavy use of Burst it would improve editor performance and initialization times.

    Burst is incredible for SIMD math and "embarrassingly parallel" problems, but there's a lot of code that doesn't fit that model.

    Here's a post from 2018 where a Unity engineer did it for a hackathon(?) and describes some of the challenges: https://xoofx.com/blog/2018/04/06/porting-unity-to-coreclr/
     
    Kolyasisan likes this.