Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

Discussion in 'Burst' started by nxrighthere, Jul 23, 2019.

Thread Status:
Not open for further replies.
  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194

    Just thought you could use some graphs (lower is better).

    Just based on the graphs (data from the git project page), if we could optionally compile the IL2CPP output with GCC or Clang, we could have code that runs faster than Burst 8 out of 10 times.
     
    Last edited: Sep 13, 2019
    sasa42 likes this.
  2. jamespaterson

    jamespaterson

    Joined:
    Jun 19, 2018
    Posts:
    401
    Could you clarify, please? Do you mean you expect Burst to do better than C++ without the use of, e.g., SIMD intrinsics? Or that Burst will do better than hand-optimised C++/SIMD code? Thanks.
     
  3. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Last edited: Sep 13, 2019
  4. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    @nxrighthere Could you try with the latest Burst 1.2.0-preview.1? The Clang version in Burst has been upgraded to 8.0+, so I'm curious to see if there are any differences in the results.
     
  5. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    @Arowx I was thinking about implementing a graph generator directly in Unity; I just need more spare time for that, as well as to make a vectorized version of the tests...

    As for the -O3 compiler option, I've tried it before, and yes, while it slows down everything up to 3 times, Clang and Burst become closer to GCC (and even a bit faster) in a few tests. Phoronix uses real-world software in their tests, which goes beyond simply switching compiler options; there are too many external factors, unlike in my relatively tiny tests.

    @xoofx Sure, it will not take much time.
     
  6. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I made the above graphs in Unity. I can share the project if you're interested; it will need to be adapted for your use, but it's a start?
     
  7. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Sure, send it to my email and I'll take a look.
     
  8. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    @xoofx Done. Here's the diff for 1.2.0-preview.1 (the left one is before). Noticeable changes: the new version improved the performance of the raytracer by 55%, but it's still behind Clang. Fast float mode still slows down the fireflies flocking by 5% instead of making it faster, for some reason.
     
    sheredom, elcionap and hippocoder like this.
  9. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Added the Radix benchmark; no unexpected results, Burst is solid at integer math. So yeah, that's it. The next step would be implementing a vectorized version of the tests. This will require linking against SLEEF (to make it a fair comparison with Burst, which utilizes it) and replacing all the standard floating-point math functions.

    I'll keep my eye on Burst updates and come back from time to time. Cheers.
     
  10. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Last edited: Sep 13, 2019
    nxrighthere likes this.
  11. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    @Arowx Thanks for the package, I'll play with it tomorrow.

    I can answer this question instead of Joachim and clarify things.

    Burst's foundation is Clang (frontend) with LLVM (backend), which processes an intermediate representation (IR) transpiled from the intermediate language (IL) that is produced from a high-performance subset of C#.

    With the same foundation, C++ is transpiled to IR as well and passes through the same optimization steps done by LLVM, which produces the machine code.

    So, how can Burst with C# be faster than C++ on the same foundation:
    • More optimizations in LLVM that upstream doesn't have
    • Non-aliasing memory access through custom abstractions such as native containers, as in the sketch below (over the years C++ has been moving from working with raw pointers to abstractions that lack this possibility, and unlike C, C++ doesn't have a standardized way to express it manually; it's available only as a compiler extension that people rarely use)
    • The compiler can apply more optimizations with the restrict qualifier, therefore making the code perform faster without aliasing (it's not always possible)
    Current state:
    • Burst is on par with Clang at auto-vectorization; if you work with pointers, you need to use the restrict qualifier to make it better
    • If you vectorize by hand using SIMD vector types and functions, the difference depends on the SIMD abstraction layer
    • There are some issues with scalars that should eventually be resolved, but it's always better to use SIMD vector types if the target architecture allows it
    So, you already have the tools to make Burst with a subset of C# do better than Clang with C++.
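    For illustration, a minimal sketch of the native-container point above (the job and field names are made up): because Burst can assume that the NativeArray fields of a job do not alias each other, a plain loop over them is a good auto-vectorization candidate, much like C pointers marked restrict.

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;

    // Sketch only: Burst assumes the NativeArray fields of a job do not alias,
    // so this loop can be auto-vectorized without restrict-style annotations.
    [BurstCompile]
    public struct AddArraysJob : IJob
    {
        [ReadOnly] public NativeArray<float> A;
        [ReadOnly] public NativeArray<float> B;
        public NativeArray<float> Result;

        public void Execute()
        {
            for (int i = 0; i < Result.Length; i++)
                Result[i] = A[i] + B[i]; // no aliasing possible between A, B and Result
        }
    }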
     
    Last edited: Sep 14, 2019
  12. jamespaterson

    jamespaterson

    Joined:
    Jun 19, 2018
    Posts:
    401
    Thanks for your detailed response
     
  13. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    I think fundamentally the real strength of Burst + the math library at this point is that it makes it very easy to write cross-platform SIMD code. You have to lay out data in SOA form & use float4 etc.

    In a benchmark against C++ you could do the same by writing platform-specific intrinsics, but of course that's harder-to-read code & will only run on one platform. Making a user-friendly C++ math library that generates good SIMD code across all supported Unity platforms, and making all the compilers actually generate good code, is something I don't think is actually possible in C++. We have certainly tried with our math library in C++, but we were never really happy with the generated code on every platform. Not having control over the whole stack is a pain in the ass.

    There is clearly much more we want, all the way from being able to write the lowest-level arch-specific intrinsics in C# to making the compiler do & enforce the boring work of SOA-unrolling your loops.

    Ultimately there are some things Burst is already much better at than what you can do in C++.

    The end goal is to always be as good as or better than C++ in terms of performance, while making it significantly simpler to write performant code.
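    For illustration, a hedged sketch of the SOA + float4 style described above (the job and field names are made up): each float4 element packs four particles per axis stream, so one operation advances four of them at once on every platform Burst targets.

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    // Sketch only: positions and velocities stored as separate x/y/z streams (SOA),
    // four particles per float4 element, integrated with one vector op per axis.
    [BurstCompile]
    public struct IntegrateJob : IJobParallelFor
    {
        public NativeArray<float4> PosX, PosY, PosZ;
        [ReadOnly] public NativeArray<float4> VelX, VelY, VelZ;
        public float DeltaTime;

        public void Execute(int i)
        {
            PosX[i] += VelX[i] * DeltaTime; // four particles advanced per statement
            PosY[i] += VelY[i] * DeltaTime;
            PosZ[i] += VelZ[i] * DeltaTime;
        }
    }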
     
  14. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Would it be possible for Burst to refactor data from AOS to SOA and convert a structure with float x,y,z to float4?

    In theory, if it could, then it should outperform the other compilers in this benchmark on the Particles, NBody and Raytracer tests.
     
  15. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    It's possible to do automatic AOS-to-SOA conversion, but it's incredibly tricky to get right for various reasons. It is on our long-term 'we should investigate' roadmap, though.
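    For reference, a small sketch of the transform being discussed (the type names are made up), showing the same particle data as an array of structs versus a struct of arrays:

    Code (CSharp):
    using Unity.Collections;

    // AOS: one struct per particle; fields of different particles are interleaved in memory.
    public struct ParticleAOS
    {
        public float x, y, z; // an AOS buffer would be a NativeArray<ParticleAOS>
    }

    // SOA: one contiguous stream per field; each stream maps naturally onto SIMD lanes.
    public struct ParticlesSOA
    {
        public NativeArray<float> x;
        public NativeArray<float> y;
        public NativeArray<float> z;
    }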
     
  16. runner78

    runner78

    Joined:
    Mar 14, 2015
    Posts:
    792
    Out of pure curiosity, I would suggest adding .NET Core to the benchmark.
     
  17. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    And then you may as well also benchmark the results on different CPU architectures, as this makes a fairly big difference to some compilers...
     
  18. runner78

    runner78

    Joined:
    Mar 14, 2015
    Posts:
    792
  19. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    People in my community have requested several times that I run the benchmarks with CoreCLR too, so here are the results for RyuJIT (.NET Core 2.2.402). The implementation is available here.
     
    runner78 and hippocoder like this.
  20. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194

    Oops, had to hand tweak my graph generator.
     
  21. runner78

    runner78

    Joined:
    Mar 14, 2015
    Posts:
    792
    I wonder why RyuJIT takes so long on the "Sieve of Eratosthenes" and "Particle Kinematics" benchmarks.
     
  22. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I reported this to the CoreCLR developers yesterday; you can track the issue here. The Mono guys discovered some issues as well.
     
  23. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Thanks to the .NET team, it turned out that Tiered Compilation was enabled, which negatively affects performance in these tests. Here's the diff with much better results across all benchmarks.
     
  24. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
  25. Vincenzo

    Vincenzo

    Joined:
    Feb 29, 2012
    Posts:
    146
    Sorry to bump this thread, but I found out why the performance of Mono in this test is so low.

    Floats are calculated as doubles!
    This is a known problem that the Mono project solved in April 2018.

    We can have an easy free lunch here.

    I made a pull request on the Unity Mono GitHub.

    Unity team, please consider merging in these simple changes; it should cost your programmers no more than an hour and would make float performance in all our projects 50 to 100% faster!
    @Joachim_Ante @xoofx

    https://github.com/Unity-Technologies/mono/pull/1258
     
    Last edited: Feb 18, 2020
  26. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I mentioned this problem here on the forums about two years ago or so, and I believe Aras, for example, was aware of it, but unfortunately it was ignored.

    This is actually only the tip of the iceberg; if you dig even deeper, you will notice that the JIT itself lacks many intrinsic functions and misses crucial optimizations from upstream.
     
    Last edited: Feb 23, 2020
    NotaNaN likes this.
  27. jamespaterson

    jamespaterson

    Joined:
    Jun 19, 2018
    Posts:
    401
    Thanks for making the pull request. Out of interest, in the scenario where Unity does not accept the PR, does anyone know if it is possible to shim a custom build of the Unity branch of Mono into a Unity standalone build?
     
  28. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    The actual discussion is here. The PR which @Vincenzo made is just a small initial step; way more changes from upstream are required to solve this, especially the math functionality, which needs to be properly adjusted.
     
    jamespaterson likes this.
  29. Vincenzo

    Vincenzo

    Joined:
    Feb 29, 2012
    Posts:
    146
    I honestly don't expect Unity to accept the PR; they could merge in the changes with ease, but for some reason they are afraid of hurting current projects.

    While IL2CPP treats floats as floats, and Burst does too, I don't know why they are unwilling to touch Mono projects.

    Like Nx said, there is way more missing, but this is an easy step towards higher float performance; after such a merge you would probably also need at least all of this to make it work properly:
    https://github.com/mono/mono/commit/b7db3364a0d76bcc2f16c4ea5237216a59080432

    But the point here is just that Unity has been leaving us in the dust for 2 years, and not everybody can switch to Burst/ECS/DOTS/Jobs because, like me, some people have a 200+ GiB, 200k LOC project, and it's performance we're after for stuff released today, not 2 years from now.
     
  30. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,270
    While I would love better Mono performance when debugging, I don't agree with the argument here. Burst and Mathematics have been out of preview for nearly a year now. I recently gave a workshop where I used jobs and Burst to update 100,000 virtual objects and prioritize them into 1,000 real animated GameObjects at the cost it would take to add another 1,000 GameObjects. That involved running frustum culling and sorting of the virtual objects. If you are really math-bound, your code is probably in a format for easy jobification.


    There are people using Mono because IL2CPP broke their projects and they don't know why.

    Back to the original topic of this thread, I am surprised no one has tried the benchmarks with Burst 1.3 yet. It is using a new LLVM, a new SLEEF, and has some fixes that affect vectorization of more branchy algorithms.
     
    NotaNaN and sheredom like this.
  31. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    What about the 'political' impact on IL2CPP and Burst? Technologies that have been shown to be X times faster than Mono would massively lose that advantage.

    Mind you, in a lot of these benchmarks even a doubling of Mono's performance would not change its position.

    @nxrighthere Definitely sounds like an update is needed...
     
    sheredom and NotaNaN like this.
  32. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Well, I no longer have access to the machine that was originally used for the tests, so I think I should re-run them twice, with the old and new versions of the compilers, to obtain a difference between the two. I'll update the results as soon as I install Unity; I haven't touched it for months...
     
    NotaNaN likes this.
  33. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I had a hard time rolling back to previous versions of the compilers, so I just updated everything and re-ran the benchmarks.

    versions.PNG

    Results are published as charts for convenience.

    Strangely, IL2CPP failed the Fibonacci and Seahash tests. It feels like the functions were just stripped out of the binary for some reason; the stopwatch measures only a few ticks of execution. The managed stripping level is set to low and can't be disabled. Regression?

    Also, keep in mind that the tests were done on a different, more powerful machine this time, with a Ryzen 5 1400, so the numbers are radically different.
     
    Last edited: Feb 19, 2020
    hippocoder, Arowx, jdtec and 8 others like this.
  34. Vincenzo

    Vincenzo

    Joined:
    Feb 29, 2012
    Posts:
    146
    From these results we can actually see that Burst was a giant waste of money and time for the Unity team.
    Better to spend that on moving the engine to .NET Core. The difference is negligible, and they are planning a lot of single-precision math optimization for .NET 5...
    Also, that would mean we wouldn't have to work with a very small subset of the .NET language. I'm shocked.
     
    Protagonist and e199 like this.
  35. runner78

    runner78

    Joined:
    Mar 14, 2015
    Posts:
    792
    Just because Burst is not much faster than .NET in some benchmarks doesn't mean it's wasted time; in some benchmarks, like the Pixar raytracer, Burst is much faster. Burst is optimized for SIMD operations, and RyuJIT doesn't have auto-vectorization.
     
    sheredom likes this.
  36. JesOb

    JesOb

    Joined:
    Sep 3, 2012
    Posts:
    1,109
    None of these tests is aimed at the Burst compiler.

    For Burst, all these tests just need to be no worse than the others (Clang, GCC), so we see the expected results.
    To see Burst in action you need to test it on the code it was created for.

    What looks like a waste for the present day is IL2CPP.
    I wish Unity would stay on .NET Core 3.1, or maybe .NET 5, for non-Bursted code and use Burst for high performance in the near future :) and forget IL2CPP as good tech of past days :)
     
  37. Vincenzo

    Vincenzo

    Joined:
    Feb 29, 2012
    Posts:
    146
    The difference between RyuJIT and Burst is becoming negligible, and soon RyuJIT will probably surpass Burst.

    RyuJIT is performing amazingly in benchmarks that are int- or float-heavy and made to put compilers to the test. In such math-heavy situations Burst is not winning by any margin, and it's made for math...

    The whole concept of writing code specifically for Burst is its downfall. People write code to make games.

    Burst only supports 5% of the .NET framework, with the promise of amazing performance which it is not delivering versus .NET Core.

    This is a massive failure from Unity management. All this development time and money wasted.

    If they had put the same effort into moving to .NET Core, all projects today would benefit massively from the performance uplift, without any code rewriting or adapting to some strange new boilerplate-heavy programming.
     
    Protagonist, e199 and Lymdun like this.
  38. BrendonSmuts

    BrendonSmuts

    Joined:
    Jun 12, 2017
    Posts:
    86
    You're either being willfully ignorant of what Burst is about or you're just ignorant. DOTS is not simply about performing well in some particular math heavy benchmarks. Unity having control over the stack brings a ton of benefits for high performance game programming that aren't related to floating point math.
     
    JesOb and siggigg like this.
  39. slime73

    slime73

    Joined:
    May 14, 2017
    Posts:
    107
    Nice! I think DreamingImLatios was asking for Burst 1.3 (currently in preview) benchmarks rather than Burst 1.2, because 1.3 has a bunch of updates to the tools it uses to generate code, but these are nice to have as well.
     
  40. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,270
    Unity is a game/realtime simulation and rendering engine. It is not a general-purpose .NET platform. In this industry (especially AAA) you are required to make the most of the available hardware in order to deliver the best experience to the widest market (because hardware is expensive). Well, such hardware has a few tricks that compilers have a hard time targeting, because general-purpose language constructs don't align with them all that well. One of those tricks is called SIMD. It is not the only one.

    To target this, developers have mostly been writing code in C++ and doing everything they can to convince the compiler that it is absolutely safe to use this specialized hardware. The first step of this is auto-vectorization, where the compiler will pick up on loops and use the hardware. This breaks easily, and the compiler doesn't bother to tell you, because it is a general-purpose compiler and is more concerned with making sure the behavior is correct than with performance (because in most industries that's what matters, since developer time is more costly than servers). So a lot of these C++ programmers often have to resort to specialized intrinsics or assembly to make the compiler more comfortable. But even then, sometimes the compiler just rejects the handcrafted (and difficult-to-handcraft) code and does things the slow way anyway.

    So Burst offers two key technologies that .NET 5 does not. First, it brings this low-level, hardware-aware programming to C#. Second, it plugs directly into LLVM, effectively drugging it with metadata to make LLVM as comfortable as possible with the most aggressive optimizations and specialized hardware usage. Not only that, it is able to extract a lot more information out of the compiler related to performance and optimizations. Unlike GCC and Clang, Burst is designed specifically for targeting performance. But in order to do that, you need to write code geared towards it, to give it the metadata it needs to drug LLVM. That metadata manifests itself mostly through usage of the mathematics library, which is a useful library for game development in general and works across all of Unity's target platforms.

    Right now, Burst has the added novelty of converting slow Mono code into fast native code, so there are significant speedups to be had for Unity users without a lot of effort. That novelty will eventually wear off as RyuJIT rolls in. But by the time that happens, more of Burst's tooling for hardware-specialized optimizations will have matured and expanded. And code written targeting the specialized hardware is going to beat the code generated by a general-purpose compiler any day. The performance gap between them is only widening, thanks to hardware advances in handling this specialized code (because chip designers are running out of ways to improve the general case).

    TL;DR: Burst is going to be relevant even when Unity adopts the .NET runtime. What it offers is more than a compiler.
     
  41. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Ah, so there's a preview version; I didn't know about it. I'll check it out, sure.
     
  42. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    One thing I noticed while integrating this into our internal Burst benchmarking was that the arcfour benchmark compiles away to nothing on MSVC, because this line just returns the idx: https://github.com/nxrighthere/BurstBenchmarks/blob/master/benchmarks.c#L841.

    Now, it might be possible for Burst to emulate this optimization (we'd have to make it somehow understand that free uses a pointer in a very special way ;) ), but I think the benchmark should be updated to use something from each iteration of the loop. Thoughts?
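    To illustrate the suggested fix (the real benchmark lives in benchmarks.c; the names below are hypothetical), the idea is to fold a value from every iteration into a result the caller observes, so no compiler can discard the loop as dead code:

    Code (CSharp):
    // Hypothetical sketch, not the real arcfour benchmark: KeyStep stands in for one
    // iteration's work. Because each iteration feeds the returned checksum, the loop
    // cannot be folded away by dead-code elimination.
    public static class DeadCodeSafeLoop
    {
        public static uint Run(uint iterations)
        {
            uint checksum = 0;
            for (uint i = 0; i < iterations; i++)
                checksum ^= KeyStep(i);
            return checksum; // the observed result keeps the work alive
        }

        static uint KeyStep(uint i) => (i * 2654435761u) ^ (i >> 3); // placeholder workload
    }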

    On a side note: I also noticed a missing fast-math optimization in Mandelbrot while integrating these tests, and I've just landed a fix for that in a future Burst version, so once again thanks for the benchmarks - they are awesome!
     
  43. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    @nxrighthere What happens when you compile the C/C++ code directly using the MSVC compiler?

    At least that would show the impact of IL2CPP and Burst vs C++ on the same compiler.
     
    Last edited: Feb 21, 2020
  44. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Has Unity tested different C++ compilers? There seems to be a marked difference between them.
     
  45. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    @Arowx aye we've tested a bunch, I just happened to be running on Windows while doing the integration and noticed that MSVC was SUPER optimized running in 0ms always :D
     
    hippocoder and nxrighthere like this.
  46. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    @sheredom Ah, interesting, thanks for finding this. Feel free to commit any changes you think are reasonable and I'll merge them.
     
    Last edited: Feb 28, 2020
  47. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,270
    Sorry, yes I was. Though it is nice having 1.2 benchmarked so we can compare the differences and improvements.
     
    Paulo_Mattos and nxrighthere like this.
  48. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Well, MSVC is effectively a C++ compiler, but it supports C only up to C89 with a bit of C99 features (people have been asking Microsoft for years to add support for modern standards, but they don't see value in that).

    So, with a few changes, it compiles and executes just fine, except for the thing that @sheredom mentioned. I haven't added it because the benchmarks are written in C99. Also, I've heard from my colleagues that /fp:fast is unstable and not well supported.
     
  49. Vincenzo

    Vincenzo

    Joined:
    Feb 29, 2012
    Posts:
    146
    @DreamingImLatios, as much as you try to defend the choice Unity made, I think you might have to look further than the excuses Unity has given.

    .NET does supply intrinsics and SIMD. Look at this class, for instance:
    https://docs.microsoft.com/en-us/dotnet/api/system.numerics.vector-1?view=netcore-3.1
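    For reference, a minimal sketch of the System.Numerics.Vector<T> API linked above (the class and method names are made up): Vector<float>.Count lanes are processed per step, with a scalar tail for the remainder, and RyuJIT lowers the vector operations to SSE/AVX where available.

    Code (CSharp):
    using System.Numerics;

    public static class VectorAdd
    {
        // Adds two arrays element-wise using Vector<float>.
        public static void Add(float[] a, float[] b, float[] result)
        {
            int i = 0;
            int width = Vector<float>.Count;
            for (; i <= a.Length - width; i += width)
            {
                var va = new Vector<float>(a, i);
                var vb = new Vector<float>(b, i);
                (va + vb).CopyTo(result, i);
            }
            for (; i < a.Length; i++) // scalar tail
                result[i] = a[i] + b[i];
        }
    }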

    I don't get this idea that the Unity math library is so amazing. All Unity keeps doing is reinventing the wheel: steering away from .NET standards and instead making some kind of HLSL-like language. It's ridiculous. All of this stuff is built into .NET Core now.

    So much engineering effort is put to waste. We are forced to use a tiny subset of C#, not even supporting standard types like arrays. Everything has to be written specifically for Burst, whilst RyuJIT supports the whole language without limitation at similar performance.

    I repeat: if Unity had spent all this development effort on moving Unity to .NET Core, all our projects would have benefited.
    With the extra time saved they could also have pushed the .NET 5 project along, if anything was missing.

    The final nail in the Burst coffin will be the addition of the LLVM JIT and AOT compiler to .NET 5. At that point Burst will offer nothing more than downsides.

    The business choices made by Unity have been a giant mistake, and we as paying professional companies with real production games should speak up.
     
    Last edited: Feb 22, 2020
    Protagonist, Lymdun and e199 like this.
  50. Lymdun

    Lymdun

    Joined:
    Jan 1, 2017
    Posts:
    46
    nxrighthere and hippocoder like this.