
Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

Discussion in 'Burst' started by nxrighthere, Jul 23, 2019.

Thread Status:
Not open for further replies.
  1. JesOb

    JesOb

    Joined:
    Sep 3, 2012
    Posts:
    1,109
    Yes, you are right: you just don't understand how a computer works and why Unity creates new tech the way they do.
    You can read more about how a CPU works :) Many people don't agree with you about the math library, because it is now far more useful, and adding this usefulness to the standard .NET one is hard or just impossible to do efficiently. .NET is about OOP: not about maximum performance, but about good performance with minimal effort.
    Burst is about maximum performance with the right code, and it is very good at this. Just read how all of DOTS is done and why; everything is available in the Unity blogs.

    You are wrong on the highlighted words; there is no similar performance between them. This is yet another part of your misunderstanding of maximum performance and how to achieve it, and the tests above are not about that.

    Totally agree, but when Unity started creating the new tech even .NET Core 3.1 did not exist, and only .NET 5 has feature parity with .NET Framework, so .NET 5 is the only version Unity can move to without breaking most assets on the Asset Store.

    This is very good tech for the future, and it really can replace IL2CPP and be a good complement to Burst.
    You are just not smart enough today to understand this; just read more about what Burst is and what its goals and benefits are.

    The same as above :) Just read more before you speak and you will not look so ridiculous :)
     
    stuepfnick and RaL like this.
  2. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    OK, would it be possible to take the C++ code that IL2CPP generates and run it through other compilers? This would give us the same comparison of MSVC vs Clang vs GCC.

    Would this also give us a great benchmark for comparing the impact of IL2CPP and Burst?
     
  3. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    Alright. Let's be more civil here. We can test claims here.

    I wrote an article about an actual algorithm I optimized using the tools within the Burst package to get the absolute most out of the hardware (It's not 100% optimized yet, but it provides a good insight into the workflow). https://github.com/Dreaming381/Lati...ptimization Adventures/Find Pairs - Part 1.md This was written with Burst 1.2. Burst 1.3 has some fixes that improve some of the variants I wrote.

    Anyways, for everyone who believes that Burst was a waste of time compared to .Net 5, I would like to know how I would do such kinds of optimization in .Net 5 and what the output assembly looks like. This is not me trying to prove I am right. If I am wrong, I am genuinely curious what those tools in .Net 5 are!

    @sheredom I've been writing the next article in the series lately and I ran into a bizarre issue where simply converting from AoS to SoA was taking 33% longer than converting from SoA to AoS with transforms applied to half the data. I've narrowed this down to the writes being slow, but I am still not sure why. I'm suspicious this might be related to AVX and store buffers not having correct alignment or something. Is this something you would like an early look at before I release it in mid-March? I totally understand if you don't have time.
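For readers unfamiliar with the two layouts, here is a minimal C++ sketch of an AoS to SoA conversion and its inverse. The `Aabb` record and its six fields are hypothetical stand-ins, since the post does not say what data was being converted. The sketch only shows the access pattern: AoS to SoA reads one stream and writes six, while SoA to AoS reads six and writes one. It does not claim to explain the slowdown described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element: fields are interleaved in memory.
// (Hypothetical layout, chosen only for illustration.)
struct Aabb {
    float minX, minY, minZ;
    float maxX, maxY, maxZ;
};

// Structure-of-arrays: one contiguous array per field.
struct AabbSoA {
    std::vector<float> minX, minY, minZ;
    std::vector<float> maxX, maxY, maxZ;

    explicit AabbSoA(std::size_t n)
        : minX(n), minY(n), minZ(n), maxX(n), maxY(n), maxZ(n) {}

    std::size_t size() const { return minX.size(); }
};

// AoS -> SoA: one sequential read stream, six separate write streams.
AabbSoA toSoA(const std::vector<Aabb>& aos) {
    AabbSoA soa(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa.minX[i] = aos[i].minX;
        soa.minY[i] = aos[i].minY;
        soa.minZ[i] = aos[i].minZ;
        soa.maxX[i] = aos[i].maxX;
        soa.maxY[i] = aos[i].maxY;
        soa.maxZ[i] = aos[i].maxZ;
    }
    return soa;
}

// SoA -> AoS: six read streams, one sequential write stream.
std::vector<Aabb> toAoS(const AabbSoA& soa) {
    // All field arrays are expected to have the same length.
    assert(soa.maxZ.size() == soa.size());
    std::vector<Aabb> aos(soa.size());
    for (std::size_t i = 0; i < soa.size(); ++i) {
        aos[i] = Aabb{soa.minX[i], soa.minY[i], soa.minZ[i],
                      soa.maxX[i], soa.maxY[i], soa.maxZ[i]};
    }
    return aos;
}
```

Calling `toAoS(toSoA(boxes))` on a `std::vector<Aabb>` should return the original records, which makes a round trip a convenient correctness check before timing either direction.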
     
  4. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Well, I think Unity could provide an option to choose a compilation toolchain for IL2CPP (MSVC or Clang), but not sure if this will carry any benefits. GCC, for example, is rarely used in the game development industry.
     
  5. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    That would be cool... @Unity optional c-compiler?

    I wonder if Unity has considered zapcc, a caching C++ compiler that claims to speed up builds by 2x-5x and rebuilds by 10x-50x.
     
  6. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Yes, compilation time is one of the biggest and most annoying problems of C++, especially in a complex codebase. The upcoming modules feature in C++20 might improve it significantly, so IL2CPP should benefit from it as well. We'll see.
     
  7. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    Yeah feel free to loop me in. One thing with AVX is it is super easy to become load/store bound because of the huge vectors you are streaming in/out (much more so than with SSE). I'd be interested if you had a repro project I could look at though!
     
    DreamingImLatios likes this.
  8. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It sounds like modules introduce a compile-before dependency in builds, one that can remove parallelism from the build pipeline (see "C++ Modules Might Be Dead-on-Arrival").

    zapcc appears to just use in-memory caching of build files, which provides a big performance boost because it avoids the file I/O time.

    If modules can be cached in memory and the issue with compile-order dependencies is solved, then they could be amazingly fast.

    On the other hand, if Unity had an option to specify a build cache location, developers could just use a RAM drive. This could massively reduce build times, as long as developers have enough RAM.

    After all, even modules only speed up C++ compilation, not other build targets such as Mono or WebAssembly.
     
    nxrighthere likes this.
  9. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I just finished reading the box-pruning series by Pierre. Not a fan of mastering and optimizing algorithms, but this one has a simple broadphase. Here's what you can use in .NET Core 3.0 or higher to reproduce the optimization steps:

    1. Analysis of JIT ASM
    If you want to know what the JIT is producing from your code, you can use Disasmo, JitBuddy, JitDasm, or the online tool SharpLab. These will give you the required insights. Use release builds and disable tiered compilation to get proper ASM.

    2. Straightforward hardware acceleration
    Vector types in System.Numerics give you an easy way to get a decent performance boost (in comparison to a naive implementation) without any extra effort. Just check whether Vector.IsHardwareAccelerated indicates that vector operations are subject to hardware acceleration, and RyuJIT will do the rest. You can also use the generic Vector<T> for primitive numeric data types that are directly supported by the CPU.

    3. Advanced SIMD optimization
    If you want to get more control over SIMD instructions, you can use System.Runtime.Intrinsics. This should be enough to squeeze as much performance as possible out of your CPU in the JIT environment, but this overcomplicates code significantly. Here's an example of AVX-optimized summation.

    4. Unmanaged memory
    If at some point GC makes you unhappy and you want to get more control over memory at the cost of safety, consider the unmanaged allocator such as rpmalloc. This one is much faster than Unity's TLSF and even faster than stackalloc for memory blocks of >= 1024 bytes. You can mix it with Span<T> to get some safety for processing memory, and it can be mixed with hardware-accelerated vector types as well as with SIMD abstraction layer out of the box.
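As a rough illustration of what a SIMD summation does, here is a minimal sketch in plain C++ rather than .NET. This is not the .NET example referred to above and it uses no intrinsics: it keeps eight independent partial sums (the width of one 256-bit register of floats) and reduces them at the end, which is the pattern an AVX version follows. The name `laneSum` and the constant `kLanes` are made up for this sketch, and whether the compiler actually emits vector instructions depends on its flags.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// 8 floats fill one 256-bit register; assumed lane count for this sketch.
constexpr std::size_t kLanes = 8;

float laneSum(const std::vector<float>& data) {
    std::array<float, kLanes> acc{};  // zero-initialized partial sums

    // Largest multiple of kLanes that fits in the input.
    const std::size_t vectorEnd = data.size() - data.size() % kLanes;
    assert(vectorEnd <= data.size());

    // Main loop: each lane accumulates every kLanes-th element.
    for (std::size_t i = 0; i < vectorEnd; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            acc[lane] += data[i + lane];
        }
    }

    // Horizontal reduction of the partial sums.
    float total = 0.0f;
    for (float partial : acc) {
        total += partial;
    }

    // Scalar tail for the elements that did not fill a whole vector.
    for (std::size_t i = vectorEnd; i < data.size(); ++i) {
        total += data[i];
    }
    return total;
}
```

Because the additions happen in a different order than in a simple left-to-right loop, the floating-point result can differ slightly from a naive sum on inputs that are not exactly representable.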

    I hope it helps.
     
    Last edited: Mar 7, 2020
    bb8_1, SenseEater, Vincenzo and 3 others like this.
  10. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    It is simple, but it is still way faster than what most people do, and has a lot of practical use cases outside of Physics. One of the main draws to Data-Oriented Design outside of performance is that the logic can freely reinterpret the semantics of the data, allowing general purpose algorithms to be much more reusable out of the box. I don't optimize every algorithm, just the ones I use all over the place.

    Very helpful! Thank you!
     
    MegamaDev and nxrighthere like this.
  11. liiir1985

    liiir1985

    Joined:
    Jul 30, 2014
    Posts:
    147
    CoreCLR won't be an option in the near future, simply because it won't run on non-JIT platforms such as most consoles and iPhone. You can't assume that Unity only runs on desktops, right? .NET Core performs well on x86, but not nearly as well on other CPU architectures. It still has a long way to go, since the JIT and AOT engines must be hand-written and optimized for every single platform it supports; this is where LLVM provides great benefits! It's true that RyuJIT's x86 codegen does an amazing job, but the arm32 codegen was still in preview and arm64 still in alpha when I last checked. It also hasn't provided any AOT solution so far, and CoreRT won't be available in the near future. It's very unlikely that Microsoft would implement support for game consoles other than Xbox, or even improve performance on such platforms. That leaves Unity no choice but to implement its own IL2CPP, and hence the dear old Boehm GC, because it's simply the only possible solution.

    Unity has many talented developers, and it's impossible that they would just ignore the perfect solution if it existed. So making Burst or even IL2CPP is not a waste of money at all; it is vital for Unity.

    Really, if you have ever looked into this area, you'll know that IL2CPP saved the lives of millions of game developers around the world, and Burst will continue to enhance the ecosystem even further. Otherwise, since Apple forced developers to deliver 64-bit builds back in 2014, Unity would be dead. IL2CPP + Burst is the only solution that brings all these nice features and performance improvements to all the platforms Unity supports.
     
    Last edited: Mar 6, 2020
  12. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    What about the potential underutilisation of code cache combined with data bandwidth hogging potential of smaller but potentially more re-usable Systems e.g.

    A Move System could just add two float3's, massively under utilising the code caches but hogging the RAM bandwidth of your system.

    Only for other systems to then re-use that data over and over within a frame. e.g. Control/Collision/Proximity/Physics systems will then reuse the same data often.

    Or how can you profile your systems to work out what systems could be combined to gain the most processing from your available bandwidth?
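To make the question concrete, here is a minimal C++ sketch of two small systems versus one combined system. `Float3`, `moveSystem`, `clampSystem` and `moveAndClampSystem` are hypothetical names, not Unity or DOTS APIs. Run back to back, the two small systems traverse the position array twice; the combined one does the same work in a single traversal, so each position is loaded and stored once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a float3 component (hypothetical, not Unity's type).
struct Float3 {
    float x, y, z;
};

// Small system 1: adds velocity to position. Very little arithmetic per byte moved.
void moveSystem(std::vector<Float3>& pos, const std::vector<Float3>& vel) {
    assert(pos.size() == vel.size());
    for (std::size_t i = 0; i < pos.size(); ++i) {
        pos[i].x += vel[i].x;
        pos[i].y += vel[i].y;
        pos[i].z += vel[i].z;
    }
}

// Small system 2: keeps every position at or above a floor height.
// Walks the whole position array a second time.
void clampSystem(std::vector<Float3>& pos, float floorY) {
    for (Float3& p : pos) {
        p.y = std::max(p.y, floorY);
    }
}

// Combined system: same result as moveSystem followed by clampSystem,
// but with a single pass over the position array.
void moveAndClampSystem(std::vector<Float3>& pos,
                        const std::vector<Float3>& vel,
                        float floorY) {
    assert(pos.size() == vel.size());
    for (std::size_t i = 0; i < pos.size(); ++i) {
        pos[i].x += vel[i].x;
        pos[i].y = std::max(pos[i].y + vel[i].y, floorY);
        pos[i].z += vel[i].z;
    }
}
```

Whether the combined version is actually faster depends on the data size relative to the caches and on what else runs between the two systems, so it would need to be measured rather than assumed.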
     
  13. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,267
    It'd be cool if Burst had attributes like [HotLoop] or [ColdBranch] (Dang it C# language!) that could help the code compile into something cleaner in this regard.

    I don't write algorithms at that level of granularity. The reusable algorithms tend to be non-trivial (at least non-trivial enough that I wouldn't be surprised when a junior-level programmer screws it up). Spatial queries, statistics, searching, sorting, culling, and FFT just to name a few.

    Algorithm-mixing to solve bandwidth problems is a cool concept but difficult to do in practice because even the ALU-heavy algorithms tend to require a fair amount of bandwidth. A lot of this ends up being trial-and-error. My next optimization article covers identifying and working with a bandwidth-bound problem. I'm hoping to get it out sometime this month.
     
  14. Ramobo

    Ramobo

    Joined:
    Dec 26, 2018
    Posts:
    212
    Could you also add CoreRT benchmarks?
     
  15. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I can take a look at what I can do, but as far as I know, Microsoft has no plans to productize it. I think it's better to wait for later .NET 5 releases where the company will provide more information about AOT. By the way, the first preview release is available.
     
  16. Ramobo

    Ramobo

    Joined:
    Dec 26, 2018
    Posts:
    212
    I know about the .NET 5 preview. Even if Microsoft has no plans to productize CoreRT, it's there. Might as well see how well IL2CPP does against that.
     
  17. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I've tried to compile the code using the latest .NET Core with CoreRT, but the compiler just hangs for some reason, not sure how to debug this.
     
  18. Ramobo

    Ramobo

    Joined:
    Dec 26, 2018
    Posts:
    212
    Well, damn. Can reproduce.
     
  19. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Thread is closed due to Unity now having this testing internally. If you have an ongoing conversation please PM @nxrighthere thanks.
     
    MegamaDev, sheredom and nxrighthere like this.