
Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

Discussion in 'Burst' started by nxrighthere, Jul 23, 2019.

Thread Status:
Not open for further replies.
  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    If you fix compiler bugs against code you know is not professional, how can you expect professional-level code to work artifact-free? Surely you should also test the best-optimised solutions, as well as threaded SIMD versions of the same code, against Burst?
     
  2. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Sure. My point is really that it only helps on the condition that there is an equivalent C version using vector types; otherwise it is not super helpful. If you are contributing a SIMD version on the side for both, fair enough (@nxrighthere is likely to accept a PR for this).
     
    nxrighthere likes this.
  3. JesOb

    JesOb

    Joined:
    Sep 3, 2012
    Posts:
    1,109
    Looking at the performance comparison, I'm thinking of using GCC and Clang side by side:
    compile every job with both compilers and allow choosing the more performant result.
     
  4. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Burst uses the lower-level compiler infrastructure LLVM, not Clang directly. Unfortunately, there is no equivalent of LLVM for GCC.
     
  5. JesOb

    JesOb

    Joined:
    Sep 3, 2012
    Posts:
    1,109
    I understand :)
    Then happy upgrading to newer versions and happy optimizing of Burst :) You have done an amazing job :)
     
  6. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Can Burst chunk up loops, e.g. turn a plain float for loop into float4 operations and step the loop by +4?
     
  7. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Yes, but only in the case when memory aliasing is safe.
    Which is not the case in these small benchmarks, unless pointer parameters are annotated with the [NoAlias] attribute (same for the C version that doesn't use __restrict pointers).
    But the auto-vectorizer of LLVM is not perfect, and it can fail easily if it finds something that it doesn't handle well while trying to vectorize a loop.
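    For illustration, a minimal sketch (not from the repository) of the kind of loop this applies to, assuming [NoAlias] on pointer parameters as just described:

    Code (CSharp):

        using Unity.Burst;

        // Illustrative sketch only (not the benchmark code): with [NoAlias] on both
        // pointer parameters (or __restrict in the equivalent C), LLVM's auto-vectorizer
        // may turn this scalar loop into SIMD loads/stores. Without the hint, the
        // compiler has to assume the buffers can overlap and may keep the loop scalar.
        [BurstCompile]
        public static unsafe class VectorizableLoop
        {
            public static void Add([NoAlias] float* destination, [NoAlias] float* source, int count)
            {
                for (int i = 0; i < count; i++)
                {
                    destination[i] += source[i];
                }
            }
        }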
     
    nxrighthere likes this.
  8. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    And how is vectorized math related to this?
     
  9. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    OK, take the 'Pixar' raytracer example: every time it samples a point, it has to convert the byte-array data for the letters back into vector points and check all 60 of the points against the current point being tested.

    It would take an amazing AI compiler to figure out that it only needs to convert these points to vectors once and store them for the check, to get a massive speed-up.

    Also, vectorisation would only kick in once the data has been converted to float3, so the original loop using bytes might even prevent compilers from adopting this optimisation.

    Which is what I did in the faster version of the benchmark, although I ended up using NativeArrays instead of pre-calculating all the point data and pasting it back into the codebase as a float3 array (Unity Jobs gets very fussy about array data).

    But ditto for any benchmark where the grouping and relation of the data and the algorithm are very loose, e.g. hand-coded vector operations using floats as opposed to float3 mathematics functions.
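    Something like this hypothetical sketch (names and decoding are illustrative, not the repository's code):

    Code (CSharp):

        using Unity.Collections;
        using Unity.Mathematics;

        // Hypothetical sketch of the idea above: decode the byte-encoded letter points
        // once into a NativeArray<float3> and reuse it in the hot sampling loop, instead
        // of converting the bytes back into vectors on every sample.
        static NativeArray<float3> DecodeLetterPoints(byte[] letters, Allocator allocator)
        {
            var points = new NativeArray<float3>(letters.Length / 2, allocator);

            for (int i = 0; i < points.Length; i++)
            {
                // Two bytes per point; the offset used here is purely illustrative.
                points[i] = new float3(letters[i * 2] - 79, letters[i * 2 + 1] - 79, 0.0f);
            }

            return points; // pass this into the job and index it inside the loop
        }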
     
  10. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    What you are describing are implementation-specific optimizations to get a performance boost in a particular case. I've told you (and Alex) several times that the goal here is different: the goal is not to improve the performance of the algorithms before they are compiled, but to find out how well a compiler itself handles the same codebase across the two languages under the same conditions. Whether these algorithms are slow or fast doesn't matter; what matters is how the compilers handle the code across various tests.
     
    hippocoder likes this.
  11. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    If you want these tests to benefit from vectorized math, a PR is welcome, but it has to cover both languages with an equivalent implementation. SLEEF, which Unity is using, is available on GitHub.
     
    hippocoder likes this.
  12. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    No, I'm saying that poor/sub-optimal implementations do nothing for testing compilers; they just make it near impossible for the compiler technology to bring in good optimisation.

    On the flip side, they make for good compiler test cases to see how the compiler copes with extremely obtuse examples.

    Good programmers will be able to meet the compiler halfway with easy-to-optimise/vectorise code.

    But don't take my word for it: you should produce a range of versions of the benchmarks so we programmers can work out how best to get the most from them, IMHO.
     
  13. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Then you don't know anything about the compilers (which was already obvious from your previous questions). A compiler has no conception of what a good or bad implementation is.
     
    Last edited: Jul 28, 2019
    xVergilx and hippocoder like this.
  14. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Exactly my point: we should try different versions of the same algorithms to learn what makes the biggest improvements.

    Something you can actually see in the improvements I have managed to make to the Burst benchmarks.

    Then, ideally, Unity and the compiler programmers can work out how to improve their technology.
     
  15. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    One more time: it's beyond the purpose of these tests. If you are interested in optimizing the algorithms themselves, then fork the repository, remove the C code, and use it as a basis for your experiments.
     
    alexzzzz, xVergilx, pvloon and 5 others like this.
  16. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Arowx, feel free to clone and change the experiment/testing here, but this thread is not for what you are trying to impose on it.
     
  17. TurboNuke

    TurboNuke

    Joined:
    Dec 20, 2014
    Posts:
    69
    This has become a bizarre thread! Arowx, you're trying to compare oranges in an apples discussion.
     
  18. Shinyclef

    Shinyclef

    Joined:
    Nov 20, 2013
    Posts:
    505
    To be fair, there's value in measuring both styles. We can learn how the compiler copes in apples-to-apples comparisons, but we can also learn what we can do as developers to improve performance. They are indeed different things being measured, though, and the focus of this thread seems to be on the compiler.
     
    MegamaDev and nxrighthere like this.
  19. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I've added a new Fireflies Flocking benchmark. This is essentially a minified flocking simulation with separation and cohesion. Since there are many boids (1,000 by default), I've used the persistent allocator in Unity and a portable _mm_malloc() in C; I'm not sure how much impact Unity's TLSF allocator makes there, but it shouldn't be noticeable in this test.

    Here's what I got: FloatPrecision.Standard with FloatMode.Fast reduces performance by up to 35% in comparison with the default mode. Clang is slightly faster than GCC in this test, but Burst shows an even better result.
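    For reference, the switch being compared is just the Burst attribute on the job; a minimal sketch with an illustrative job name and the body omitted:

    Code (CSharp):

        using Unity.Burst;
        using Unity.Jobs;

        // Sketch only, job body omitted: the compile mode being compared above.
        // FloatMode.Fast relaxes strict IEEE 754 behaviour (e.g. allows reassociation),
        // trading precision for speed; the default mode keeps the stricter semantics.
        [BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]
        struct FirefliesFlockingJob : IJob
        {
            public void Execute()
            {
                // separation and cohesion updates for each boid would go here
            }
        }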
     
    Last edited: Jul 29, 2019
  20. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Also, I'm experimenting with another benchmark using [NoAlias] and the restrict keyword. I think the only way to properly measure the impact of these hints is to prevent method inlining?
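    One possible way, assuming the compilers honour the standard .NET no-inlining hint (an assumption on my part), might look roughly like this:

    Code (CSharp):

        using System.Runtime.CompilerServices;
        using Unity.Burst;

        // Illustrative sketch only: keeping the annotated function as a real call would
        // isolate the effect of the aliasing hints on its body. Whether this is the
        // right methodology is exactly the open question above.
        [BurstCompile]
        public static unsafe class AliasProbe
        {
            [MethodImpl(MethodImplOptions.NoInlining)]
            public static void Scale([NoAlias] float* output, [NoAlias] float* input, int count)
            {
                for (int i = 0; i < count; i++)
                {
                    output[i] = input[i] * 2.0f;
                }
            }
        }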
     
    Last edited: Jul 31, 2019
  21. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I shall consider doing that, as I think the current approach here is missing out on the chance to see what DOTS and Burst can do, and to explore how to get the best out of the technology.

    If I were to convert the mathematics library to C/C++, we could compare oranges with oranges.
     
  22. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    You don't need to convert anything; I've mentioned SLEEF here twice. The only difference is that Burst applies SLEEF intrinsics after the translation to LLVM IR; you can use the functionality of the library natively in C, as well as achieve non-aliasing memory access (which the DOTS abstractions do for you transparently).
     
  23. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I'm seeing a 1-2% difference between consecutive runs. Should you run these benchmarks multiple times and report the average as the result?
     
  24. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I always run tests multiple times and take average numbers.
     
  25. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I just thought you would include it in the benchmark code to save doing the math yourself.
     
  26. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Indeed, this is a good idea; I will implement it a bit later.
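    Something along these lines, as a hypothetical sketch rather than the repository's code:

    Code (CSharp):

        using System;
        using System.Diagnostics;

        // Hypothetical sketch of such a harness: run a benchmark several times and
        // report the average tick count, smoothing out the 1-2% run-to-run variance
        // mentioned above.
        static long AverageTicks(Action benchmark, int iterations = 5)
        {
            long totalTicks = 0;
            var stopwatch = new Stopwatch();

            for (int i = 0; i < iterations; i++)
            {
                stopwatch.Restart();
                benchmark();
                stopwatch.Stop();
                totalTicks += stopwatch.ElapsedTicks;
            }

            return totalTicks / iterations;
        }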
     
  27. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I've made some fixes to the flocking simulation and updated this post. Again, FloatPrecision.Standard with FloatMode.Fast reduces the overall performance. With the default float mode, Burst shows a better result than GCC and Clang. So there's definitely something wrong with the floating-point optimizations.
     
    Last edited: Jul 29, 2019
  28. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Found the root of the problem: NaNs. They cause this significant performance degradation when the fast float mode is enabled, since the optimizations assume that results and arguments are non-NaN. I've reduced the range in Marsaglia's random number generator, and it affected all compilers since it also changes the algorithms themselves. Burst is now closer to GCC by 20% relative to the previous results, and the attribute is now used to improve performance.
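    For context, the change is only about bounding the generator's output; a rough sketch using a Marsaglia-style xorshift (not the exact generator or constants from the repository):

    Code (CSharp):

        // Rough sketch: a Marsaglia-style xorshift producing floats in a bounded range,
        // so that downstream fast-math code (which assumes non-NaN arguments and
        // results) never sees NaNs or Infs.
        struct XorshiftRandom
        {
            uint state;

            public XorshiftRandom(uint seed)
            {
                state = seed != 0 ? seed : 1u;
            }

            public float NextFloat(float min, float max)
            {
                state ^= state << 13;
                state ^= state >> 17;
                state ^= state << 5;
                return min + (state / (float)uint.MaxValue) * (max - min);
            }
        }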
     
  29. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Have you considered using operator overloading for your vector functions? It would make things easier to read and write, as well as allowing developers to switch between Vector, float3 and float4 with ease.
     
  30. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    C doesn't support this, so I'm not using any language-specific features that differ between the two.
     
  31. jamespaterson

    jamespaterson

    Joined:
    Jun 19, 2018
    Posts:
    399
    Thanks for this thread, it is interesting. I apologise if this is a silly question, but can I please ask if anyone can confirm that, by default, current (e.g. 2018/19) Unity does not use any SIMD instructions for vector/matrix math? And that the only official way to achieve this is via the Burst compiler?
     
  32. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Which API exactly?
     
  33. jamespaterson

    jamespaterson

    Joined:
    Jun 19, 2018
    Posts:
    399
    Thanks. For example, when using operations such as scalar multiplication on the Vector classes, or rotation by a Quaternion, etc.?
     
  34. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    They have very limited SSE intrinsics support there (matrix multiplication based on the __m128 data type, for example), but I don't think it gives noticeable benefits. Burst-optimized Unity.Mathematics has far more intrinsics behind the scenes than the traditional API (in combination with the corresponding changes to your code, of course).
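    As a rough illustration of that combination (illustrative names, not benchmark code):

    Code (CSharp):

        using Unity.Burst;
        using Unity.Collections;
        using Unity.Jobs;
        using Unity.Mathematics;

        // Illustrative sketch: rotating points with Unity.Mathematics inside a
        // Burst-compiled job, which is the combination referred to above. The same
        // loop written against UnityEngine.Vector3/Quaternion would not get the
        // same treatment.
        [BurstCompile]
        struct RotatePointsJob : IJob
        {
            public quaternion Rotation;
            public NativeArray<float3> Points;

            public void Execute()
            {
                for (int i = 0; i < Points.Length; i++)
                {
                    Points[i] = math.mul(Rotation, Points[i]);
                }
            }
        }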
     
  35. jamespaterson

    jamespaterson

    Joined:
    Jun 19, 2018
    Posts:
    399
    ok thanks - much appreciated
     
  36. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I thought you were using compilers with C++ support; doesn't that have operator overloading?
     
  37. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    C and C++ are two different languages with different compilers, semantics, syntax, and so on. They have some historical similarities, but unlike C++, C is a structured programming language where data and functions are naturally separated into multiple blocks of execution; object-oriented concepts are not a thing there. When I need OO features, I prefer C# over C++ (until it becomes necessary to use C++, such as for a job).

     
  38. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It just seems odd that you're limiting yourself to C and comparing it to C#, when both sets of benchmarks could benefit from operator overloading at least. And I do wonder whether C++ compilers might be able to optimise numerical calculations better through operator overloading than through function calls. After all, C is old hat now, so I would expect compiler developers to be more focused on C++ improvements.

    Worth testing to see if there is a difference between C and C++?
     
  39. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Added a Polynomials benchmark; all other tests were re-run with Burst updated to 1.1.2, IL2CPP and Mono to Unity 2019.2.0f1, and Clang to LLVM 8.0.1. Numbers in the flocking simulation changed noticeably due to some fixes and an increased density of boids. Notice that IL2CPP is now faster than Burst in the raytracer and polynomials.
     
    NotaNaN and sngdan like this.
  40. sheredom

    sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    Hey @nxrighthere - I've been looking over these benchmarks to see if there is anything obvious that Burst is missing, and I found what I think is a stack buffer overflow in SieveOfEratosthenes, in both the .cs and the .c:

    The stack array flags is allocated as 1024 bytes, but the code indexes into the 1025th element. Either the flags array needs to be size + 1, or the loop conditions need to be changed to < size rather than <= size, I think.

    See:
    https://github.com/nxrighthere/BurstBenchmarks/blob/master/Benchmarks.cs#L315
    https://github.com/nxrighthere/BurstBenchmarks/blob/master/benchmarks.c#L234
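    A minimal sketch of the second suggested fix (tightening the loop bound), with illustrative code rather than the repository's exact sieve:

    Code (CSharp):

        // Illustrative sketch of the off-by-one described above: if the flags buffer
        // holds `size` elements, the loops must stop at `i < size`; alternatively,
        // allocate `size + 1` elements and keep `<=`.
        static unsafe int CountPrimes(int size)
        {
            byte* flags = stackalloc byte[size];

            for (int i = 0; i < size; i++) // was `i <= size`, which touches flags[size]
                flags[i] = 1;

            int count = 0;

            for (int i = 2; i < size; i++) // same bound fix here
            {
                if (flags[i] == 1)
                {
                    count++;

                    for (int k = i + i; k < size; k += i)
                        flags[k] = 0;
                }
            }

            return count;
        }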

    Thanks for these benchmarks though - super cool that y'all are pushing Burst to its limits :)
     
    stuepfnick and nxrighthere like this.
  41. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    @sheredom Ah, thanks, I hadn't noticed that. There might be other human-factor mistakes; I think it would be better if I hook up something like Valgrind/AddressSanitizer for code analysis.

    I'll push more tests to the repository, I'm just a bit slammed with other work. :rolleyes:
     
    Last edited: Aug 17, 2019
    sheredom likes this.
  42. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Added Particle Kinematics; this one utilizes the persistent memory allocator for 1,000 particles. Burst is slightly faster than Clang in this test, but IL2CPP performs even better. GCC is far ahead.

    The single-precision math tests are complete. Integer math is next.
     
    Last edited: Aug 17, 2019
    NotaNaN likes this.
  43. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    I was very busy over the last month, so the update was delayed, but I'm going to finish this quickly now.

    Added an Arcfour benchmark; Burst and Clang show almost the same results, GCC is 14% faster than Burst in this test, and IL2CPP posts worthy numbers. Going to try some fast hashing algorithm next, I think.

    Also, thanks to the people who contributed on Bountysource, I highly appreciate that, cheers. ;)
     
    Last edited: Sep 12, 2019
    Grizmu and hippocoder like this.
  44. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Added SeaHash; this time all compilers show almost identical results. In general, integer math is more exact and predictable than floating-point, which is crucial for those who trade performance in favor of determinism.
     
  45. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Is it a realistic expectation for Burst to gain parity with GCC in the worst tests, or is it the nature of the beast? I realise there's probably a way to go, but 86% is just cruel.
     
  46. Grizmu

    Grizmu

    Joined:
    Aug 27, 2013
    Posts:
    131
    Luckily he meant that GCC is just 16% quicker than Burst in this test, so it's not that much of a difference:
    Integer math - Arcfour:
    Burst - 97,695,014 ticks (100%)
    GCC - 84,442,342 ticks (86%)
     
    nxrighthere likes this.
  47. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Yes, it's faster by 14%, just wrote the wrong number...
     
  48. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    > Is it a realistic expectation for Burst to gain parity with GCC in the worst tests, or is it the nature of the beast? I realise there's probably a way to go, but 86% is just cruel.

    Our expectation is that Burst-compiled code should be faster than or equal to C++ in all cases.

    That said, as I said before, these benchmarks are not aligned with what optimised code looks like. They are not a good representation of what happens in a game where you want to get good performance.

    That doesn't mean they are completely useless; this kind of scalar code does exist in the real world, of course, and our goal is to be as fast or faster in all cases. That work is ongoing, and with every release we improve.

    But it is important to note that the type of code game developers write when performance is important (actually using SIMD) is already as fast as or faster than C++, and it is not represented in this benchmark.
     
  49. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    567
    Generally, Burst optimizes the code quite well, in some cases even better than the more recent version of Clang itself. It's hard to beat GCC at floating-point operations, since it optimizes them far more aggressively than any other compiler in fast mode. Vectorized math will eliminate this difference or make it insignificant.
     
  50. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Awesome, that's basically no practical difference. This is amazing when you consider how short a time Burst has been in development vs GCC.
     