How well does the Burst compiler batch/unroll jobs?

Discussion in 'Burst' started by Arowx, Dec 18, 2018.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    If you were writing a multi-threaded routine that takes advantage of SIMD instructions at the lowest level, you would batch calculations and unroll loops to use the full width of your SIMD instruction set.

    You need to do this because loading data for SIMD instructions and running them can carry an overhead.

    In Unity, if your ECS job just moves a game entity, i.e. adds one Vector2 or Vector3 to another, are Burst and Jobs/ECS smart enough to run multiple batched operations per cycle?
     
  2. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
  3. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Hmm, is this what [BurstCompile(CompileSynchronously = true)] does?

    or is it the

    [BurstCompile(Accuracy.Med, Support.Relaxed)]

    or the use of

    Unity.Mathematics float2, float3

    that will allow for better batch SIMD instruction loading?
     
  4. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,683
  5. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I looked at the manual and I think it could use some better explanations...

    So it sounds like you need to use Unity.Mathematics.
    And using relaxed mode can improve SIMD...?

    Then what does this do...

    Do asynchronous Burst jobs allow multiple batched/unrolled ops per call?

    And under known issues...

    Does this limit the SIMD instruction set and the optimisations available? SSE4.2, for example, adds SIMD string functions: https://en.wikipedia.org/wiki/SSE4

    What about the AVX/AVX2 instruction sets? https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
     
    Last edited: Dec 18, 2018
  6. Zuntatos

    Zuntatos

    Joined:
    Nov 18, 2012
    Posts:
    612
    Using Unity.Mathematics makes it more likely that SIMD operations are used (basically guaranteed), versus hoping the compiler SIMD-ifies your float code.

    Relaxed mode allows the compiler to change the order of operations to optimize more. Floating-point operations are not as reorderable as you'd expect, because each operation rounds its result, so a different order can give a slightly different answer. A made-up example is rewriting "a * 0.5 + b * 0.5" as "(a + b) * 0.5" (not taken from actual compiler output).
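To make the rounding point concrete, here is a minimal sketch in plain C (not Burst output, and not the example from the post). The function names `sum_left` and `sum_right` are made up for this illustration. It sums the same three floats in two different orders; the `volatile` temporaries are only there to force each intermediate result to be rounded to single precision.

```c
/* Sum as (a + b) + c. The volatile temporary forces the intermediate
   result to be stored, and therefore rounded, as a 32-bit float. */
float sum_left(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

/* Sum as a + (b + c). Same inputs, different grouping. */
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, `sum_left` returns 1 while `sum_right` returns 0, because -1e8f + 1.0f rounds back to -1e8f in single precision. A strict compiler has to keep the order you wrote; a relaxed mode permits it to pick whichever grouping is faster.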

    CompileSynchronously is just a way to disable the threaded compilation in case something goes wrong, I guess. It shouldn't change the generated code.

    If the build is hardcoded to SSE4 and below, then yes, you won't have SSE4.2 or AVX / AVX2 / AVX512 instructions. I expect them to build something that automatically compiles the function at multiple levels and chooses the highest level supported at runtime.
     
  7. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,683
    As I remember, it only affects compilation and has nothing to do with SIMD.
     
  8. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Thank you. I think the documentation could use some improvement, with a bit more explanation of these topics.

    And definitely some examples on how best to use these features, as well as a breakdown of what instructions the Burst compiler can use.

    However my main point is there could be low computation ECS systems that will be run one at a time, when in actual fact the CPU has the computation bandwidth in a core to run multiple algorithms in one step.

    Can Burst run more than one computation per cycle? For example, a system that animates a mesh could animate multiple points per cycle, vastly boosting performance.
     
  9. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    Deducing purely from the documentation: it says that, thanks to no-alias analysis, Burst is able to vectorize code that iterates linearly over a NativeArray, and only that specific native container. If each point is in a NativeArray, then yes, it can. And with chunk iteration you get a NativeArray, meaning that NativeArray is the backend of everything. So everything fits what the documentation promised. What leads you to doubt that Burst can do that?
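Why aliasing matters for vectorization can be shown with a small plain C sketch (an analogue only, not Burst or NativeArray code; `add_one` is a hypothetical name). The function has no `restrict` qualifier, so the compiler must allow for `dst` and `src` overlapping.

```c
#include <stddef.h>

/* dst[i] = src[i] + 1 for i in [0, n). Without restrict, the compiler
   has to assume a write to dst[i] may change a later src[i + 1]. */
void add_one(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1.0f;
    }
}
```

Called on a zeroed five-element array `a` as `add_one(a + 1, a, 4)`, each iteration reads the value the previous one wrote, so `a` ends up as 0, 1, 2, 3, 4. A naive four-wide version that loaded all four inputs first would produce 0, 1, 1, 1, 1 instead. A compiler therefore has to prove there is no overlap, add a runtime check, or stay scalar; knowing that two containers never alias removes that obstacle.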
     
  10. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Is that only if you are iterating over a NativeArray within a system, or would a simple MoveSystem that adds two vectors automatically run multiple additions per cycle when the hardware SIMD features have the bandwidth?
     
  11. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    The docs did not say such a specific thing, but it is easy to find out. Why not just try making a simple job that adds native arrays of vectors and look at the assembly yourself?

    Look for instructions ending in "ps", like vmovaps, vaddps or vsubps (a = aligned, p = packed, s = single precision). If you find some, that's the answer you seek.
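A rough C analogue of the experiment suggested above, for anyone who wants to try it outside Unity. This is a sketch under assumptions: `vec3` is a hypothetical stand-in for float3, `move_positions` is a made-up name, and `restrict` plays the role of the no-alias guarantee discussed earlier in the thread.

```c
#include <stddef.h>

/* Hypothetical stand-in for a three-component float vector. */
typedef struct { float x, y, z; } vec3;

/* Add each velocity to the matching position. restrict promises the
   compiler that pos and vel do not overlap. */
void move_positions(vec3 *restrict pos, const vec3 *restrict vel, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        pos[i].x += vel[i].x;
        pos[i].y += vel[i].y;
        pos[i].z += vel[i].z;
    }
}
```

Because the structs are laid out back to back, the loop is really 3 * n independent float additions, so an optimizer is free to pack four (or eight) of them into one instruction across entity boundaries. Whether it actually does so depends on the compiler and flags; compiling with something like `gcc -O3 -S` and searching the output for addps or vaddps is one way to check. In Unity, the Burst Inspector serves the same purpose for a job.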