How well does the Burst compiler batch/unroll jobs?

Arowx · Dec 18, 2018

If you were writing a multi-threaded routine that takes advantage of SIMD instructions at the lowest level, you would batch calculations and unroll loops to use the full bandwidth of your SIMD instruction set.

You need to do this because loading and running SIMD instructions sometimes has an overhead/cost.

In Unity, if your ECS job just moves a game entity, i.e. adds one Vector2 or Vector3 to another, are Burst and Jobs/ECS smart enough to batch several of these operations per cycle?

5argon · Dec 18, 2018

It is smart enough: see https://docs.unity3d.com/Packages/com.unity.burst@0.2/manual/index.html#vector-types and https://docs.unity3d.com/Packages/com.unity.burst@0.2/manual/index.html#memory-aliasing-and-noalias

Arowx · Dec 18, 2018

Hmm, is this what [BurstCompile(CompileSynchronously = true)] does?

Or is it

[BurstCompile(Accuracy.Med, Support.Relaxed)]

or the use of the Unity.Mathematics types float2 and float3

that allows better batched SIMD instruction loading?

eizenhorn · Dec 18, 2018

Arowx said: ↑

Hmm is this what [BurstCompile(CompileSynchronously = true)] does?

or is it the

[BurstCompile(Accuracy.Med, Support.Relaxed)]

https://docs.unity3d.com/Packages/com.unity.burst@0.2/manual/index.html#compiler-options

Arowx · Dec 18, 2018

eizenhorn said: ↑

https://docs.unity3d.com/Packages/com.unity.burst@0.2/manual/index.html#compiler-options

I looked at the manual and I think it could use some better explanations...

So it sounds like you need to use Unity.Mathematics.

The Unity.Mathematics provides vector types (float4, float3...) that are directly mapped to hardware SIMD registers.

Also, many functions from the math type are also mapped directly to hardware SIMD instructions.

Note that currently, for an optimal usage of this library, it is recommended to use SIMD 4 wide types (float4, int4, bool4...)

And using relaxed can improve SIMD...?

The compiler relaxation is defined by the following enumeration:

public enum Support
{
    Strict,
    Relaxed
}

Strict: The compiler is not performing any re-arrangement of the calculation and will be careful at respecting special floating point values (denormals, NaN...). This is the default value.

Relaxed: The compiler can perform instructions re-arrangement and/or using dedicated/less precise hardware SIMD instructions.

Typically, some hardware can support Multiply and Add (e.g mad a * b + c) into a single instruction. Using the relaxed calculation can allow these optimizations. The reordering of these instructions can lead to a lower accuracy.

Using the Relaxed compiler relaxation can be used for many scenarios where the exact order of the calculation and the uniform handling of NaN values are not strictly required.

Then what does this do...

By default, the burst compiler in the editor will compile the jobs asynchronously.

You can change this behavior by setting CompileSynchronously = true for the [BurstCompile] attribute:

Do asynchronously compiled Burst jobs allow multiple batched/unrolled ops per call?

And under known issues...

The target CPU is currently hardcoded per platform (e.g SSE4 for Windows 64 bits)


Does this limit the SIMD instruction set and the optimisations available? SSE4.2, for example, has SIMD string functions: https://en.wikipedia.org/wiki/SSE4

What about the AVX/AVX2 instruction sets? https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

Zuntatos · Dec 18, 2018

Using Unity.Mathematics makes it more likely that SIMD instructions are used (basically guaranteed), versus hoping the compiler SIMD-ifies your scalar float code.

The relaxed setting allows the compiler to change the order of operations to optimize more. Floating-point operations are not as reorderable as you'd expect because of rounding. A random example would be allowing a reordering from "a * 0.5 + b * 0.5" to "(a + b) * 0.5" (not an actual example).

CompileSynchronously is just a way to disable the threaded building in case something goes wrong, I guess. It shouldn't change anything in the generated code.

If the build is hardcoded to SSE4 and below, then yes, you won't have SSE4.2 or AVX / AVX2 / AVX512 instructions. I expect them to build something that automatically compiles the function at multiple levels and chooses the highest level supported at runtime.
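
To make the rounding point concrete, here is a small plain C# sketch. It is not Burst code and does not show what Burst actually emits; the class and method names are made up for illustration. It only shows that regrouping a float sum can change the result, which is the kind of difference a strict mode has to preserve and a relaxed mode is allowed to ignore.

```csharp
using System;

// Hypothetical demo class, not part of Unity or Burst.
public static class FloatReorderDemo
{
    // (a + b) + c, rounding to float after each step.
    public static float SumLeftToRight(float a, float b, float c)
    {
        float ab = (float)(a + b);
        return (float)(ab + c);
    }

    // a + (b + c): same inputs, different grouping.
    public static float SumRightToLeft(float a, float b, float c)
    {
        float bc = (float)(b + c);
        return (float)(a + bc);
    }

    public static void Main()
    {
        float a = 1e8f, b = -1e8f, c = 1f;

        // a and b cancel exactly first, so the 1 survives.
        Console.WriteLine(SumLeftToRight(a, b, c)); // prints 1

        // b + c rounds back to -1e8 (floats near 1e8 are 8 apart),
        // so the 1 is lost before a is added.
        Console.WriteLine(SumRightToLeft(a, b, c)); // prints 0
    }
}
```

Both groupings are mathematically equal, so a compiler that must keep results bit-identical cannot freely swap one for the other.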

eizenhorn · Dec 18, 2018

Arowx said: ↑

I looked at the manual and I think it could use some better explanations...

So it sounds like you need to use Unity.Mathematics.

And using relaxed can improve SIMD

Then what does this do...

Do asynchronously compiled Burst jobs allow multiple batched/unrolled ops per call?

As I remember, it only affects compilation and has nothing to do with SIMD.

Arowx · Dec 19, 2018

Zuntatos said: ↑

Using Unity.Mathematics makes it more likely that SIMD instructions are used (basically guaranteed), versus hoping the compiler SIMD-ifies your scalar float code.

The relaxed setting allows the compiler to change the order of operations to optimize more. Floating-point operations are not as reorderable as you'd expect because of rounding. A random example would be allowing a reordering from "a * 0.5 + b * 0.5" to "(a + b) * 0.5" (not an actual example).

CompileSynchronously is just a way to disable the threaded building in case something goes wrong, I guess. It shouldn't change anything in the generated code.

If the build is hardcoded to SSE4 and below, then yes, you won't have SSE4.2 or AVX / AVX2 / AVX512 instructions. I expect them to build something that automatically compiles the function at multiple levels and chooses the highest level supported at runtime.

eizenhorn said: ↑

As I remember, it only affects compilation and has nothing to do with SIMD.

Thank you. I think the documentation could be improved with a bit more explanation of these topics.

And definitely some examples of how best to use these features, as well as a breakdown of which instructions the Burst compiler can use.

However, my main point is that there could be low-computation ECS systems that process one item at a time, when in fact a CPU core has the computational bandwidth to process several in one step.

Can Burst run more than one computation per cycle? For example, a system that animates a mesh could animate multiple points per cycle, vastly boosting performance.

5argon · Dec 19, 2018

Arowx said: ↑

Can Burst run more than one computation per cycle? For example, a system that animates a mesh could animate multiple points per cycle, vastly boosting performance.

Deducing purely from the documentation: it says that, thanks to no-alias analysis, it is able to vectorize while linearly iterating a NativeArray, and only this exact native container, NativeArray. If each point is in a NativeArray, then yes, it can. And with chunk iteration you get a NativeArray, meaning that NativeArray is the backend of everything. So everything fits what the documentation promised. What leads you to doubt that Burst can do that?

Arowx · Dec 19, 2018

5argon said: ↑

Deducing purely from the documentation: it says that, thanks to no-alias analysis, it is able to vectorize while linearly iterating a NativeArray, and only this exact native container, NativeArray. If each point is in a NativeArray, then yes, it can. And with chunk iteration you get a NativeArray, meaning that NativeArray is the backend of everything. So everything fits what the documentation promised. What leads you to doubt that Burst can do that?

Does that apply only if you are iterating over a NativeArray within a system, or would a simple MoveSystem that adds two vectors automatically process multiple entities per cycle when the hardware SIMD features have the bandwidth?

5argon · Dec 19, 2018

The docs did not say such a specific thing, but it is easy to find out. Why not just try making a simple job that adds native arrays of vectors and look at the assembly for yourself?

Look for instructions with "ap" or "p" in the mnemonic, like vmovaps or vaddps/vsubps (a = aligned, p = packed). If you find some, that's the answer you seek.
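
A minimal sketch of such a test job might look like the following. The names AddVectorsJob, AddVectorsExample, Offsets and Positions are hypothetical, the batch size of 64 is an arbitrary choice, and float4 is used only because the manual recommends 4-wide types. Whether packed instructions actually show up depends on the Burst version and target CPU, so treat this as something to inspect in the Burst Inspector, not as a guarantee.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job: adds one NativeArray of 4-wide vectors to another.
[BurstCompile]
public struct AddVectorsJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float4> Offsets;
    public NativeArray<float4> Positions;

    public void Execute(int index)
    {
        Positions[index] = Positions[index] + Offsets[index];
    }
}

// Hypothetical helper that fills some data and runs the job once.
public static class AddVectorsExample
{
    public static void Run()
    {
        const int count = 1024;
        var positions = new NativeArray<float4>(count, Allocator.TempJob);
        var offsets = new NativeArray<float4>(count, Allocator.TempJob);

        for (int i = 0; i < count; i++)
        {
            positions[i] = new float4((float)i, 0f, 0f, 1f);
            offsets[i] = new float4(1f, 2f, 3f, 0f);
        }

        var job = new AddVectorsJob { Offsets = offsets, Positions = positions };

        // 64 is the number of indices handed to a worker per batch (arbitrary here).
        job.Schedule(count, 64).Complete();

        positions.Dispose();
        offsets.Dispose();
    }
}
```

Call AddVectorsExample.Run() from any script in the editor, then open the job's generated assembly and search for the packed mnemonics mentioned above.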
