
Burst, SIMD and float3 / float4 - best practices

Discussion in 'Burst' started by korzen303, Apr 19, 2018.

  1. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    Hi, I am preparing a tutorial on the Job System and Burst. I did some simple Burst performance measurements using different approaches to float3/float4 handling in terms of SIMD. I was always confused when writing algorithms that mix these two types; I wasn't sure whether extra operations get added due to swizzles, data loading, etc.

    Although it is a simple synthetic test, maybe some of you will find it useful.

    The table below shows timings in milliseconds for 100k, 1M and 10M elements in the data arrays, with the instruction count in brackets:

    Approach (instruction count)   100k      1M        10M
    Float3 (18)                    0.28 ms   2.19 ms   21.8 ms
    Float4to3 (15)                 0.28 ms   2.62 ms   26.1 ms
    Float4 (14)                    0.28 ms   2.64 ms   26.1 ms

    This is also interesting in the context of the GPU, but I guess it is valid for SIMD too:
    https://www.gamedev.net/forums/topi...etimes-float3/?do=findComment&comment=3188515


    The question is whether the Burst compiler can also combine different operations together and, in general, what the best practices are regarding this topic. If someone from Unity (@Joachim_Ante?) could comment on this, that would be much appreciated.



    First, just for reference, a scalar version: 14 assembly instructions.

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float1Job : IJob
    {
        public int dataSize;
        [ReadOnly] public NativeArray<float> dataA;
        [ReadOnly] public NativeArray<float> dataB;
        [WriteOnly] public NativeArray<float> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float a = dataA[i];
                float b = dataB[i];
                float sum = a + b;
                float mul = a * b;
                float res = (sum - mul) / 10.0f;
                dataOut[i] = res;
            }
        }
    }
    Code (ASM):
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movss   xmm1, dword ptr [r8 + rax]
    movss   xmm2, dword ptr [r9 + rax]
    movaps  xmm3, xmm2
    addss   xmm3, xmm1
    mulss   xmm2, xmm1
    subss   xmm3, xmm2
    mulss   xmm3, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movss   dword ptr [rdx + rax], xmm3
    inc     r10d
    add     eax, 4
    cmp     r10d, dword ptr [rcx]

    Next, both the data and the calculations use float3s. We get 18 instructions, with a few extra insertps and extractps instructions to prepare the data for SIMD:

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float3Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float3> dataA;
        [ReadOnly] public NativeArray<float3> dataB;
        [WriteOnly] public NativeArray<float3> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float3 a = dataA[i];
                float3 b = dataB[i];
                float3 sum = a + b;
                float3 mul = a * b;
                float3 res = (sum - mul) / 10.0f;
                dataOut[i] = res;
            }
        }
    }
    Code (ASM):
    cdqe
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movsd   xmm1, qword ptr [r8 + rax]
    insertps        xmm1, dword ptr [r8 + rax + 8], 32
    movsd   xmm2, qword ptr [r9 + rax]
    insertps        xmm2, dword ptr [r9 + rax + 8], 32
    movaps  xmm3, xmm2
    addps   xmm3, xmm1
    mulps   xmm2, xmm1
    subps   xmm3, xmm2
    mulps   xmm3, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movss   dword ptr [rdx + rax], xmm3
    extractps       dword ptr [rdx + rax + 4], xmm3, 1
    extractps       dword ptr [rdx + rax + 8], xmm3, 2
    inc     r10d
    add     eax, 12
    cmp     r10d, dword ptr [rcx]

    Next, I used float4 arrays but swizzled them down to float3s for the computations. We get 15 instructions, with only one extra blendps:

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float4to3Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float3 a = dataA[i].xyz;
                float3 b = dataB[i].xyz;
                float3 sum = a + b;
                float3 mul = a * b;
                float3 res = (sum - mul) / 10.0f;
                dataOut[i] = new float4(res, 0);
            }
        }
    }
    Code (ASM):
    cdqe
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movups  xmm2, xmmword ptr [r8 + rax]
    movups  xmm3, xmmword ptr [r9 + rax]
    movaps  xmm4, xmm3
    addps   xmm4, xmm2
    mulps   xmm2, xmm3
    subps   xmm4, xmm2
    mulps   xmm4, xmm0
    blendps xmm4, xmm1, 8
    mov     rdx, qword ptr [rcx + 120]
    movups  xmmword ptr [rdx + rax], xmm4
    inc     r10d
    add     eax, 16
    cmp     r10d, dword ptr [rcx]
    Finally, both the data and the computations use float4s. We get 14 instructions, just like the reference scalar code.

    Code (CSharp):
    [ComputeJobOptimization]
    struct Float4Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float4 a = dataA[i];
                float4 b = dataB[i];
                float4 sum = a + b;
                float4 mul = a * b;
                float4 res = (sum - mul) / 10.0f;
                dataOut[i] = res;
            }
        }
    }
    Code (ASM):
    cdqe
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movups  xmm1, xmmword ptr [r8 + rax]
    movups  xmm2, xmmword ptr [r9 + rax]
    movaps  xmm3, xmm2
    addps   xmm3, xmm1
    mulps   xmm1, xmm2
    subps   xmm3, xmm1
    mulps   xmm3, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movups  xmmword ptr [rdx + rax], xmm3
    inc     r10d
    add     eax, 16
    cmp     r10d, dword ptr [rcx]
     
  2. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    So I have made an additional test, and it seems that Burst won't combine float3 and float operations into a single SIMD instruction. In other words, it will load the float3 as a float4 to do the SIMD addps and then do a scalar addss on the float.

    In this case, I think the conclusion is: if we are not memory bound, it is fastest to store the data as float4s. We get all the basic arithmetic operations (addps, mulps, etc.) with no insertps and extractps overhead, and using float3 ops such as math.cross via the float4.xyz swizzle costs only a single extra blendps.
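
    For illustration, a minimal sketch of this "store as float4, compute as float3 via .xyz" pattern with math.cross (not one of the measured jobs above; the job and field names are made up, and it assumes the same [ComputeJobOptimization] job skeleton as the examples in this thread):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [ComputeJobOptimization]
    struct CrossJob : IJob
    {
        public int dataSize;

        // xyz holds the payload, w is unused padding
        [ReadOnly] public NativeArray<float4> dataA;
        [ReadOnly] public NativeArray<float4> dataB;
        [WriteOnly] public NativeArray<float4> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                // float3 math on the xyz swizzles; only the float4 write-back
                // should need the single extra blend mentioned above
                float3 c = math.cross(dataA[i].xyz, dataB[i].xyz);
                dataOut[i] = new float4(c, 0.0f);
            }
        }
    }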

    Also, the Burst compiler won't unroll loops (https://forum.unity.com/threads/loop-vectorization-unroll.527355/) nor combine statements like the ones below into SIMD. Long story short, you need to explicitly ensure that vectorization is actually being used. Or am I missing something?

    Code (CSharp):
    for (int i = 0; i < dataSize; i += 4)
    {
        dataOut[i + 0] = dataA[i + 0] + dataB[i + 0];
        dataOut[i + 1] = dataA[i + 1] + dataB[i + 1];
        dataOut[i + 2] = dataA[i + 2] + dataB[i + 2];
        dataOut[i + 3] = dataA[i + 3] + dataB[i + 3];
    }


    Code (CSharp):
    [ComputeJobOptimization]
    struct FloatAndFloat3Job : IJob
    {
        public int dataSize;

        [ReadOnly] public NativeArray<float> dataS;
        [ReadOnly] public NativeArray<float3> dataA;
        [ReadOnly] public NativeArray<float3> dataB;
        [WriteOnly] public NativeArray<float3> dataOut;

        public void Execute()
        {
            for (int i = 0; i < dataSize; i++)
            {
                float3 a = dataA[i].xyz;
                float3 b = dataB[i].xyz;
                float s = dataS[i];
                float3 sum = a + b;
                float sumS = s + s;
                float3 mul = a * b;
                float3 res = (sum - mul);
                dataOut[i] = res * sumS;
            }
        }
    }
    Code (ASM):
    mov     r9, qword ptr [rcx + 8]
    mov     r10, qword ptr [rcx + 64]
    cdqe
    movsd   xmm0, qword ptr [r10 + rax]
    insertps        xmm0, dword ptr [r10 + rax + 8], 32
    mov     r10, qword ptr [rcx + 120]
    movsd   xmm1, qword ptr [r10 + rax]
    insertps        xmm1, dword ptr [r10 + rax + 8], 32
    movsxd  rdx, edx
    movss   xmm2, dword ptr [r9 + rdx]
    movaps  xmm3, xmm1
    addps   xmm3, xmm0
    addss   xmm2, xmm2
    mulps   xmm1, xmm0
    subps   xmm3, xmm1
    shufps  xmm2, xmm2, 192
    mulps   xmm2, xmm3
    mov     r9, qword ptr [rcx + 176]
    movss   dword ptr [r9 + rax], xmm2
    extractps       dword ptr [r9 + rax + 4], xmm2, 1
    extractps       dword ptr [r9 + rax + 8], xmm2, 2
    inc     r8d
    add     edx, 4
    add     eax, 12
    cmp     r8d, dword ptr [rcx]
    jl      .LBB0_2
     
    Last edited: Apr 19, 2018
  3. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    Due to the above, simply converting the data from a float1 array to float4 for SIMD requires more than double the instructions (10 vs 23)! Is there any better way of doing this? Like some casts of variables, or maybe even of whole arrays when setting up the jobs?


    Code (CSharp):
    for (int i = 0; i < dataSize; i++)
    {
        // data arrays are float4
        float4 a = dataA[i];
        float4 b = dataB[i];

        float4 sum = a + b;

        dataOut[i] = sum;
    }
    10 instructions
    Code (ASM):
    mov     r8, qword ptr [rcx + 8]
    mov     r9, qword ptr [rcx + 64]
    movups  xmm0, xmmword ptr [r8 + rax]
    movups  xmm1, xmmword ptr [r9 + rax]
    addps   xmm1, xmm0
    mov     rdx, qword ptr [rcx + 120]
    movups  xmmword ptr [rdx + rax], xmm1
    inc     r10d
    add     eax, 16
    cmp     r10d, dword ptr [rcx]


    Code (CSharp):
    for (int i = 0; i < dataSize; i += 4)
    {
        // data arrays are float1
        float4 a = new float4(dataA[i + 0], dataA[i + 1], dataA[i + 2], dataA[i + 3]);
        float4 b = new float4(dataB[i + 0], dataB[i + 1], dataB[i + 2], dataB[i + 3]);

        float4 sum = a + b;

        dataOut[i + 0] = sum.x;
        dataOut[i + 1] = sum.y;
        dataOut[i + 2] = sum.z;
        dataOut[i + 3] = sum.w;
    }
    23 instructions
    Code (ASM):
    lea     eax, [rdx - 12]
    movsxd  r9, eax
    lea     eax, [rdx - 8]
    movsxd  r10, eax
    lea     eax, [rdx - 4]
    movsxd  r11, eax
    movsxd  rdx, edx
    mov     rax, qword ptr [rcx + 8]
    mov     rsi, qword ptr [rcx + 64]
    movups  xmm0, xmmword ptr [rax + r9]
    movups  xmm1, xmmword ptr [rsi + r9]
    addps   xmm1, xmm0
    mov     rax, qword ptr [rcx + 120]
    movss   dword ptr [rax + r9], xmm1
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + r10], xmm1, 1
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + r11], xmm1, 2
    mov     rax, qword ptr [rcx + 120]
    extractps       dword ptr [rax + rdx], xmm1, 3
    add     r8d, 4
    add     edx, 16
    cmp     r8d, dword ptr [rcx]
     
    Last edited: Apr 19, 2018
  4. M_R

    Joined: Apr 15, 2015
    Posts: 559

    If you have an array of float1 you need to convert them somewhere.
    Or you can do the equivalent of the C++ "reinterpret_cast<float4*>(float1Array)" using unsafe code.
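
    A minimal sketch of that reinterpret idea (not from the thread; it assumes unsafe code is enabled, the float count is a multiple of 4, and the NativeArrayUnsafeUtility pointer accessors; the job and field names are made up):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using Unity.Jobs;
    using Unity.Mathematics;

    [ComputeJobOptimization]
    unsafe struct ReinterpretAddJob : IJob
    {
        public int dataSize;   // number of floats, assumed to be a multiple of 4

        [ReadOnly] public NativeArray<float> dataA;
        [ReadOnly] public NativeArray<float> dataB;
        [WriteOnly] public NativeArray<float> dataOut;

        public void Execute()
        {
            // View the float buffers as float4 buffers without copying
            float4* a = (float4*)dataA.GetUnsafeReadOnlyPtr();
            float4* b = (float4*)dataB.GetUnsafeReadOnlyPtr();
            float4* o = (float4*)dataOut.GetUnsafePtr();

            for (int i = 0; i < dataSize / 4; i++)
                o[i] = a[i] + b[i];
        }
    }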
     
  5. Joachim_Ante (Unity Technologies)

    Joined: Mar 16, 2005
    Posts: 5,203

    If you want to ensure that your code is 100% vectorised, keep everything as float4 / int4 and use manual SoA form.

    Over time we will keep improving Burst, and particularly the auto-vectorisation, to make sure we produce the best possible code when it is written as scalar code.
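
    For illustration, a sketch of what manual SoA form could look like here (not an official sample; the job and field names are made up): each component is kept in its own float4 array, so every add below maps directly to one packed SIMD instruction.

    Code (CSharp):
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [ComputeJobOptimization]
    struct SoAAddJob : IJob
    {
        public int packedCount;   // number of float4 packets, i.e. pointCount / 4

        // Each component lives in its own array (SoA), so one float4 add
        // processes that component for four points at a time.
        [ReadOnly] public NativeArray<float4> xA, yA, zA;
        [ReadOnly] public NativeArray<float4> xB, yB, zB;
        [WriteOnly] public NativeArray<float4> xOut, yOut, zOut;

        public void Execute()
        {
            for (int i = 0; i < packedCount; i++)
            {
                xOut[i] = xA[i] + xB[i];
                yOut[i] = yA[i] + yB[i];
                zOut[i] = zA[i] + zB[i];
            }
        }
    }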
     
  6. korzen303

    Joined: Oct 2, 2012
    Posts: 223

    Noisecrime likes this.
  7. xoofx (Unity Technologies)

    Joined: Nov 5, 2016
    Posts: 417

    Sorry for the late feedback.

    Currently, there is an issue with the released version of Burst (0.2.3): a regression with noalias. As you have experienced, your scalar loops are not auto-vectorized. It has been fixed on our side, and we will hopefully publish a new version in the coming days.

    Note that there will still be cases where the auto-vectorizer won't be able to do its job. For the 2018.2 beta we will give more details about these cases (some of them are in this thread), and we will work to improve auto-vectorization in subsequent releases.
     
    optimise likes this.
  8. xoofx (Unity Technologies)

    Joined: Nov 5, 2016
    Posts: 417

    FYI, we just released a new version of Burst that fixes the regression when auto-vectorizing scalar loops. You can update to the latest Burst package `0.2.4-preview.4` (you can follow this post on how to update your manifest.json).

    This should correctly vectorize the scalar loop in your example above and make it equivalent to the manual float4 loop.