Loop Vectorization (unroll)

korzen303 · Apr 19, 2018

I have been preparing a tutorial on the Job System and Burst and wanted to demonstrate simple SIMD optimizations. I manually unrolled a loop, yet the generated code is still not vectorized (this is also the case for IJobParallelFor).

EDIT: Here is a more in-depth analysis: https://forum.unity.com/threads/burst-simd-and-float3-float4-best-practices.527504/

Code (CSharp):

[ComputeJobOptimization]
struct SumUnrollJob : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float> dataA;
    [ReadOnly] public NativeArray<float> dataB;
    [WriteOnly] public NativeArray<float> dataOut;

    public void Execute()
    {
        // Manually unrolled by 4; assumes dataSize is a multiple of 4.
        for (int i = 0; i < dataSize; i += 4)
        {
            dataOut[i + 0] = dataA[i + 0] + dataB[i + 0];
            dataOut[i + 1] = dataA[i + 1] + dataB[i + 1];
            dataOut[i + 2] = dataA[i + 2] + dataB[i + 2];
            dataOut[i + 3] = dataA[i + 3] + dataB[i + 3];
        }
    }
}

Code (generated x86-64 assembly):

lea eax, [rdx - 12]
movsxd r10, eax
mov r9, qword ptr [rcx + 8]
mov rax, qword ptr [rcx + 64]
movss xmm0, dword ptr [rax + r10]
addss xmm0, dword ptr [r9 + r10]
mov rax, qword ptr [rcx + 120]
movss dword ptr [rax + r10], xmm0
lea eax, [rdx - 8]
movsxd r10, eax
mov r9, qword ptr [rcx + 8]
mov rax, qword ptr [rcx + 64]
movss xmm0, dword ptr [rax + r10]
addss xmm0, dword ptr [r9 + r10]
mov rax, qword ptr [rcx + 120]
movss dword ptr [rax + r10], xmm0
lea eax, [rdx - 4]
movsxd r10, eax
mov r9, qword ptr [rcx + 8]
mov rax, qword ptr [rcx + 64]
movss xmm0, dword ptr [rax + r10]
addss xmm0, dword ptr [r9 + r10]
mov rax, qword ptr [rcx + 120]
movss dword ptr [rax + r10], xmm0
movsxd rdx, edx
mov r9, qword ptr [rcx + 8]
mov rax, qword ptr [rcx + 64]
movss xmm0, dword ptr [rax + rdx]
addss xmm0, dword ptr [r9 + rdx]
mov rax, qword ptr [rcx + 120]
movss dword ptr [rax + rdx], xmm0
add r8d, 4
add edx, 16
cmp r8d, dword ptr [rcx]
jl .LBB0_2

My next approach was to use float4 explicitly. Although the generated assembly now contains a vectorized addps, I didn't notice any speed-up (still around 20 ms for 1M elements). So I have been wondering whether this code is optimal in terms of memory reads and writes (there seems to be quite a lot going on there).

Code (CSharp):

[ComputeJobOptimization]
struct SumUnroll2Job : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float> dataA;
    [ReadOnly] public NativeArray<float> dataB;
    [WriteOnly] public NativeArray<float> dataOut;

    public void Execute()
    {
        // Manually unrolled by 4; assumes dataSize is a multiple of 4.
        for (int i = 0; i < dataSize; i += 4)
        {
            // Gather four consecutive floats from each input into a float4.
            float4 a = new float4(dataA[i + 0], dataA[i + 1], dataA[i + 2], dataA[i + 3]);
            float4 b = new float4(dataB[i + 0], dataB[i + 1], dataB[i + 2], dataB[i + 3]);

            float4 sum = a + b;

            // Scatter the four lanes back to the output array.
            dataOut[i + 0] = sum.x;
            dataOut[i + 1] = sum.y;
            dataOut[i + 2] = sum.z;
            dataOut[i + 3] = sum.w;
        }
    }
}

Code (generated x86-64 assembly):

lea eax, [rdx - 12]
movsxd r9, eax
lea eax, [rdx - 8]
movsxd r10, eax
lea eax, [rdx - 4]
movsxd r11, eax
movsxd rdx, edx
mov rax, qword ptr [rcx + 8]
mov rsi, qword ptr [rcx + 64]
movups xmm0, xmmword ptr [rax + r9]
movups xmm1, xmmword ptr [rsi + r9]
addps xmm1, xmm0
mov rax, qword ptr [rcx + 120]
movss dword ptr [rax + r9], xmm1
mov rax, qword ptr [rcx + 120]
extractps dword ptr [rax + r10], xmm1, 1
mov rax, qword ptr [rcx + 120]
extractps dword ptr [rax + r11], xmm1, 2
mov rax, qword ptr [rcx + 120]
extractps dword ptr [rax + rdx], xmm1, 3
add r8d, 4
add edx, 16
cmp r8d, dword ptr [rcx]
jl .LBB0_2

Joachim_Ante · Apr 19, 2018

One thing to check for in terms of performance...
"Jobs -> Enable Burst Safety Checks"

Make sure that's disabled when entering play mode. The safety checks have a big impact on performance for simple loops.

deplinenoise · Apr 19, 2018

If you change your NativeArrays to contain float4, and let the inner loop just be dataOut[i] = dataA[i] + dataB[i], you should see much better codegen.
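For readers following along, the suggestion above might look roughly like the sketch below. It is not code from the thread: the job name SumFloat4Job is made up for illustration, and it uses [BurstCompile] from Unity.Burst, which is the attribute name in later Burst releases (the 0.2.x version discussed here used [ComputeJobOptimization]).

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Illustrative sketch only; SumFloat4Job is a hypothetical name.
// [BurstCompile] is assumed here; Burst 0.2.x used [ComputeJobOptimization].
[BurstCompile]
struct SumFloat4Job : IJob
{
    [ReadOnly] public NativeArray<float4> dataA;
    [ReadOnly] public NativeArray<float4> dataB;
    [WriteOnly] public NativeArray<float4> dataOut;

    public void Execute()
    {
        // Each element is a whole float4, so one iteration adds four floats
        // without gathering or scattering individual lanes.
        for (int i = 0; i < dataOut.Length; i++)
        {
            dataOut[i] = dataA[i] + dataB[i];
        }
    }
}
```

With this layout, 1M floats become 250k float4 elements. The float count has to be a multiple of four, or the remaining tail elements need to be handled separately.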

deplinenoise · Apr 19, 2018

There is a problem with aliasing analysis in the current 0.2.3 version of Burst that is causing your second loop to look much worse than it should. We're tracking this issue internally.

GabrieleUnity · May 4, 2018

@korzen303 We fixed the aliasing issues in the latest Burst release. We will update everything soon.
