Bug Why are regular for loops in HLSL compute shaders not consistent?

JonArnt · Apr 16, 2022

Note! This question was originally posted as "Help wanted". However, following feedback and continuous debugging, it has been changed to "Bug", as I now find that to be the most likely explanation for my results.

Summarized question Why does unrolling (or not unrolling) a loop affect the accuracy of the computations performed within the loop? And why do the computations performed inside a regular loop depend on the local computer they are run on?

Elaboration and background I am writing a compute shader using HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.

When comparing my results with an equivalent procedure in Python, I noticed deviations in the order of 1e-5. This was concerning, as I did not expect such large errors to be the result of precision differences, e.g., the float precision of trigonometric or power functions in HLSL.

Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I find this strange, as I cannot find any sources indicating that unrolling a loop affects accuracy; they only mention the "space–time tradeoff".

For clarification, if considering my Python results as the correct solution, unrolling the loop in HLSL gives me better results than what not unrolling gives.

Minimal working example Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive me for a somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might be a cause of this deviation. That is, if simply computing cos(x), there is no difference between the unrolled and not unrolled versions. Nonetheless, I still fail to understand how the simple addition of [unroll(N)] in the testing kernel changes the result...

Code (CSharp):

// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{
    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {
        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);

        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];
        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] +
                      " - Not Unrolled ans = " + ansNotUnrolled[i] +
                      " -- Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }

        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}

Code (CSharp):

#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method.
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of the dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        if (abs(fxn) < epsilon) // A solution is found.
            return xn;

        dfxn = dfDummy(xn, fnh, h, theta);
        if (dfxn == 0.0) // No solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }

    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;

    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++) // Not being unrolled.
    {
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }

    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)] for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}
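For comparison, a double-precision reference of the same Newton iteration can be sketched in Python. This is not the Python code mentioned in the post (which is not shown in the thread); the function names `f_dummy`, `df_dummy`, `solve` and `reference_roots` are made up for this illustration, and only the constants and the iteration logic are taken from the shader above.

```python
import math


def f_dummy(k, fnh, h, theta):
    """Same expression as fDummy in the shader, evaluated in binary64."""
    c = math.cos(theta)
    return fnh * fnh * k * h * c * c - math.tanh(k * h)


def df_dummy(k, fnh, h, theta, dk=1e-3):
    """Central finite difference of f_dummy with respect to k."""
    return (f_dummy(k + dk, fnh, h, theta) - f_dummy(k - dk, fnh, h, theta)) / (2.0 * dk)


def solve(fnh, h, theta, x0=10.0, max_iter=50, epsilon=1e-8):
    """Newton's method with the same guess, tolerance and exits as the shader."""
    xn = x0
    for _ in range(max_iter):
        fxn = f_dummy(xn, fnh, h, theta)
        if abs(fxn) < epsilon:   # A solution is found.
            return xn
        dfxn = df_dummy(xn, fnh, h, theta)
        if dfxn == 0.0:          # No solution found.
            return xn
        xn = xn - fxn / dfxn
    return xn                    # No solution found within max_iter.


def reference_roots(n=3, fnh=0.9, h=4.53052, theta=-0.161, dtheta=0.01):
    """Return (theta, root) pairs, accumulating theta like the shader loop."""
    rows = []
    for _ in range(n):
        rows.append((theta, solve(fnh, h, theta)))
        theta += dtheta
    return rows


if __name__ == "__main__":
    for theta, root in reference_roots():
        print(f"theta = {theta!r}  ->  k = {root!r}")
```

Because every operation here is carried out in double precision, any difference against a shader result points at precision handling on the GPU side rather than at the algorithm itself.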

Edit After some more testing, the deviation seems to be connected to the function dfDummy (seen in the compute shader above), which is a central difference of fDummy. The below script shows the exact same code unrolled and not unrolled, giving an error in the order of 1e-12.

Code (CSharp):

void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;

    // --------------- Not being unrolled.
    double theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++)
    {
        _CBNotUnrolled[i] = dfDummy(10.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }

    // --------------- Being unrolled.
    theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)] for (int j = 0; j < N; j++)
    {
        _CBUnrolled[j] = dfDummy(10.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }
}

Kurt-Dekker · Apr 12, 2022

I haven't messed with this toolchain much but is there a way you can see anything about the produced output instructions, kind of like with gcc how you can give it the -S argument to see the resultant assembly language?

This might reveal a bug in the unroller where they (for instance) pass in your constants as float instead of double, or any other differences.

JonArnt · Apr 12, 2022

I must admit that my experience with assembly language is very limited, but here is what I have so far.

First, some minor changes were made to the main script to avoid warnings concerning the double precision of tanh(x) for large x and a race condition for writing to a shared resource. The updated code is as follows, still giving a deviation in the order of 1e-13.

Code (CSharp):

void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;

    int N = 3;

    // --------------- Not being unrolled.
    double theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++)
    {
        _CBNotUnrolled[i] = dfDummy(1.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }

    // --------------- Being unrolled.
    theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)]
    for (int j = 0; j < N; j++)
    {
        _CBUnrolled[j] = dfDummy(1.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }
}

The precompiled code of the above, which was found in the inspector when selecting the compute shader, is as follows. Please correct me if this is not the output you were thinking of.

Code (CSharp):

**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 820:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
//       Double-precision floating point
//       Double-precision extensions for 11.1
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
cs_5_0
dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps | enable11_1DoubleExtensions
dcl_uav_structured u0, 8
dcl_uav_structured u1, 8
dcl_input vThreadID.x
dcl_temps 3
dcl_thread_group 64, 1, 1
 0: ine r0.x, vThreadID.x, l(1)
 1: if_nz r0.x
 2:   ret
 3: endif
 4: dmov r0.xy, d(-0.161000l, 0.000000l)
 5: mov r0.z, l(0)
 6: loop
 7:   ige r0.w, r0.z, l(3)
 8:   breakc_nz r0.w
 9:   dtof r0.w, r0.xyxy
10:   sincos null, r0.w, r0.w
11:   ftod r1.xy, r0.w
12:   dmul r2.xyzw, r1.xyxy, d(3.673391l, 3.666051l)
13:   dmul r1.xyzw, r1.xyxy, r2.xyzw
14:   dadd r1.xyzw, r1.xyzw, d(-0.999770l, -0.999766l)
15:   dadd r1.xy, -r1.zwzw, r1.xyxy
16:   ddiv r1.xy, r1.xyxy, d(0.002000l, 0.000000l)
17:   store_structured u1.xy, r0.z, l(0), r1.xyxx
18:   dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
19:   iadd r0.z, r0.z, l(1)
20: endloop
21: store_structured u0.xy, l(0), l(0), l(-65766687071412712330231808.000000,2.196662,0,0)
22: store_structured u0.xy, l(1), l(0), l(-1190969931871505988190208.000000,2.198071,0,0)
23: store_structured u0.xy, l(2), l(0), l(0.000001,2.199391,0,0)
24: ret
// Approximately 0 instruction slots used

Edit When you mentioned looking for a possible error in the unroller, you made me curious about a statement of mine in the original post, where I stated that if my Python code were to be considered the true solution, then the unrolled loop gives better results than the regular loop. I have now gone through these numbers again, which (assuming no implementation errors of my own) verifies this claim.

Personally, my hypothesis is that Python (and hence also the unrolled loop in HLSL, but with some deviations presumably due to float precision limitations in cos(x) and tanh(x)) is indeed correct, which I am basing on the fact that my main project in Python gives better results than my Unity project does. I have not been able to test the Unity project with unrolled loops, as it only results in a compiler timeout.

JonArnt · Apr 15, 2022

After reaching out to Microsoft (see https://docs.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem is more about Unity. This is because "The pragma unroll [(n)] is keil compiler which Unity uses topic."

Does this help limit the scope of the problem, and does anyone have a suggestion to where I should look when continuing my debugging?

Bunny83 · Apr 15, 2022

JonArnt said: ↑

"The pragma unroll [(n)] is keil compiler which Unity uses topic."

While I haven't ever used the unroll attribute in a shader, I wouldn't trust this statement for several reasons.

unroll is an official HLSL attribute and not a pragma.

I wasn't even aware of the Keil compiler. It appears to be a C compiler for embedded environments based on the ARM architecture. So maybe Unity / Android Studio uses it when targeting Android. Since you're testing inside the editor it's very unlikely (though I can't really confirm) that this has anything to do with the Keil compiler.

To me that sounds like an attempted excuse to not look further into the case, though that's just my opinion.

It's hard to tell what may cause those issues. Since you use doubles everywhere, this could be part of the problem. GPUs are optimised for single (32 bit) float datatypes. Unity's documentation about shader data types does not even mention double. Support for the double type is essentially an "extension", as the comments in the disassembly listing even state:

Code (CSharp):

// Note: shader requires additional functionality:

// Double-precision floating point

// Double-precision extensions for 11.1

Since your margin of error is about in the range of a 32 bit float, I guess there may be some conversion into 32 bit floats somewhere. This "may" be the result of a faulty unrolling code, but this is just pure guesswork...
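To put "in the range of a 32 bit float" into numbers, the spacing of representable values near 1.0 can be measured directly. The sketch below uses only the Python standard library; the helper names `to_f32` and `f32_epsilon` are invented for this illustration and do not come from Unity or HLSL.

```python
import struct
import sys


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_epsilon():
    """Gap between 1.0 and the next larger binary32 value, found by halving."""
    eps = 1.0
    while to_f32(1.0 + eps / 2.0) != 1.0:
        eps /= 2.0
    return eps


if __name__ == "__main__":
    theta = -0.161
    print("binary32 epsilon:", f32_epsilon())
    print("binary64 epsilon:", sys.float_info.epsilon)
    print("relative error of storing theta as binary32:",
          abs(to_f32(theta) - theta) / abs(theta))
```

A single rounding to 32 bits costs a relative error of at most 2^-24 (about 6e-8), so deviations of 1e-8 to 1e-5 are plausible once a few single-precision operations are chained or their errors are amplified by a division by 2e-3. They are far larger than what pure double-precision rounding would produce.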

JonArnt · Apr 15, 2022

I have now reproduced my issue (using the exact code given in the initial question) on a different computer (with the same Unity version), and the deviation is present but not the same. What initially was an error in the order of 1e-5 on my laptop has now become an error in the order of 1e-8. To me, this suggests that the issue is not connected to Unity, although I cannot be certain.

Nonetheless, seeing as this issue does affect Unity projects if encountered, I will continue updating this post as long as I make progress. I am also grateful for previous and new thoughts on where the problem might lie, as this is currently moving further and further away from my computing knowledge and experience.

Neto_Kokku · Apr 15, 2022

Looking at the ASM output, the unrolled version isn't actually performing any calculation: the shader compiler figured out you are using only constant values in your unrolled loop, calculated everything at compilation time, and wrote the results directly in the shader as constants.

Look at the three store_structured instructions at the end of the compiled shader: you can see the computed values right there. The shader compiler probably used actual double precision trigonometric functions while resolving the values, while the not-unrolled loop had to use the single-precision cos/tanh functions.
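The effect described here can be imitated on the CPU. In the disassembly, the loop converts theta to float (dtof), calls sincos, and converts back (ftod) before continuing in double precision. The sketch below models that path with an idealized, correctly rounded single-precision cosine and compares it with an all-double evaluation. This is an assumption made for illustration: a real GPU sincos may be less accurate, it is not known which math library the compiler uses when folding constants, and the sketch does not try to reproduce the exact figures reported in the thread. The names `to_f32`, `cos_f32`, `df_dummy` and `compare` are hypothetical.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def cos_f32(theta):
    """Idealized model of dtof -> sincos -> ftod: cosine limited to binary32."""
    return to_f32(math.cos(to_f32(theta)))


def df_dummy(k, fnh, h, theta, cos_fn):
    """Central difference of fDummy, with a pluggable cosine implementation."""
    def f_dummy(kk):
        c = cos_fn(theta)
        return fnh * fnh * kk * h * c * c - math.tanh(kk * h)
    return (f_dummy(k + 1e-3) - f_dummy(k - 1e-3)) / 2e-3


def compare(n=3, k=1.0, fnh=0.9, h=4.53052, theta=-0.161, dtheta=0.01):
    """Return (theta, all_double, float_cos, difference) for each iteration."""
    rows = []
    for _ in range(n):
        all_double = df_dummy(k, fnh, h, theta, math.cos)
        float_cos = df_dummy(k, fnh, h, theta, cos_f32)
        rows.append((theta, all_double, float_cos, all_double - float_cos))
        theta += dtheta
    return rows


if __name__ == "__main__":
    for theta, all_double, float_cos, diff in compare():
        print(f"theta={theta:+.3f}  double={all_double!r}  "
              f"float cos={float_cos!r}  diff={diff:.3e}")
```

The two columns agree only to roughly single precision, even though every other operation is done in double. That is the kind of mismatch one would expect between values folded at compile time and values computed at run time through a float-only intrinsic.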

JonArnt · Apr 16, 2022

After performing multiple changes to my code, I now believe there to be a bug present in the regular for-loops in HLSL compute shaders (not sure if only for Unity or in general though).

Summary:

Regular for loops in HLSL compute shaders give different results depending on the local computer they are run on (see image below).

Regular for loops also give different results from the unrolled version of the same loop, even on the same local computer.

Unrolled for loops give the same result on different local computers.

This error is most visible for double precision, but is also present for float precision.

I have made a Unity project displaying the error and made it available on GitHub: https://github.com/JonArntK/Debugging-for-loops . All it does is to compute a set of values and display them to the screen next to a set of values computed on my laptop (the author of the script). My values are hard-coded into the script, and will hence never change. Below is an image showing the result when I ran the program on a different machine (not my laptop). On my laptop, the two columns are exactly alike.

To limit the possible sources of error, the example program is made using only floats (i.e. not doubles).

As an additional note, I am wondering how to proceed with this thread, seeing as it has changed a bit in topic from the original post (due to good feedback and continuous debugging). Should I update the original question, should I continue like this, or should a new thread be made entirely?

Bunny83 · Apr 16, 2022

Well, since this is still the same issue I don't see a point in creating a new thread.

As @Neto_Kokku pointed out, your unrolled loop may go through a more sophisticated optimisation routine and completely get rid of any calculations if it can statically determine the results at compile time. Another thing I quickly noticed about your test code is this line:

Code (CSharp):

_CBNotUnrolled[i] = ((k + 1e-3) * theta - (k - 1e-3) * theta) / 2e-3;

From a pure mathematical point of view this is just

Code (CSharp):

_CBNotUnrolled[i] = theta;

Code (CSharp):

((k + 1e-3) * theta - (k - 1e-3) * theta) / 2e-3

((k + 1e-3 - k + 1e-3) * theta) / 2e-3

((1e-3 + 1e-3) * theta) / 2e-3

((2e-3) * theta) / 2e-3

theta

So that's pretty pointless. If one is resolved at compile time and the other results in the GPU doing actual calculations, of course you would get different results.
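The expression is only equal to theta in exact arithmetic. In floating point, k + 1e-3 and k - 1e-3 are each rounded, and the small difference of two nearly equal products is then divided by 2e-3, which magnifies the rounding error. The sketch below emulates single precision by rounding after every operation; it uses plain multiply and subtract, so it is not bit-identical to a GPU that fuses them into a mad instruction. The helper names `to_f32`, `central_diff_f32` and `central_diff_f64` are made up for this illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def central_diff_f32(k, theta, d=1e-3):
    """((k + d) * theta - (k - d) * theta) / 2e-3 with binary32 rounding per step."""
    k, theta, d = to_f32(k), to_f32(theta), to_f32(d)
    plus = to_f32(to_f32(k + d) * theta)
    minus = to_f32(to_f32(k - d) * theta)
    return to_f32(to_f32(plus - minus) / to_f32(2e-3))


def central_diff_f64(k, theta, d=1e-3):
    """The same expression evaluated entirely in binary64."""
    return ((k + d) * theta - (k - d) * theta) / 2e-3


if __name__ == "__main__":
    k, theta = 1.0, -0.161
    print("exact algebra :", theta)
    print("binary64      :", repr(central_diff_f64(k, theta)))
    print("binary32      :", repr(central_diff_f32(k, theta)))
```

A compiler that simplifies or pre-computes part of this expression, or that orders or fuses the operations differently, changes which roundings take place, so two correct compilations can legitimately print different low-order digits.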

If you want to create proper test cases, the arguments should not be constants within the code. Have you tried reading the values in from globals so it can not be optimised away by the compiler?

Currently I'm not into game dev at all and I don't have much time to play around, though it would be interesting if we could actually pinpoint the issue.

JonArnt · Apr 16, 2022

@Bunny83 You are indeed correct that the mathematical representation is just theta, which I have completely missed. However, it is not immediately clear to me that this must result in a different answer, as I believed compile-time computations were under the same restrictions as run-time. That is, of course, unless the compiler is able to perform the simplification you made without doing any computations.

Following your recommendation of reading in the value, there is still a minor difference between the unrolled and regular loop.

Code (CSharp):

k = 1.0f;

.

.

CS.SetFloat("_k", k);

In the right column, row 0 for not unrolled and unrolled shows that there is a difference, even when 'k' is not a constant.

Unlike the case above, there is no such issue for

Code (CSharp):

_CBNotUnrolled[i] = f(fnh, h, theta);

which can not be similarly optimized at compile time (I presume).

In this updated code (referenced on GitHub), all doubles have also been replaced with floats, indicating that the issue is not due to type conversion. I also find it strange that a change of hardware should give such deviations.

Neto_Kokku · Apr 16, 2022


Check the compiled shader to make sure it's not pre-calculating stuff. You don't need to master reading ASM to be able to do it, since hard coded floats/doubles are pretty easy to spot. It would also be easier to compare if you have the rolled/unrolled in different kernels, instead of in the same kernel, so they don't risk interfering with the compilation of each other and make it easier to see only the compiled code for each one.

JonArnt · Apr 16, 2022


For the results provided in my previous comment, this is the compiled code alongside my interpretation of it. In summary, I interpret it such that both versions perform the same operations. The computed results are: Unrolled = -0.1609998 and Not unrolled = -0.161. These results assume that they cannot be altered by the C# script which fetches them from the compute shader and displays them in the Unity console log.

For both codes, the other parts not relevant were commented away, as suggested by @Neto_Kokku.

Code (CSharp):

dcl_constantbuffer CB0[1], immediateIndexed // Declares a shader constant buffer that is indexed with a literal value. (assume this is '_k' incoming)
.
.
 0: ine r0.x, vThreadID.x, l(1) // Is 'vThreadID.x' not equal to 1? Store result in r0.x
 1: if_nz r0.x // (if r0.x is nonzero, i.e. threadID.x != 1)
 2:   ret // then return
 3: endif // (end if)
 4: add r0.xy, cb0[0].xxxx, l(0.001000, -0.001000, 0.000000, 0.000000) // Component-wise add of two vectors. (_k + 0.001 is stored in r0.x, _k - 0.001 is stored in r0.y)
 5: mov r0.zw, l(0,0,-0.161000,0) // Component-wise move. (theta (= -0.161) is being stored in r0.z, loop 'i' being stored in r0.w)
 6: loop // loop
 7:   ige r1.x, r0.w, l(4) // Component-wise vector integer greater-than-or-equal comparison. (is r0.w greater-than-or-equal to 4? Store result in r1.x)
 8:   breakc_nz r1.x // Break if any bit in r1.x is nonzero (i.e., if loop is over)
 9:   ilt r1.x, r0.w, l(2) // Is r0.w less than 2? Store result in r1.x
10:   if_nz r1.x // (if r1.x is nonzero, i.e. r0.w < 2)
11:     mul r1.x, r0.z, r0.y // Component-wise multiply. r1.x = r0.z * r0.y (i.e., r1.x = theta * (_k - 0.001))
12:     mad r1.x, r0.x, r0.z, -r1.x // Component-wise multiply & add. r1.x = r0.x * r0.z + -r1.x (i.e., r1.x = (_k + 0.001) * theta - r1.x)
13:     mul r1.x, r1.x, l(499.999969) // r1.x = r1.x * 500 (to float precision) (this is equivalent to the '/ 2e-3' in my code)
14:     store_structured u0.x, r0.w, l(0), r1.x // Store r1.x
15:   endif
16:   add r0.z, r0.z, l(0.010000) // Add dtheta
17:   iadd r0.w, r0.w, l(1) // i++
18: endloop // end loop, do it all over again
19: ret

Code (CSharp):

dcl_constantbuffer CB0[1], immediateIndexed // Similar to above
.
.
 0: ine r0.x, vThreadID.x, l(1) // Similar
 1: if_nz r0.x // Similar
 2:   ret // Similar
 3: endif // Similar
 4: add r0.xyzw, cb0[0].xxxx, l(0.001000, -0.001000, 0.001000, -0.001000) // r0.x = _k + 0.001 | r0.y = _k - 0.001 | r0.z = _k + 0.001 | r0.w = _k - 0.001
 5: mul r0.yw, r0.yyyw, l(0.000000, -0.161000, 0.000000, -0.151000) // A bit unsure of the details, but assume it stores 'theta * (_k - 0.001)' for both loop iterations (dtheta = 0.01, which is why we have theta = -0.151 for iteration 2).
 6: mad r0.xy, r0.xzxx, l(-0.161000, -0.151000, 0.000000, 0.000000), -r0.ywyy // Computes 'theta * (_k + 0.001)' for both loop iterations and subtracts 'theta * (_k - 0.001)', which was computed on the previous line.
 7: mul r0.xy, r0.xyxx, l(499.999969, 499.999969, 0.000000, 0.000000) // Multiplies by 500 (float precision) for both loop iterations.
 8: store_structured u0.x, l(0), l(0), r0.x // Stores loop 1 result
 9: store_structured u0.x, l(1), l(0), r0.y // Stores loop 2 result
10: ret
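The constant 499.999969 in both listings is consistent with the compiler replacing the division by 2e-3 with a multiplication by its single-precision reciprocal. That reading is an inference from the number itself, not something stated by the compiler. The sketch below checks it by rounding to binary32 with the Python standard library; `to_f32` and `reciprocal_f32` are hypothetical helper names.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def reciprocal_f32(x):
    """Reciprocal of x after storing x, and then the result, in binary32."""
    return to_f32(1.0 / to_f32(x))


if __name__ == "__main__":
    recip = reciprocal_f32(2e-3)
    print(f"binary32(2e-3)     = {to_f32(2e-3)!r}")
    print(f"binary32(1 / 2e-3) = {recip:.6f}")
```

Since 2e-3 is not exactly representable, its stored value is slightly above 0.002 and the rounded reciprocal lands one float step below 500. Multiplying by this constant is therefore not bit-identical to dividing by 2e-3, which is one more place where low-order digits can shift without either loop version being wrong.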
