# Bug Why are regular for loops in HLSL compute shaders not consistent?

Discussion in 'Scripting' started by JonArnt, Apr 12, 2022.

1. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
Note! This question was originally posted as "Help wanted". However, following feedback and continuous debugging, it has been changed to "Bug", as I now find that to be the most likely explanation for my results.

Summarized question
Why does (not) unrolling a loop affect the accuracy of the computations performed within it? And why are the computations performed inside a regular loop dependent on the local machine they run on?

Elaboration and background
I am writing a compute shader in HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.

When comparing my results with an equivalent procedure in Python, I noticed deviations on the order of 1e-5. This was concerning, as I did not expect such large errors to result from precision differences, e.g., the float precision of trigonometric or power functions in HLSL.

Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I do find this strange, as I cannot find any sources indicating that unrolling a loop affects accuracy in addition to the usual "space–time tradeoff".

For clarification: if my Python results are considered the correct solution, unrolling the loop in HLSL gives better results than not unrolling it.

Minimal working example
Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive me for a somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might be a cause of this deviation. That is, if simply computing cos(x), there is no difference between the unrolled and not unrolled versions. Nonetheless, I still fail to understand how the simple addition of [unroll(N)] in the testing kernel changes the result...
Code (CSharp):
// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{
    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {
        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);

        // Ceiling division: one thread group of 64 threads per 64 elements.
        CS.Dispatch(0, (N + 63) / 64, 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];

        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] +
                "  -  Not Unrolled ans = " + ansNotUnrolled[i] +
                "  --  Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }
        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}
Code (CSharp):
#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method.
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of the dummy function above, using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve, using Newton's method.
double f(double fnh, double h, double theta)
{
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Initial guess for k, hereby denoted x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);

        if (abs(fxn) < epsilon)     // A solution is found.
            return xn;

        dfxn = dfDummy(xn, fnh, h, theta);

        if (dfxn == 0.0)    // Derivative vanished; no solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }

    // Maximum iterations reached; no solution found.
    return xn;
}

// Kernel declaration (restored from context; these two lines were missing from the original post).
[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;

    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;   // Example values.

    for (int i = 0; i < N; i++)                 // Not being unrolled.
    {
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }

    // ---------------
    fnh = 0.9; h = 4.53052; theta = -0.161; dtheta = 0.01;          // Reset example values.

    [unroll(N)] for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}
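For anyone wanting to reproduce the comparison on the CPU, here is a rough Python sketch of the same Newton iteration run once in float64 and once in float32. This is my own reconstruction for illustration (the author's actual Python reference is not shown in the thread); it demonstrates how sensitive the converged root is to the working precision of cos/tanh:

```python
import numpy as np

def f_dummy(k, fnh, h, theta):
    # Same function as the shader's fDummy.
    return fnh * fnh * k * h * np.cos(theta) ** 2 - np.tanh(k * h)

def df_dummy(k, fnh, h, theta, eps):
    # Central finite difference, as in the shader's dfDummy.
    return (f_dummy(k + eps, fnh, h, theta) - f_dummy(k - eps, fnh, h, theta)) / (2 * eps)

def solve(dtype):
    # Run the whole Newton iteration in the given floating-point type.
    fnh = dtype(0.9); h = dtype(4.53052); theta = dtype(-0.161)
    eps = dtype(1e-3)
    xn = dtype(10.0)
    for _ in range(50):
        fxn = f_dummy(xn, fnh, h, theta)
        if abs(fxn) < 1e-8:
            return xn
        dfxn = df_dummy(xn, fnh, h, theta, eps)
        if dfxn == 0.0:
            return xn
        xn = xn - fxn / dfxn
    return xn

x64 = solve(np.float64)
x32 = solve(np.float32)
print(x64, x32, abs(float(x64) - float(x32)))
```

The float32 run typically cannot resolve the 1e-8 tolerance and stops at the precision floor of single-precision arithmetic, much like a GPU path that silently drops to 32-bit trigonometric instructions.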

Edit
After some more testing, the deviation seems to be connected to the function dfDummy (seen in the compute shader above), which is a central difference of fDummy. The script below shows the exact same code unrolled and not unrolled, giving an error on the order of 1e-12.

Code (CSharp):
[numthreads(64, 1, 1)]  // Kernel declaration restored from context; it was omitted in the original post.
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;

    // ---------------  Not being unrolled.
    double theta = -0.161, dtheta = 0.01;   // Example values.
    for (int i = 0; i < N; i++)
    {
        _CBNotUnrolled[i] = dfDummy(10.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }

    // ---------------  Being unrolled.
    theta = -0.161; dtheta = 0.01;          // Reset example values.
    [unroll(N)] for (int j = 0; j < N; j++)
    {
        _CBUnrolled[j] = dfDummy(10.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }
}
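The 1e-12 scale fits a back-of-the-envelope estimate: the central difference divides by 2e-3, so any last-bit difference in how fDummy is evaluated (for example, from a different instruction ordering after unrolling) is amplified by a factor of 500. A small Python illustration of this amplification (my own, not code from the thread), perturbing one operand by a single ulp:

```python
import numpy as np

def f_dummy(k, fnh, h, theta):
    # Same function as the shader's fDummy.
    return fnh * fnh * k * h * np.cos(theta) ** 2 - np.tanh(k * h)

k, fnh, h, theta = 10.0, 0.9, 4.53052, -0.161
eps = 1e-3

# Central difference exactly as in the shader.
df = (f_dummy(k + eps, fnh, h, theta) - f_dummy(k - eps, fnh, h, theta)) / (2 * eps)

# Perturb the left evaluation by one unit in the last place and recompute:
# the perturbation is amplified by roughly 1 / (2 * eps) = 500.
f_hi = f_dummy(k + eps, fnh, h, theta)
ulp = np.spacing(f_hi)  # one ulp of f_hi
df_perturbed = (f_hi + ulp - f_dummy(k - eps, fnh, h, theta)) / (2 * eps)
print(abs(df - df_perturbed))  # ~ ulp / (2 * eps), i.e. on the order of 1e-12 here
```

So a rounding difference far below double precision's visible digits in fDummy is enough to explain 1e-12 deviations in dfDummy.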

Last edited: Apr 16, 2022
2. ### Kurt-Dekker

Joined:
Mar 16, 2013
Posts:
25,836
I haven't messed with this toolchain much, but is there a way you can see the produced output instructions, kind of like with
``gcc``
how you can give it the
``-S``
argument to see the resulting assembly language?

This might reveal a bug in the unroller where it (for instance) passes your constants as
``float``
instead of
``double``
, or any other differences.

3. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
I must admit that my experience with assembly language is very limited, but here is what I have so far.

First, some minor changes were made to the main script to avoid warnings concerning the double precision of tanh(x) for large x, and a race condition when writing to a shared resource. The updated code is as follows, still giving a deviation on the order of 1e-13.
Code (CSharp):
[numthreads(64, 1, 1)]  // Kernel declaration restored from context; it was omitted in the original post.
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;

    int N = 3;

    // ---------------  Not being unrolled.
    double theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++)
    {
        _CBNotUnrolled[i] = dfDummy(1.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }

    // ---------------  Being unrolled.
    theta = -0.161; dtheta = 0.01; // Reset example values.
    [unroll(N)]
    for (int j = 0; j < N; j++)
    {
        _CBUnrolled[j] = dfDummy(1.0, 0.9, 4.53052, theta);
        theta += dtheta;
    }
}
The compiled code for the above, found in the Inspector when selecting the compute shader, is as follows. Please correct me if this is not the output you were thinking of.
Code (CSharp):
**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 820:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
//       Double-precision floating point
//       Double-precision extensions for 11.1
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
      cs_5_0
      dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps | enable11_1DoubleExtensions
      dcl_uav_structured u0, 8
      dcl_uav_structured u1, 8
      dcl_temps 3
      dcl_thread_group 64, 1, 1
   0: ine r0.x, vThreadID.x, l(1)
   1: if_nz r0.x
   2:   ret
   3: endif
   4: dmov r0.xy, d(-0.161000l, 0.000000l)
   5: mov r0.z, l(0)
   6: loop
   7:   ige r0.w, r0.z, l(3)
   8:   breakc_nz r0.w
   9:   dtof r0.w, r0.xyxy
  10:   sincos null, r0.w, r0.w
  11:   ftod r1.xy, r0.w
  12:   dmul r2.xyzw, r1.xyxy, d(3.673391l, 3.666051l)
  13:   dmul r1.xyzw, r1.xyxy, r2.xyzw
  14:   dadd r1.xyzw, r1.xyzw, d(-0.999770l, -0.999766l)
  15:   dadd r1.xy, -r1.zwzw, r1.xyxy
  16:   ddiv r1.xy, r1.xyxy, d(0.002000l, 0.000000l)
  17:   store_structured u1.xy, r0.z, l(0), r1.xyxx
  18:   dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
  19:   iadd r0.z, r0.z, l(1)
  20: endloop
  21: store_structured u0.xy, l(0), l(0), l(-65766687071412712330231808.000000,2.196662,0,0)
  22: store_structured u0.xy, l(1), l(0), l(-1190969931871505988190208.000000,2.198071,0,0)
  23: store_structured u0.xy, l(2), l(0), l(0.000001,2.199391,0,0)
  24: ret
// Approximately 0 instruction slots used
Edit
When you mentioned looking for a possible error in the unroller, you made me curious about a statement of mine in the original post, where I said that if my Python code were considered the true solution, then the unrolled loop gives better results than the regular loop. I have now gone through these numbers again, which (assuming no implementation errors of my own) verifies this claim.

Personally, my hypothesis is that Python (and hence also the unrolled loop in HLSL, with some deviations presumably due to float precision limitations in cos(x) and tanh(x)) is indeed correct, which I base on the fact that my main project in Python gives better results than my Unity project does. I have not been able to test the Unity project with unrolled loops, as that only results in a compiler timeout.

Last edited: Apr 12, 2022
4. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
After reaching out to Microsoft (see https://docs.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem is more on Unity's side. This is because, quoting their reply: "The pragma
``unroll [(n)]``
is keil compiler which Unity uses topic."

Does this help limit the scope of the problem, and does anyone have a suggestion for where I should look when continuing my debugging?

Last edited: Apr 15, 2022
5. ### Bunny83

Joined:
Oct 18, 2010
Posts:
1,871
While I haven't ever used the unroll attribute in a shader, I wouldn't trust that statement, for several reasons:
1. unroll is an official HLSL attribute, not a pragma.
2. I wasn't even aware of the Keil compiler. It's a C compiler for embedded environments based on the ARM architecture, so maybe Unity / Android Studio uses it when targeting Android. Since you're testing inside the editor, it's very unlikely (though I can't really confirm) that this has anything to do with the Keil compiler.
To me that sounds like an attempted excuse not to look further into the case, though that's just my opinion.

It's hard to tell what may cause those issues. Since you use doubles everywhere, this could be part of the problem. GPUs are optimised for single-precision (32-bit) float datatypes. Unity does not even mention double in its documentation of shader data types. Support for the double type is essentially an "extension", as the comments in the disassembler listing even state:
Code (CSharp):
// Note: shader requires additional functionality:
//       Double-precision floating point
//       Double-precision extensions for 11.1

Since your margin of error is roughly in the range of 32-bit float precision, I suspect there may be a conversion to 32-bit floats somewhere. This "may" be the result of faulty unrolling code, but that is pure guesswork...
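This guess can be sized up quickly: pushing a double through a 32-bit float and back loses everything beyond about 7 significant digits. A minimal NumPy check (my own illustration, not code from the thread):

```python
import numpy as np

theta = np.float64(-0.161)
round_trip = np.float64(np.float32(theta))  # double -> float -> double
err = abs(theta - round_trip)
print(err)  # at most half an ulp of float32 at this magnitude, i.e. below 1e-8
```

An error of this size, further amplified by the factor 500 hidden in the central difference's division by 2e-3, lands close to the observed 1e-5 scale.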

6. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
I have now reproduced my issue (using the exact code given in the initial question) on a different computer (with the same Unity version), and the deviation is present but not the same. What was an error on the order of 1e-5 on my laptop is an error on the order of 1e-8 there. To me, this suggests that the issue is not connected to Unity, although I cannot be certain.

Nonetheless, seeing as this issue does affect Unity projects when encountered, I will continue updating this post as long as I make progress. I am also grateful for previous and new thoughts on where the problem might lie, as this is moving further and further away from my computing knowledge and experience.

7. ### Neto_Kokku

Joined:
Feb 15, 2018
Posts:
1,275
Looking at the ASM output, the unrolled version isn't actually performing any calculation: the shader compiler figured out you are using only constant values in your unrolled loop, calculated everything at compilation time, and wrote the results directly in the shader as constants.

Look at the three
``store_structured``
instructions at the end of the compiled shader: you can see the computed values right there. The shader compiler probably used actual double precision trigonometric functions while resolving the values, while the not-unrolled loop had to use the single-precision cos/tanh functions.
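That hypothesis is easy to mimic on the CPU (my own illustration, not code from the thread): evaluate cos once in full double precision, as a constant-folding compiler could, and once routed through single precision, as the dtof/sincos/ftod sequence in the disassembly does:

```python
import numpy as np

theta = -0.161
cos_double = np.cos(np.float64(theta))              # what constant folding in double can produce
cos_single = np.float64(np.cos(np.float32(theta)))  # the dtof -> sincos -> ftod path
diff = abs(cos_double - cos_single)
print(diff)  # a few float32 ulps at most, i.e. around the 1e-8 scale
```

A discrepancy at this level in cos and tanh, fed through the Newton iteration and the 1/(2e-3) central-difference factor, is enough to produce the deviations reported above.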

Bunny83 likes this.
8. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
After making multiple changes to my code, I now believe there is a bug in regular for loops in HLSL compute shaders (though I am not sure whether it is specific to Unity or general).

Summary:
• Regular for loops in HLSL compute shaders give different results depending on the local machine they run on (see image below).
• Regular for loops also give different results from the unrolled version of the same loop, even on the same machine.
• Unrolled for loops give the same result across different machines.
• The error is most visible with double precision, but is also present with float precision.

I have made a Unity project displaying the error and made it available on GitHub: https://github.com/JonArntK/Debugging-for-loops . All it does is compute a set of values and display them on screen next to a set of values computed on my laptop (the author's machine). My values are hard-coded into the script and will hence never change. Below is an image showing the result when I ran the program on a different machine (not my laptop); on my laptop, the two columns are exactly alike.

To limit the possible sources of error, the example program uses only floats (i.e., not doubles).

As an additional note, I am wondering how to proceed with this thread, seeing as it has changed a bit in topic from the original post (due to good feedback and continuous debugging). Should I update the original question, continue like this, or start a new thread entirely?

9. ### Bunny83

Joined:
Oct 18, 2010
Posts:
1,871
Well, since this is still the same issue I don't see a point in creating a new thread.

As @Neto_Kokku pointed out, your unrolled loop may go through a more sophisticated optimisation routine and be stripped of any calculations entirely if the compiler can statically determine the results at compile time. Another thing I quickly noticed about your test code is this line:
Code (CSharp):
_CBNotUnrolled[i] = ((k + 1e-3) * theta - (k - 1e-3) * theta) / 2e-3;
From a pure mathematical point of view this is just

Code (CSharp):
_CBNotUnrolled[i] = theta;
Code (CSharp):
((k + 1e-3) * theta - (k - 1e-3) * theta) / 2e-3
((k + 1e-3 - k + 1e-3) * theta) / 2e-3
((1e-3 + 1e-3) * theta) / 2e-3
((2e-3) * theta) / 2e-3
theta
So that's pretty pointless. If one version is resolved at compile time and the other makes the GPU do the actual calculations, of course you get different results.
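The two forms can indeed disagree in floating point even though they are algebraically identical: the expanded version multiplies theta by (k ± 1e-3), subtracts two nearly equal products (losing low bits to cancellation), and then rescales. A quick single-precision check (my own illustration, using the thread's k = 1):

```python
import numpy as np

k = np.float32(1.0)
theta = np.float32(-0.161)
eps = np.float32(1e-3)

# Expanded form, evaluated step by step in float32 like the shader does.
expanded = ((k + eps) * theta - (k - eps) * theta) / (np.float32(2) * eps)
print(float(expanded), float(theta))  # close, but not necessarily bit-identical
```

Each intermediate product rounds independently, so the evaluated expression is theta only up to a few scaled ulps, not exactly.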

If you want to create proper test cases, the arguments should not be constants in the code. Have you tried reading the values in from globals, so the expression cannot be optimised away by the compiler?

Currently I'm not into game dev at all and I don't have much time to play around. Though it would be interesting if we could actually pinpoint the issue.

JonArnt likes this.
10. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
@Bunny83 You are indeed correct that the mathematical simplification is just theta, which I had completely missed. However, it is not immediately clear to me that this must result in a different answer, as I believed compile-time computations were under the same restrictions as run-time ones. That is, of course, unless the compiler is able to perform the simplification you made without doing any floating-point computations.

Following your recommendation of reading in the value, there is still a minor difference between the unrolled and regular loop.

Code (CSharp):
k = 1.0f;
.
.
CS.SetFloat("_k", k);
In the right column, row 0 for not unrolled and unrolled shows that there is a difference, even when 'k' is not a constant.

Unlike the case above, there is no such issue for
Code (CSharp):
_CBNotUnrolled[i] = f(fnh, h, theta);
which (I presume) cannot be similarly optimized at compile time.

In this updated code (referenced on GitHub), all doubles have also been exchanged for floats, indicating that the issue is not due to type conversion. I also find it strange that a change of hardware gives such deviations.

11. ### Neto_Kokku

Joined:
Feb 15, 2018
Posts:
1,275
Check the compiled shader to make sure it's not pre-calculating stuff. You don't need to master reading ASM to do this, since hard-coded floats/doubles are pretty easy to spot. It would also be easier to compare if you put the rolled/unrolled versions in different kernels instead of the same kernel, so they don't risk interfering with each other's compilation and so it is easier to see the compiled code for each one.

Bunny83 likes this.
12. ### JonArnt

Joined:
Oct 23, 2021
Posts:
7
For the results provided in my previous comment, this is the compiled code alongside my interpretation of it. Summarized, I interpret it such that both versions perform the same actions. The computed results are: Unrolled = -0.1609998 and Not unrolled = -0.161. These results assume that they are not altered by the C# script which fetches them from the compute shader and displays them in the Unity console log.

For both versions, the parts not relevant were commented out, as suggested by @Neto_Kokku.
Code (CSharp):
dcl_constantbuffer CB0[1], immediateIndexed  // Declares a shader constant buffer, indexed with a literal value (assume this is '_k' coming in).
.
.
 0: ine r0.x, vThreadID.x, l(1)              // Is 'vThreadID.x' not equal to 1? Store result in r0.x.
 1: if_nz r0.x                               // (if r0.x != 0)
 2:   ret                                    //     then return
 3: endif                                    // (end if)
 4: add r0.xy, cb0[0].xxxx, l(0.001000, -0.001000, 0.000000, 0.000000)  // Component-wise add: _k + 0.001 stored in r0.x, _k - 0.001 stored in r0.y.
 5: mov r0.zw, l(0,0,-0.161000,0)            // Component-wise move: theta (= -0.161) stored in r0.z, loop counter 'i' stored in r0.w.
 6: loop                                     // loop
 7:   ige r1.x, r0.w, l(4)                   // Integer greater-than-or-equal: is r0.w >= 4? Store result in r1.x.
 8:   breakc_nz r1.x                         // Break if any bit in r1.x is nonzero (i.e., if the loop is over).
 9:   ilt r1.x, r0.w, l(2)                   // Is r0.w less than 2? Store result in r1.x.
10:   if_nz r1.x                             // (if r1.x != 0)
11:     mul r1.x, r0.z, r0.y                 // r1.x = r0.z * r0.y (i.e., r1.x = theta * (_k - 0.001)).
12:     mad r1.x, r0.x, r0.z, -r1.x          // Multiply-add: r1.x = r0.x * r0.z - r1.x (i.e., r1.x = (_k + 0.001) * theta - r1.x).
13:     mul r1.x, r1.x, l(499.999969)        // r1.x = r1.x * 500 (to float precision); equivalent to the '/ 2e-3' in my code.
14:     store_structured u0.x, r0.w, l(0), r1.x  // Store r1.x.
15:   endif
16:   add r0.z, r0.z, l(0.010000)            // Add dtheta.
17:   iadd r0.w, r0.w, l(1)                  // i++
18: endloop                                  // End loop, do it all over again.
19: ret

Code (CSharp):
dcl_constantbuffer CB0[1], immediateIndexed  // Same as above.
.
.
 0: ine r0.x, vThreadID.x, l(1)              // Same as above.
 1: if_nz r0.x
 2:   ret
 3: endif
 4: add r0.xyzw, cb0[0].xxxx, l(0.001000, -0.001000, 0.001000, -0.001000)  // r0.x = _k + 0.001 | r0.y = _k - 0.001 | r0.z = _k + 0.001 | r0.w = _k - 0.001
 5: mul r0.yw, r0.yyyw, l(0.000000, -0.161000, 0.000000, -0.151000)        // A bit unsure of the details, but assume this stores 'theta * (_k - 0.001)' for both loop iterations (dtheta = 0.01, which is why theta = -0.151 in iteration 2).
 6: mad r0.xy, r0.xzxx, l(-0.161000, -0.151000, 0.000000, 0.000000), -r0.ywyy  // Computes 'theta * (_k + 0.001)' for both iterations and subtracts the 'theta * (_k - 0.001)' computed on the previous line.
 7: mul r0.xy, r0.xyxx, l(499.999969, 499.999969, 0.000000, 0.000000)      // Multiplies by 500 (float precision) for both iterations.
 8: store_structured u0.x, l(0), l(0), r0.x  // Stores iteration 1 result.
 9: store_structured u0.x, l(1), l(0), r0.y  // Stores iteration 2 result.
10: ret
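One plausible contributor to the hardware dependence (my speculation, not established in this thread): the `mad` instruction in these listings may be executed by a driver either as a true fused multiply-add (one rounding) or as a separate multiply and add (two roundings), and different GPUs can make different choices. A constructed float32 example in Python (illustrative values, not the shader's) where the two conventions give different answers:

```python
from fractions import Fraction
import numpy as np

a = np.float32(1 + 2 ** -12)     # exactly representable in float32
c = np.float32(-(1 + 2 ** -11))  # exactly representable in float32

# Unfused multiply-add: round a*a to float32, then round the addition.
unfused = a * a + c  # a*a rounds to 1 + 2^-11, so the sum is exactly 0.0

# Fused multiply-add: one rounding at the very end (emulated with exact rationals).
exact = Fraction(float(a)) * Fraction(float(a)) + Fraction(float(c))
fused = np.float32(float(exact))  # 2^-24, about 5.96e-8

print(float(unfused), float(fused))
```

If the `mad` is fused on one machine and not on another, results diverge at exactly this last-bit level, which the multiplication by ~500 (the '/ 2e-3') then magnifies.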
