Search Unity

  1. Welcome to the Unity Forums! Please take the time to read our Code of Conduct to familiarize yourself with the forum rules and how to post constructively.
  2. Dismiss Notice

Bug Conditional Order Issues with HLSL to Metal/OpenGLES compute shader compiler

Discussion in 'Shaders' started by Cery_, Apr 27, 2022.

  1. Cery_

    Cery_

    Joined:
    Aug 17, 2012
    Posts:
    47
    While porting some OpenCL kernels to HLSL and implementing a computer vision pipeline (50+ kernels) I've come across a few compiler issues in the steps from HLSL to Metal and OpenGL.
    Theses issues result in different behaviour on different GPUs. Whilst some work (OpenGL + Adreno) and some don't work without throwing an error (OpenGL + Mali), some don't work and report errors (Metal + iOS/M1). Since I don't know how to best create reproducible minimal cases to report these issues I am posting here for bug report advice and clarification.

    The Problem:

    During the compilation from HLSL to the device specific frameworks (Metal + OpenGLES) some if clauses are rewritten in a way that can lead to out of bounds memory access.

    Example cases:

    (Out of bounds memory access is marked in the code)

    1. FXC compiler OpenGLES


    HLSL Code:
    Code (CSharp):
    1. [numthreads(8, 8, 1)]
    2. void labelxPreprocess_int_int(uint3 id : SV_DispatchThreadID) {
    3.   const int x = id.x, y = id.y;
    4.  
    5.   if (x > 5 && buf1[x] == buf1[x-5])
    6.   {
    7.     buf0[x] = 42;
    8.     return;
    9.   }
    10. }

    Compiled OpenGLES Code:
    Code (CSharp):
    1.  
    2. void main()
    3. {
    4.     u_xlatb0 = 5<int(gl_GlobalInvocationID.x);
    5.     u_xlati1 = int(buf1_buf[gl_GlobalInvocationID.x].value[(0 >> 2) + 0]);
    6.     u_xlati2 = int(gl_GlobalInvocationID.x) + int(0xFFFFFFFBu);
    7.     u_xlati2 = int(buf1_buf[u_xlati2].value[(0 >> 2) + 0]);       // Out of Bounds Access
    8.     u_xlatb1 = u_xlati2==u_xlati1;
    9.     u_xlatb0 = u_xlatb1 && u_xlatb0;
    10.     if(u_xlatb0){
    11.         buf0_buf[gl_GlobalInvocationID.x].value[(0 >> 2)] = 42;
    12.         return;
    13.     }
    14.     return;
    15. }
    16.  

    Workaround:
    Code (CSharp):
    1.  
    2. if (x > 5)
    3. {
    4.   if(buf1[x] == buf1[x-5])
    5.   {
    6.     buf0[x] = 42; return;
    7.   }
    8. }
    9.  
    10. // Results in:
    11. u_xlatb0 = 5<int(gl_GlobalInvocationID.x);
    12. if(u_xlatb0){
    13.     u_xlati0 = int(buf1_buf[gl_GlobalInvocationID.x].value[(0 >> 2) + 0]);
    14.     u_xlati1 = int(gl_GlobalInvocationID.x) + int(0xFFFFFFFBu);
    15.     u_xlati1 = int(buf1_buf[u_xlati1].value[(0 >> 2) + 0]);
    16.     u_xlatb0 = u_xlati1==u_xlati0;
    17.     if(u_xlatb0){
    18.         buf0_buf[gl_GlobalInvocationID.x].value[(0 >> 2)] = 42;
    19.         return;
    20.     }
    21. }
    22.  

    What is happening (from my understanding):
    The conditions in the if clause should be evaluated in sequence. For (a && b) the value of b should only be checked if a is true. Due to the reconfiguration in the compiler both the value of a and b are first evaluated and stored in a bool and then the clause a && b is solved. Since I am checking the bounds for the memory access in a i don't want to execute the memory access in b if out of bounds.

    Note about FXC vs DXC:
    If I switch to DXC with the same example the problem is solved since DXC is reordering differently. The resulting code in DXC is as following:

    Code (CSharp):
    1.  
    2. int _25 = int(gl_GlobalInvocationID.x);
    3. uint _27 = uint(_25);
    4. if ((_25 > 5) && (buf1._m0[_27] == buf1._m0[uint(_25 - 5)]))
    5. {
    6.     buf0._m0[_27] = 42;
    7.     break;
    8. }
    9.  


    2. DXC ternary issues

    HLSL code:
    Code (CSharp):
    1.  
    2. [numthreads( THREADS_PER_GROUP, 1, 1 )]
    3. void ScanBucketResult( uint DTid : SV_DispatchThreadID, uint GI: SV_GroupIndex )
    4. {
    5.     uint x = off + DTid > 0 ? _Input[off + DTid*THREADS_PER_GROUP - 1] : 0;
    6.     bucket[GI].x = x;
    7.     GroupMemoryBarrierWithGroupSync();
    8.     _Result[DTid] = bucket[0].y + x;
    9. }
    10.  

    Metal compiled:
    Code (CSharp):
    1.  
    2. kernel void computeMain(constant type_Globals& _Globals [[buffer(0)]], const device type_StructuredBuffer_uint& _Input [[buffer(1)]], device type_RWStructuredBuffer_uint& _Result [[buffer(2)]], uint3 gl_GlobalInvocationID [[thread_position_in_grid]], uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]])
    3. {
    4.     threadgroup uint2 bucket[512];
    5.     uint _44 = uint(_Globals.off);
    6.     uint _51 = _Input._m0[(_44 + (gl_GlobalInvocationID.x * 512u)) - 1u];     // Out of bounds access
    7.     uint _52 = ((_44 + gl_GlobalInvocationID.x) > 0u) ? _51 : 0u;
    8.     ((threadgroup uint*)&bucket[gl_LocalInvocationIndex])[0] = _52;
    9.     threadgroup_barrier(mem_flags::mem_threadgroup);
    10.     _Result._m0[gl_GlobalInvocationID.x] = ((threadgroup uint*)&bucket[0])[1] + _52;
    11. }
    12.  

    Metal compiled without barrier:
    Code (CSharp):
    1.  
    2. kernel void computeMain(constant type_Globals& _Globals [[buffer(0)]], const device type_StructuredBuffer_uint& _Input [[buffer(1)]], device type_RWStructuredBuffer_uint& _Result [[buffer(2)]], uint3 gl_GlobalInvocationID [[thread_position_in_grid]], uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]])
    3. {
    4.     threadgroup uint2 bucket[512];
    5.     uint _42 = uint(_Globals.off);
    6.     uint _50 = ((_42 + gl_GlobalInvocationID.x) > 0u) ? _Input._m0[(_42 + (gl_GlobalInvocationID.x * 512u)) - 1u] : 0u;
    7.     ((threadgroup uint*)&bucket[gl_LocalInvocationIndex])[0] = _50;
    8.     _Result._m0[gl_GlobalInvocationID.x] = ((threadgroup uint*)&bucket[0])[1] + _50;
    9. }
    10.  

    Workaround:
    Code (CSharp):
    1.  
    2. uint x = 0;
    3. if(off + DTid > 0)
    4. {
    5.     x = _Input[off + DTid*THREADS_PER_GROUP - 1];
    6. }
    7. bucket[GI].x = x;
    8. GroupMemoryBarrierWithGroupSync();
    9. _Result[DTid] = bucket[0].y + x;
    10.  
    11. // Results in:
    12. threadgroup uint2 bucket[512];
    13. uint _44 = uint(_Globals.off);
    14. uint _54;
    15. if ((_44 + gl_GlobalInvocationID.x) > 0u)
    16. {
    17.     _54 = _Input._m0[(_44 + (gl_GlobalInvocationID.x * 512u)) - 1u];
    18. }
    19. else
    20. {
    21.     _54 = 0u;
    22. }
    23. ((threadgroup uint*)&bucket[gl_LocalInvocationIndex])[0] = _54;
    24. threadgroup_barrier(mem_flags::mem_threadgroup);
    25. _Result._m0[gl_GlobalInvocationID.x] = ((threadgroup uint*)&bucket[0])[1] + _54;
    26.  

    What is happening:
    In some cases the ternary is not kept and instead rewritten into two temporary variables and a following ternary. Like the case before this results in an out of bounds memory access since memory is requested even thought it should never be reached due to the condition.

    This case is very strange since it is somehow connected to the memory barrier. Without the barrier the ternary is kept inline and we get no invalid memory access.

    With FXC the order is even more messed up. But additionally the workaround does not work here.

    Code (CSharp):
    1.  
    2. uint x = 0;
    3. if(off + DTid > 0)
    4. {
    5.     x = _Input[off + DTid*THREADS_PER_GROUP - 1];
    6. }
    7. bucket[GI].x = x;
    8. bucket[GI].y = 0;
    9. GroupMemoryBarrierWithGroupSync();
    10. _Result[DTid] = bucket[0].y + x;
    11.  
    12. // Results in:
    13. void main()
    14. {
    15.     TGSM0[gl_LocalInvocationIndex].value[(4 >> 2)] = 0u;
    16.     memoryBarrierShared();
    17.     barrier();
    18.     u_xlati0 = int(TGSM0[0].value[(4 >> 2) + 0]);
    19.     u_xlati1 = int(gl_GlobalInvocationID.x) << 9;
    20.     u_xlati1 = u_xlati1 + off;
    21.     u_xlati1 = u_xlati1 + int(0xFFFFFFFFu);
    22.     u_xlati1 = int(_Input_buf[u_xlati1].value[(0 >> 2) + 0]);     // Out of bounds memory access
    23.     u_xlatu2 = gl_GlobalInvocationID.x + uint(off);
    24.     u_xlatb2 = 0u<u_xlatu2;
    25.     u_xlati1 = u_xlatb2 ? u_xlati1 : int(0);
    26.     u_xlati0 = u_xlati1 + u_xlati0;
    27.     _Result_buf[gl_GlobalInvocationID.x].value[(0 >> 2)] = uint(u_xlati0);
    28.     return;
    29. }
    30.  

    DXC seems to fix a lot of the issues happening with FXC but there are a few places where ternary operations are messed up.
    Another situation:
    Code (csharp):
    1.  
    2. // invalid
    3. uint x = (DTid + off) == 0 || (DTid + off) > size ? 0 : _Input[ (DTid + off)-1 ];
    4.  
    5. // valid
    6. uint x = 0;
    7. if(DTid + off > 0 && DTid + off < size)
    8. {
    9.     x = _Input[ (DTid + off)-1 ];
    10. }
    11.  

    Devices + Configurations:

    Tested mostly with unity 2021.2.14f.
    I have tested a few device configurations. Different systems seem to handle the situation differently. Here is my take what is happening:

    OpenGLES + Adreno GPU (Samsung Tab s5e Adreno 615, Pixel 5 Adreno 620)
    It runs successful without errors and with the correct results. I am suspecting that the Adreno doesn't perform any non vital out of bounds memory checks thus just computes all kernels. Since the oob access produces no critical results since the if clauses are still performed afterwards we get a correct result.

    OpenGLES + Mali GPU (Samsung Galaxy S21 exynos/Mali)
    I couldn't perform these tests so i did not do a lot of digging. I just know it doesn't produce the correct result and doesn't throw any errors.

    Metal + M1 (Ipad Pro 2019)
    With metal api validation and with shader validation i was getting errors to debug (IOAF code 11 + code 4).

    On metal I was able to track down the examples provided above. I am suspecting Metal acts similar to Mali GPUs in the case of an oob memory access. It cancels the critical command buffers and then continues execution - resulting in wrong values.

    With DXC and with the workarounds noted above I was able to get it running on Metal. I am guessing I can fix the Mali issues with the workarounds I have mentioned but chained if statements are quite common in the 50+ kernels and would be a real pain to fix...

    Additional thoughts:
    With the reordering performed it seems obviously bugged. But on second glance it might not be that simple. In most cases the next compiler taking the OpenGLES or Metal shader code could in theory be able to regain correct execution by optimizing access. Operations like the problematic memory access might be skipped. Similarly the original reordering might be the result of an optimisation step. I have not enough knowledge about the pipeline and I am pretty confused by all the steps (Unity->HLSL->Metal->Shader Assembly->Runtime...?!)

    What to report?
    As the behaviour is quite complex and different hardware throws different errors or no errors at all it is hard to construct a simple repro project. On Mali GPUs for example it just looks like it did work but produces no useful output. So now the question what should I report and how?
     
  2. aleksandrk

    aleksandrk

    Unity Technologies

    Joined:
    Jul 3, 2017
    Posts:
    2,845
    Hi!
    FXC: The reordering is performed inside FXC, so we cannot do anything about it. It's a bug in the compiler itself.
    DXC: it would be great if you made a bug report :)
    Thanks for such a detailed post!
     
  3. Cery_

    Cery_

    Joined:
    Aug 17, 2012
    Posts:
    47
    Thanks for the info!

    I've created the DXC Bugreport (Case 1424847):
    https://issuetracker.unity3d.com/is...orted-error-is-thrown-when-using-dxc-compiler

    Regarding FXC:
    Thats too bad, I guess I will need to implement the workaround I mentioned.
    (I am guessing DXC OpenGL support is still quite along ways away?)
    Interestingly it might not even be considered a bug. "Logical Operator Short Circuiting" seems to be missing from pre 2021 HLSL. In the DXC announcement thread Przemyslaw_Zaworski linked the HLSL 2021 announcement:
    https://devblogs.microsoft.com/directx/announcing-hlsl-2021/
    There "Logical Operator Short Circuiting" is introduced as a new feature.

    Here an excerpt from the HLSL docs:
    "Unlike short-circuit evaluation of &&, ||, and ?: in C, HLSL expressions never short-circuit an evaluation because they are vector operations. All sides of the expression are always evaluated."
    https://docs.microsoft.com/en-us/wi...rs?redirectedfrom=MSDN#Boolean_Math_Operators

    I am so used to this behaviour of compilers that it didn't come to my mind that it might not be a feature of every compiler.
     
    Last edited: May 9, 2022
  4. aleksandrk

    aleksandrk

    Unity Technologies

    Joined:
    Jul 3, 2017
    Posts:
    2,845
    DXC support is still experimental. It's being actively worked on, but it takes quite a while to upgrade a compiler stack :D
    Thank you for the bug report!