
Curious behavior in compute shader - Help needed

Discussion in 'Shaders' started by SunnySunshine, Oct 6, 2017.

  1. SunnySunshine

    I'm playing around with a custom raycaster running on a compute shader. Everything has been working fine up until now, when I decided to add normals to the raycast result.

    Here's the portion that's causing the problem:

    Code (CSharp):
    struct Intersection
    {
        float3 intersection;
        float3 normal;
        int triangleIndex;
    };

    AppendStructuredBuffer<Intersection> result;

    [numthreads(1024,1,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        [...] // Preparation etc.

        bool didHit = CustomRaycast(..., out intersection); // This is working fine and has the expected performance.

        if (didHit)
        {
            float3 normal = GetNormal(intersection, vertexIndex0, vertexIndex1, vertexIndex2); // Uses barycentric interpolation to get the intersection point's normal.

            Intersection res;
            res.intersection = intersection;
            res.triangleIndex = id.x;
            res.normal = normal; // Causes extreme slowdown.
            // res.normal = float3(0,0,0); // Uncomment this to undo slowdown, even if lines above are unchanged.

            result.Append(res);
        }
    }
    For whatever reason, when writing into the normal field of the result struct, the framerate drops from 800 fps to 40 fps.

    If I then write to res.normal again with another value, performance goes back up. The previous lines can remain unchanged and the performance still shoots up immensely.

    I have no clue what's going on here. Since the speed goes back up simply by writing into res.normal again, it cannot be GetNormal() that's causing any slowdowns, can it?

    This slowdown does not seem to occur if I only calculate the normal (i.e. not raycasting). Likewise, if only doing raycasting, but no normal calculation, the speed is fine.

    It feels like I'm hitting some kind of threshold, and when surpassing it performance is affected greatly.
     
    Last edited: Oct 6, 2017
  2. Michal_

    It's impossible to tell you exactly where your problem is without seeing and profiling the full source code, but there are three things I can tell you.

    1. Your shader will be heavily optimized during compilation and all unused code will be stripped. If you set res.normal to zero, your normal variable is never used and will be discarded along with the GetNormal function call. That explains why performance goes up when you uncomment "res.normal = float3(0,0,0)".
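
    To make the stripping concrete, here is a minimal sketch of the same effect in isolation. The kernel, the output buffer and ExpensiveNormal are hypothetical stand-ins, not part of the original shader, and the exact behavior depends on the compiler.

```hlsl
#pragma kernel CSMain

// Hypothetical output buffer; it must hold at least as many elements as dispatched threads.
RWStructuredBuffer<float3> output;

// Hypothetical stand-in for an expensive function such as GetNormal().
float3 ExpensiveNormal(uint i)
{
    float3 n = float3(0, 0, 0);
    for (uint k = 0; k < 256; ++k)
    {
        float a = (float)(i + k);
        n += float3(sin(a), cos(a), 1.0);
    }
    return normalize(n);
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float3 normal = ExpensiveNormal(id.x);

    // The result is written out, so the loop above has to be compiled and executed.
    float3 value = normal;

    // If the next line is uncommented, 'normal' no longer affects any output.
    // The compiler is then free to remove the ExpensiveNormal call entirely.
    // value = float3(0, 0, 0);

    output[id.x] = value;
}
```

    In other words, overwriting the value does not make the write cheaper. It most likely removes the computation that produced the value.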

    2. Structured buffers are tightly packed by definition. That means your buffer has a stride of 28 bytes (12 + 12 + 4). Because the stride is not a multiple of 16 bytes, the intersection and normal variables will often span cache lines, and it will be more expensive to read from and write to them.
    Structure size should always be divisible by 16 bytes (the size of a float4). Internal alignment of vector types matters too. Imagine GPU memory as an array of float4s. Your vector type should never span two elements of that array.
    Code (CSharp):
    struct FooWrong // wrong
    {
        float v1;  // bytes 0..3
        float4 v2; // bytes 4..19, wrong internal alignment: spans two float4 slots
        float3 v3; // bytes 20..31
    };

    struct FooOk // ok
    {
        float4 v2; // bytes 0..15
        float3 v3; // bytes 16..27
        float v1;  // bytes 28..31
    };
    So in your case, you should make the intersection variable a float4 instead of a float3 and keep the w component unused. Or add a dummy float variable after it as padding.
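
    Applied to the struct from the question, the float4 variant could look roughly like this. It is a sketch under the assumption that nothing else depends on the old 28-byte layout. The kernel body only writes placeholder values.

```hlsl
#pragma kernel CSMain

// 32-byte stride; every vector starts on a 16-byte boundary.
struct Intersection
{
    float4 intersection;  // bytes  0..15, xyz = hit point, w unused (padding)
    float3 normal;        // bytes 16..27
    int    triangleIndex; // bytes 28..31
};

AppendStructuredBuffer<Intersection> result;

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    Intersection res;
    res.intersection = float4(0, 0, 0, 0); // placeholder hit point, w stays unused
    res.normal = float3(0, 1, 0);          // placeholder normal
    res.triangleIndex = (int)id.x;

    result.Append(res);
}
```

    The stride passed to the ComputeBuffer constructor on the C# side has to change from 28 to 32 to match, and any C# struct used to read the results back needs the same extra float.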

    3. An if-else statement can dramatically reduce performance. The reason lies in the way modern GPUs operate. GPU threads are organized in thread groups (called warps by NVIDIA and wavefronts by AMD). Threads in a thread group run in parallel but they have to execute identical code. That means they can't branch independently. They always have to execute the same instruction at the same time.
    If your code has an if-else statement, all threads in a thread group execute the "if" branch first. If at least one of the threads was supposed to execute the "else" branch instead, all of the threads then execute the "else" branch as well. Each thread keeps the result from the first or second run based on its own condition.
    Here's a simple explanation of thread warps. Google "gpu warp divergence" or "gpu wavefront divergence" for more information.
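
    A minimal sketch of the difference between a divergent and a coherent branch follows. The buffer, PathA, PathB and both kernel names are hypothetical. For branches this small the compiler may well turn the if-else into a simple select, so treat it as an illustration of the pattern rather than a benchmark.

```hlsl
#pragma kernel CSDivergent
#pragma kernel CSCoherent

// Hypothetical buffer; it must hold at least as many elements as dispatched threads.
RWStructuredBuffer<float> data;

// Hypothetical stand-ins for two different code paths.
float PathA(float x) { return x * 2.0 + 1.0; }
float PathB(float x) { return x * x - 1.0; }

// Divergent: even and odd threads sit next to each other in the same warp/wavefront,
// so the hardware has to run both paths, masking off half of the lanes each time.
[numthreads(64, 1, 1)]
void CSDivergent(uint3 id : SV_DispatchThreadID)
{
    float x = data[id.x];
    float y;
    if ((id.x & 1) == 0)
        y = PathA(x);
    else
        y = PathB(x);
    data[id.x] = y;
}

// Coherent: the condition is the same for all 64 threads of a group,
// so every warp/wavefront only has to run one of the two paths.
[numthreads(64, 1, 1)]
void CSCoherent(uint3 id : SV_DispatchThreadID)
{
    float x = data[id.x];
    float y;
    if (((id.x / 64) & 1) == 0)
        y = PathA(x);
    else
        y = PathB(x);
    data[id.x] = y;
}
```

    In the raycaster from the question, the "if (didHit)" block diverges whenever only some of the rays in a warp hit a triangle.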
     
  3. SunnySunshine

    Awesome stuff Michal_!

    I had a feeling something like that optimization was happening, but I had no idea the compilation was so smart and efficient. That's really cool actually!

    By looking at the delta mush project on Bitbucket, I discovered I wasn't skinning the mesh in an optimal fashion. When I started using their method, the frame rate went up considerably. I suppose mul should be used with care. The other tips you gave are something I'm definitely going to look into as well.

    Thanks!