Is it possible to have a multipass Compute Shader?

Discussion in 'Shaders' started by carcasanchez, Apr 16, 2021.

  1. carcasanchez

    Joined:
    Jul 15, 2018
    Posts:
    177
    I have an image to which I need to apply two effects via a Compute Shader: one for grayscaling and another for color grading. The second filter needs the first to be already applied, so I would need two passes in my Compute Shader. Is this even possible? I have been searching the web for information and have found none.
    I know I can make two calls from C# to different kernels, but I would like to spare some CPU-to-GPU operations if I can.
     
  2. bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
  3. BattleAngelAlita

    Joined:
    Nov 20, 2016
    Posts:
    400
    Why not just apply both in one shader?
     
  4. carcasanchez

    As far as I know, this won't work for me. My shader has to iterate through all of the texture's pixels, so even if I force all threads to be synchronized, that doesn't guarantee the whole texture has been processed before the second "pass" is applied (unless I have one thread per pixel, and I would be surprised if that is even possible).

    My first pass is a grayscale filter, and my second "pass" is a gradient operation that needs to look at adjacent pixels. If I do both operations at once, the gradient will pick up pixels that have not yet been grayscaled. I have to wait for the grayscale filter to complete fully.


    This is my shader, btw.
    Code (CSharp):
    #pragma kernel CSMain

    // ( CPU -> GPU )
    Texture2D<float4> inputTexture;

    // ( GPU -> CPU )
    RWTexture2D<float4> outputTexture;

    int textWidth;
    int textHeight;
    // Used below but not declared in the posted listing; assumed to be a float set from C#.
    float gradientStrenght;

    [numthreads(8, 8, 1)]
    void CSMain(uint3 id : SV_DispatchThreadID)
    {
        // Source texel. The posted code referenced an undeclared "tc" for the alpha;
        // it is assumed here to be the input pixel.
        float4 tc = inputTexture[id.xy];

        // Grayscale
        float l = tc.r * 0.299 + tc.g * 0.587 + tc.b * 0.114;
        float4 outputColor = float4(l, l, l, tc.a);

        // Gradient
        int2 p = int2(id.xy);
        if (p.x == 0 || p.y == 0 || p.x == textWidth - 1 || p.y == textHeight - 1)
            outputColor = float4(0, 0, 0, tc.a);
        else
        {
            // These reads are the problem being discussed: the neighbouring texels of
            // outputTexture may not have been written by their own threads yet.
            float fSampleNegXNegY = outputTexture[uint2(p + int2(-1, -1))].r;
            float fSampleZerXNegY = outputTexture[uint2(p + int2( 0, -1))].r;
            float fSamplePosXNegY = outputTexture[uint2(p + int2( 1, -1))].r;
            float fSampleNegXZerY = outputTexture[uint2(p + int2(-1,  0))].r;
            float fSamplePosXZerY = outputTexture[uint2(p + int2( 1,  0))].r;
            float fSampleNegXPosY = outputTexture[uint2(p + int2(-1,  1))].r;
            float fSampleZerXPosY = outputTexture[uint2(p + int2( 0,  1))].r;
            float fSamplePosXPosY = outputTexture[uint2(p + int2( 1,  1))].r;

            float edgeX = (fSampleNegXNegY - fSamplePosXNegY) * 0.25 + (fSampleNegXZerY - fSamplePosXZerY) * 0.5 + (fSampleNegXPosY - fSamplePosXPosY) * 0.25;
            float edgeY = (fSampleNegXNegY - fSampleNegXPosY) * 0.25 + (fSampleZerXNegY - fSampleZerXPosY) * 0.5 + (fSamplePosXNegY - fSamplePosXPosY) * 0.25;

            float fValue = (edgeX + edgeY) / 2.0f;
            fValue = 1 - fValue / gradientStrenght;

            outputColor = float4(fValue, fValue, fValue, tc.a);
        }

        outputTexture[id.xy] = outputColor;
    }
     
  5. BattleAngelAlita

    For a group of 8*8 threads, you can load 10*10 pixels into LDS (groupshared memory) and then process them.
     
  6. Arycama

    Joined:
    May 25, 2014
    Posts:
    185
    There's no way to do a global sync in Compute Shaders; you must issue another dispatch call. However, dispatch calls are very cheap for both the CPU and the GPU, so I wouldn't be concerned about performance (as noted by Nvidia on slide 26 of this presentation: https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf).
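
    As a rough sketch of the two-dispatch approach, the shader could be split into two kernels that share an intermediate texture. The kernel names (Grayscale, Gradient), the grayTexture/grayInput resources, and every C# variable in the comments are made up for illustration; gradientStrenght is assumed to be a float set from C#.

    ```hlsl
    #pragma kernel Grayscale
    #pragma kernel Gradient

    Texture2D<float4>   inputTexture;   // source image
    RWTexture2D<float4> grayTexture;    // written by Grayscale (hypothetical name)
    Texture2D<float4>   grayInput;      // same render texture, bound read-only for Gradient
    RWTexture2D<float4> outputTexture;  // final result

    int textWidth;
    int textHeight;
    float gradientStrenght;             // assumed uniform, spelled as in the posted code

    [numthreads(8, 8, 1)]
    void Grayscale(uint3 id : SV_DispatchThreadID)
    {
        // Skip threads that fall outside the texture when its size is not a multiple of 8.
        if (id.x >= (uint)textWidth || id.y >= (uint)textHeight) return;

        float4 tc = inputTexture[id.xy];
        float l = dot(tc.rgb, float3(0.299, 0.587, 0.114));
        grayTexture[id.xy] = float4(l, l, l, tc.a);
    }

    // Reads the gray value at pixel p from the finished first pass.
    float GrayAt(int2 p)
    {
        return grayInput.Load(int3(p, 0)).r;
    }

    [numthreads(8, 8, 1)]
    void Gradient(uint3 id : SV_DispatchThreadID)
    {
        int2 p = int2(id.xy);
        if (p.x >= textWidth || p.y >= textHeight) return;

        float a = grayInput.Load(int3(p, 0)).a;
        if (p.x == 0 || p.y == 0 || p.x == textWidth - 1 || p.y == textHeight - 1)
        {
            outputTexture[id.xy] = float4(0, 0, 0, a);
            return;
        }

        float nn = GrayAt(p + int2(-1, -1)), zn = GrayAt(p + int2(0, -1)), pn = GrayAt(p + int2(1, -1));
        float nz = GrayAt(p + int2(-1,  0)),                               pz = GrayAt(p + int2(1,  0));
        float np = GrayAt(p + int2(-1,  1)), zp = GrayAt(p + int2(0,  1)), pp = GrayAt(p + int2(1,  1));

        float edgeX = (nn - pn) * 0.25 + (nz - pz) * 0.5 + (np - pp) * 0.25;
        float edgeY = (nn - np) * 0.25 + (zn - zp) * 0.5 + (pn - pp) * 0.25;
        float fValue = 1 - ((edgeX + edgeY) / 2.0) / gradientStrenght;

        outputTexture[id.xy] = float4(fValue, fValue, fValue, a);
    }

    // C# side, shown as comments only (shader, source, grayRT, outRT, w, h are hypothetical):
    //   int k0 = shader.FindKernel("Grayscale");
    //   int k1 = shader.FindKernel("Gradient");
    //   shader.SetTexture(k0, "inputTexture", source);
    //   shader.SetTexture(k0, "grayTexture", grayRT);   // grayRT needs enableRandomWrite
    //   shader.SetTexture(k1, "grayInput", grayRT);
    //   shader.SetTexture(k1, "outputTexture", outRT);
    //   shader.Dispatch(k0, (w + 7) / 8, (h + 7) / 8, 1);
    //   shader.Dispatch(k1, (w + 7) / 8, (h + 7) / 8, 1);
    ```

    The two Dispatch calls are submitted back to back, and nothing has to be read back to the CPU in between; the second dispatch is expected to see the completed output of the first.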

    Yep, though you will have divergent threads, as some will need to load/store more than one pixel. For maximum performance, it's good to avoid divergent workloads within a thread group. A better idea might be to use a 16*16 (or even 32*32) kernel size, with some overlapping work groups, so that all threads participate in loading a pixel and storing it in LDS, but only the non-border pixels write their final results to the target texture.
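
    A minimal sketch of that overlapping-tile idea, assuming a 16*16 group where neighbouring groups overlap by a one-pixel border, so each group writes only its inner 14*14 pixels. The kernel name, the TILE/INNER constants, and the tile array are illustrative, and gradientStrenght is again assumed to be a float set from C#.

    ```hlsl
    #pragma kernel GrayGradientTiled

    #define TILE  16          // threads per group side, also the LDS tile side
    #define INNER (TILE - 2)  // pixels written per group side

    Texture2D<float4>   inputTexture;
    RWTexture2D<float4> outputTexture;

    int textWidth;
    int textHeight;
    float gradientStrenght;   // assumed uniform, spelled as in the posted code

    groupshared float tile[TILE][TILE];   // grayscale values for this group's tile

    [numthreads(TILE, TILE, 1)]
    void GrayGradientTiled(uint3 groupId : SV_GroupID, uint3 localId : SV_GroupThreadID)
    {
        // Groups advance by INNER pixels, so adjacent tiles share their border pixels.
        int2 p = int2(groupId.xy) * INNER + int2(localId.xy) - 1;
        int2 q = clamp(p, int2(0, 0), int2(textWidth - 1, textHeight - 1));

        // Every thread loads exactly one pixel and stores its gray value in LDS.
        float4 tc = inputTexture.Load(int3(q, 0));
        tile[localId.y][localId.x] = dot(tc.rgb, float3(0.299, 0.587, 0.114));

        // Group-level sync is enough here: each thread only reads its own group's tile.
        GroupMemoryBarrierWithGroupSync();

        // Tile-border threads only helped with loading; they write nothing.
        uint x = localId.x;
        uint y = localId.y;
        if (x == 0 || y == 0 || x == TILE - 1 || y == TILE - 1) return;
        if (p.x >= textWidth || p.y >= textHeight) return;

        if (p.x == 0 || p.y == 0 || p.x == textWidth - 1 || p.y == textHeight - 1)
        {
            outputTexture[uint2(p)] = float4(0, 0, 0, tc.a);
            return;
        }

        float nn = tile[y - 1][x - 1], zn = tile[y - 1][x], pn = tile[y - 1][x + 1];
        float nz = tile[y][x - 1],                          pz = tile[y][x + 1];
        float np = tile[y + 1][x - 1], zp = tile[y + 1][x], pp = tile[y + 1][x + 1];

        float edgeX = (nn - pn) * 0.25 + (nz - pz) * 0.5 + (np - pp) * 0.25;
        float edgeY = (nn - np) * 0.25 + (zn - zp) * 0.5 + (pn - pp) * 0.25;
        float fValue = 1 - ((edgeX + edgeY) / 2.0) / gradientStrenght;

        outputTexture[uint2(p)] = float4(fValue, fValue, fValue, tc.a);
    }

    // Dispatch with enough groups to cover the image in INNER-sized steps, e.g.
    //   shader.Dispatch(kernel, (w + 13) / 14, (h + 13) / 14, 1);
    ```

    This only works in a single dispatch because the gradient needs a 3*3 neighbourhood that fits inside the group's own tile; an effect that needs the whole grayscaled image would still require a second dispatch.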

    Sorry if I'm misunderstanding what you're saying, but AllMemoryBarrierWithGroupSync isn't a global thread sync. It makes writes to device memory (for example to a RWBuffer or RWTexture2D, which must also be declared globallycoherent) visible to other threads, but it only synchronises execution within the current thread group.

    There's no global thread sync because:
    a) It's expensive, both architecture-wise and performance-wise.
    b) A compute shader may not execute all at once. You can declare up to 1024 threads per thread group (e.g. 32x32) and dispatch up to 65,535 thread groups per dimension, which is far more than a GPU can run simultaneously. So it will run as many thread groups as it can, and once some of them complete, it will run more.

    If there was a global sync, the GPU would need to be able to run all of your thread groups simultaneously.
     