Is it possible to have a multipass Compute Shader?

carcasanchez · Apr 16, 2021

I have an image to which I need to apply two effects via a Compute Shader: one for grayscaling and another for color grading. The second filter needs the first to be already applied, so I would need two passes in my Compute Shader. Is that even possible? I have been searching the web for info and have found none.
I know I can make two calls from C# to different kernels, but I would like to spare myself some CPU-to-GPU operations if I can.

bgolus · Apr 16, 2021

Yes and no. You can't do multiple separate kernels from a single dispatch, but a single kernel can do more than one thing. You can tell the code to wait until all threads are at the same point in the code with
AllMemoryBarrierWithGroupSync()
https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/allmemorybarrierwithgroupsync

BattleAngelAlita · Apr 17, 2021

Why not just apply this in one shader?

carcasanchez · Apr 19, 2021

bgolus said: ↑
Yes and no. You can't do multiple separate kernels from a single dispatch, but a single kernel can do more than one thing. You can tell the code to wait until all threads are at the same point in the code with
AllMemoryBarrierWithGroupSync()
https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/allmemorybarrierwithgroupsync
As far as I know, this won't work for me. My shader has to iterate through all of the texture's pixels, so even if I force all threads to synchronize, that doesn't guarantee the whole texture has been processed before the second "pass" is applied (unless I have one thread per pixel, and I would be surprised if that is even possible).

BattleAngelAlita said: ↑

Why not just apply this in one shader?

My first pass is a grayscale filter, and my second "pass" is a gradient operation that needs to look at adjacent pixels. If I do both operations at once, the gradient will pick up pixels that have not been grayscaled yet. I have to wait for the grayscale filter to complete fully.

This is my shader, btw.

Code (HLSL):

#pragma kernel CSMain

// Source image ( CPU -> GPU )
Texture2D<float4> inputTexture;

// Result image ( GPU -> CPU )
RWTexture2D<float4> outputTexture;

int textWidth;
int textHeight;
float gradientStrenght;

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    int2 p = int2(id.xy);
    float4 src = inputTexture[p];

    // Grayscale
    float l = src.r * 0.299 + src.g * 0.587 + src.b * 0.114;
    float3 outputColor = float3(l, l, l);

    // Gradient
    if (p.x == 0 || p.y == 0 || p.x == textWidth - 1 || p.y == textHeight - 1)
    {
        outputColor = float3(0, 0, 0);
    }
    else
    {
        // Neighbours are read from outputTexture, so they are only valid
        // if the grayscale result has already been written there.
        float fSampleNegXNegY = outputTexture[p + int2(-1, -1)].r;
        float fSampleZerXNegY = outputTexture[p + int2( 0, -1)].r;
        float fSamplePosXNegY = outputTexture[p + int2( 1, -1)].r;
        float fSampleNegXZerY = outputTexture[p + int2(-1,  0)].r;
        float fSamplePosXZerY = outputTexture[p + int2( 1,  0)].r;
        float fSampleNegXPosY = outputTexture[p + int2(-1,  1)].r;
        float fSampleZerXPosY = outputTexture[p + int2( 0,  1)].r;
        float fSamplePosXPosY = outputTexture[p + int2( 1,  1)].r;

        float edgeX = (fSampleNegXNegY - fSamplePosXNegY) * 0.25 + (fSampleNegXZerY - fSamplePosXZerY) * 0.5 + (fSampleNegXPosY - fSamplePosXPosY) * 0.25;
        float edgeY = (fSampleNegXNegY - fSampleNegXPosY) * 0.25 + (fSampleZerXNegY - fSampleZerXPosY) * 0.5 + (fSamplePosXNegY - fSamplePosXPosY) * 0.25;

        float fValue = (edgeX + edgeY) / 2.0;
        fValue = 1 - fValue / gradientStrenght;
        outputColor = float3(fValue, fValue, fValue);
    }

    // Keep the source alpha.
    outputTexture[p] = float4(outputColor, src.a);
}

BattleAngelAlita · Apr 20, 2021

carcasanchez said: ↑

adjacent

For a group of 8*8 threads, you can load 10*10 pixels into LDS (groupshared memory) and then process them.

Arycama · Apr 21, 2021

There's no way to do a global sync in Compute Shaders; you must do another dispatch call. However, dispatch calls are very cheap for both the CPU and GPU, so I wouldn't be concerned about performance (as noted by Nvidia in slide 26 of this presentation: https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf).
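
As a rough illustration of the two-dispatch approach, the shader from earlier in the thread could be split into two kernels along these lines. This is only a sketch: the kernel names (`Grayscale`, `Gradient`) and the intermediate texture names (`grayWrite`, `grayRead`) are hypothetical, and it assumes the same intermediate render texture is bound as `grayWrite` for the first kernel and as `grayRead` for the second.

```hlsl
// Sketch only: kernel and resource names below are assumptions, not from the original shader.
#pragma kernel Grayscale
#pragma kernel Gradient

Texture2D<float4>   inputTexture;   // source image
RWTexture2D<float4> grayWrite;      // intermediate texture, written by Grayscale
Texture2D<float4>   grayRead;       // the same intermediate texture, read by Gradient
RWTexture2D<float4> outputTexture;  // final result

int   textWidth;
int   textHeight;
float gradientStrenght;             // spelling kept from the shader posted above

// Pass 1: one thread per pixel converts the source to luminance.
[numthreads(8, 8, 1)]
void Grayscale(uint3 id : SV_DispatchThreadID)
{
    int2 p = int2(id.xy);
    if (p.x >= textWidth || p.y >= textHeight) return;

    float4 src = inputTexture[p];
    float l = dot(src.rgb, float3(0.299, 0.587, 0.114));
    grayWrite[p] = float4(l, l, l, src.a);
}

// Pass 2: issued as a later dispatch, so it reads the finished grayscale image.
[numthreads(8, 8, 1)]
void Gradient(uint3 id : SV_DispatchThreadID)
{
    int2 p = int2(id.xy);
    if (p.x >= textWidth || p.y >= textHeight) return;

    float alpha = grayRead[p].a;
    float value = 0;

    bool isBorder = p.x == 0 || p.y == 0 || p.x == textWidth - 1 || p.y == textHeight - 1;
    if (!isBorder)
    {
        float nn = grayRead[p + int2(-1, -1)].r;
        float zn = grayRead[p + int2( 0, -1)].r;
        float pn = grayRead[p + int2( 1, -1)].r;
        float nz = grayRead[p + int2(-1,  0)].r;
        float pz = grayRead[p + int2( 1,  0)].r;
        float np = grayRead[p + int2(-1,  1)].r;
        float zp = grayRead[p + int2( 0,  1)].r;
        float pp = grayRead[p + int2( 1,  1)].r;

        float edgeX = (nn - pn) * 0.25 + (nz - pz) * 0.5 + (np - pp) * 0.25;
        float edgeY = (nn - np) * 0.25 + (zn - zp) * 0.5 + (pn - pp) * 0.25;
        value = 1 - ((edgeX + edgeY) / 2.0) / gradientStrenght;
    }

    outputTexture[p] = float4(value, value, value, alpha);
}
```

On the C# side, each kernel would get its own `ComputeShader.Dispatch` call, issued in order, with the group counts rounded up so the whole texture is covered (for example `(width + 7) / 8` groups in X for an 8-wide thread group).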

BattleAngelAlita said: ↑

For a group of 8*8 threads, you can load 10*10 pixels into LDS (groupshared memory) and then process them.

Yep, though you will have divergent threads, as some will need to load/store more than one pixel. For maximum performance, it's good to avoid divergent workloads within a thread group. A better idea might be to use a 16*16 (or even 32*32) thread group size with overlapping thread groups, so that every thread loads one pixel and stores it in LDS, but only the non-border threads write their final results to the target texture.
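
A minimal sketch of that overlapping-group idea, under some assumptions: the kernel name `GrayGradientTiled` and the `TILE`/`INNER` macros are hypothetical, the resource names follow the shader posted earlier in the thread, and the host is assumed to dispatch `ceil(textWidth / 14)` by `ceil(textHeight / 14)` groups, because each 16*16 group only writes its inner 14*14 pixels.

```hlsl
// Sketch only: kernel name and macros are assumptions, not from the original shader.
#pragma kernel GrayGradientTiled

#define TILE  16            // threads per group side, one-pixel border included
#define INNER (TILE - 2)    // pixels per group side that are actually written

Texture2D<float4>   inputTexture;
RWTexture2D<float4> outputTexture;

int   textWidth;
int   textHeight;
float gradientStrenght;     // spelling kept from the shader posted above

// Grayscale values for this group's tile, shared by all of its threads (LDS).
groupshared float tile[TILE][TILE];

[numthreads(TILE, TILE, 1)]
void GrayGradientTiled(uint3 gid : SV_GroupID, uint3 tid : SV_GroupThreadID)
{
    int2 t = int2(tid.xy);

    // Neighbouring groups overlap by one pixel on every side.
    int2 p = int2(gid.xy) * INNER + t - 1;
    int2 q = clamp(p, int2(0, 0), int2(textWidth - 1, textHeight - 1));

    // Every thread loads exactly one pixel and stores its luminance in LDS.
    float4 src = inputTexture[q];
    tile[t.y][t.x] = dot(src.rgb, float3(0.299, 0.587, 0.114));

    // Group-level sync: after this, the whole tile is grayscaled.
    GroupMemoryBarrierWithGroupSync();

    // Tile-border threads only feed their neighbours; they write nothing.
    bool isTileBorder = t.x == 0 || t.y == 0 || t.x == TILE - 1 || t.y == TILE - 1;
    bool isOutside    = p.x >= textWidth || p.y >= textHeight;
    if (isTileBorder || isOutside) return;

    float value = 0;
    bool isImageBorder = p.x == 0 || p.y == 0 || p.x == textWidth - 1 || p.y == textHeight - 1;
    if (!isImageBorder)
    {
        int x = t.x;
        int y = t.y;
        float edgeX = (tile[y - 1][x - 1] - tile[y - 1][x + 1]) * 0.25
                    + (tile[y    ][x - 1] - tile[y    ][x + 1]) * 0.5
                    + (tile[y + 1][x - 1] - tile[y + 1][x + 1]) * 0.25;
        float edgeY = (tile[y - 1][x - 1] - tile[y + 1][x - 1]) * 0.25
                    + (tile[y - 1][x    ] - tile[y + 1][x    ]) * 0.5
                    + (tile[y - 1][x + 1] - tile[y + 1][x + 1]) * 0.25;
        value = 1 - ((edgeX + edgeY) / 2.0) / gradientStrenght;
    }

    outputTexture[p] = float4(value, value, value, src.a);
}
```

Because the gradient only needs a 3*3 neighbourhood, the group-level barrier is enough here and both effects fit in one dispatch. The cost is that the border pixels of each tile are converted to grayscale by more than one group.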
bgolus said: ↑
Yes and no. You can't do multiple separate kernels from a single dispatch, but a single kernel can do more than one thing. You can tell the code to wait until all threads are at the same point in the code with
AllMemoryBarrierWithGroupSync()
https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/allmemorybarrierwithgroupsync
Sorry if I'm misunderstanding what you're saying, but AllMemoryBarrierWithGroupSync isn't a global thread sync. It makes writes to device memory visible to other threads (for example, writes to a RWBuffer/RWTexture, which must also be marked globallycoherent), but it only synchronises execution within the current thread group.

There's no global thread sync because:
a) It's expensive, both architecturally and performance-wise.
b) A compute shader may not execute all at once. You can declare up to 1024 threads per thread group (for example 32x32) and dispatch up to 65535 thread groups per dimension, which is far more than a GPU can run simultaneously. So it will run as many thread groups as it can, and once some of them complete, it will run more.

If there was a global sync, the GPU would need to be able to run all of your thread groups simultaneously.
