Resolved Need help understanding compute shaders

Discussion in 'Shaders' started by Buttermilch, Dec 11, 2022.

  1. Buttermilch

    Joined:
    Nov 23, 2016
    Posts:
    33
    Using Unity 2021.3.6f1 and URP.
    I'm currently generating a lot of positions (50k) with a compute shader. I'm sending the world position of a transform to the shader, and it then just calculates some positions relative to that world position.

    My problem is that after I've iterated over 2500 positions, all the others are Vector4.zero. I don't really understand how that happens.

    Here's my compute shader code:

    Code (CSharp):
    int _range, _size;
    float4 _worldPos;
    RWStructuredBuffer<float4> _positionBuffer;

    [numthreads(8,8,1)]
    void CalcPositions(uint3 id : SV_DispatchThreadID) {
        if (id.x < 0 || id.x >= _range || id.y < 0 || id.y >= _range) { return; }

        float4 pos = _worldPos;
        pos.xz += (id.xy - float(_range) * 0.5) * _size;
        pos.y += 10;

        _positionBuffer[id.x + id.y * _size] = pos;
    }

    And the C# code:

    Code (CSharp):
    // _instances = 50000
    // _range.x = 50
    _positionShader.SetInt("_range", (int)_range.x);
    _positionShader.SetVector("_worldPos", transform.position);
    _positionShader.SetInt("_size", 2);

    // 4 * sizeof(float) = size of one Vector4
    _positionBuffer = new ComputeBuffer(_instances, 4 * sizeof(float));
    _positionShader.SetBuffer(0, "_positionBuffer", _positionBuffer);

    int threadSize = 8; //correct ??
    _positionShader.Dispatch(0, threadSize, threadSize, 1);
     
  2. Buttermilch

    I eventually found a tutorial by Ronja (https://github.com/ronja-tutorials/ShaderTutorials/blob/master/Assets/050_Compute_Shader/). Now I can calculate thread groups like this:
    Code (CSharp):
    Shader.GetKernelThreadGroupSizes(kernel, out threadGroupSize, out _, out _);
    int threadGroups = (int)((_instances + (threadGroupSize - 1)) / threadGroupSize);
    And the compute shader is configured with [numthreads(32, 1, 1)].
    I still don't really understand how that works, but I'm going to find out someday.
     
  3. burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    If your compute shader has [numthreads(X, Y, Z)] and you call Dispatch(A, B, C), it dispatches A*X threads in the X dimension, B*Y threads in the Y dimension and C*Z threads in the Z dimension, for a total of A*X*B*Y*C*Z invocations of your kernel function. Basically, on the CPU when you call Dispatch it sends out that many groups, and then the [numthreads(X, Y, Z)] tells you how many individual threads are in each group. So for your above code...

    Code (CSharp):
    // threadGroupSize = 32, because that's what your shader has in the [numthreads()]
    Shader.GetKernelThreadGroupSizes(kernel, out threadGroupSize, out _, out _);

    // threadGroups = (_instances + (32 - 1)) / 32
    // threadGroups = floor((_instances + 31) / 32)
    // threadGroups = ceil(_instances / 32)
    int threadGroups = (int)((_instances + (threadGroupSize - 1)) / threadGroupSize);
    So if _instances is 50, you dispatch 2 groups, which ends up launching 2*32=64 threads.
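
    To make that arithmetic concrete, here is a minimal sketch in plain C# (no Unity needed). The class and method names (DispatchMath, GroupCount, TotalThreads) are made up for illustration and are not part of any Unity API.

```csharp
using System;

// Hypothetical helper, for illustration only; not a Unity API.
static class DispatchMath
{
    // Ceiling division for positive integers:
    // the smallest number of groups g such that g * groupSize >= count.
    public static int GroupCount(int count, int groupSize)
    {
        return (count + groupSize - 1) / groupSize;
    }

    // Total kernel invocations for Dispatch(a, b, c)
    // when the kernel declares [numthreads(x, y, z)].
    public static long TotalThreads(int a, int b, int c, int x, int y, int z)
    {
        return (long)a * x * b * y * c * z;
    }
}
```

    With these, DispatchMath.GroupCount(50, 32) gives 2 and DispatchMath.TotalThreads(2, 1, 1, 32, 1, 1) gives 64, which matches the 50-instance example above. For 50000 instances the same formula gives 1563 groups, i.e. 50016 threads.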

    But you only have 50 instances! So what about the other 14 threads? That's why in your shader you have the check if(id.x < 0 || id.x >= _range) { return; } so that when it gives you index 53, 0, 0 or whatever, you're not going to crash.
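
    The same rounding happens per axis in a 2D kernel such as the [numthreads(8,8,1)] one from the first post. The sketch below is plain C# that only simulates the thread IDs on the CPU; GuardDemo and Count are hypothetical names, and it assumes the group count is rounded up on each axis rather than hard-coded.

```csharp
using System;

// Hypothetical CPU-side simulation, for illustration only; not a Unity API.
static class GuardDemo
{
    // For a square range x range grid and square thread groups of
    // groupSize x groupSize, returns how many threads get launched and how
    // many of them pass the guard (id.x < range && id.y < range).
    public static (int launched, int kept) Count(int range, int groupSize)
    {
        int groupsPerAxis = (range + groupSize - 1) / groupSize; // round up
        int threadsPerAxis = groupsPerAxis * groupSize;

        int kept = 0;
        for (int y = 0; y < threadsPerAxis; y++)
        {
            for (int x = 0; x < threadsPerAxis; x++)
            {
                // Mirrors the early return in the shader.
                if (x < range && y < range) { kept++; }
            }
        }
        return (threadsPerAxis * threadsPerAxis, kept);
    }
}
```

    For a range of 50 and groups of 8x8, this rounds up to 7 groups per axis, so 56*56 = 3136 threads are launched and the guard lets 50*50 = 2500 of them do any work.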

    Yes, it's confusing and probably a bad design, but we're stuck with it.

    By the way (to make it even more confusing), it's best practice to make the number of threads per group, X * Y * Z in numthreads, a multiple of 64. NVIDIA GPUs mostly dispatch things in waves of 32, but AMD GPUs dispatch things in waves of 64. So if you have [numthreads(32, 1, 1)], your AMD GPU will only use 32 of the 64 threads in each wave, and it'll be half as fast as it could be. The tl;dr here is to always use numthreads(64,1,1), numthreads(8,8,1) or numthreads(4,4,4) unless you are using groupshared memory or your kernel uses big arrays.
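
    As a rough illustration of the wave argument, the sketch below computes what share of the allocated wave lanes a group actually fills. It is a deliberately simplified model: it assumes a fixed wave size of 32 or 64 passed in by the caller and ignores everything else that affects performance. WaveMath and UtilizationPercent are hypothetical names.

```csharp
using System;

// Hypothetical helper, for illustration only; a simplified model, not a profiler.
static class WaveMath
{
    // Percentage of wave lanes occupied by one thread group of size x*y*z
    // when the hardware schedules threads in waves of waveSize lanes.
    public static int UtilizationPercent(int x, int y, int z, int waveSize)
    {
        int groupSize = x * y * z;
        int waves = (groupSize + waveSize - 1) / waveSize; // waves needed, rounded up
        return groupSize * 100 / (waves * waveSize);
    }
}
```

    Under this model, UtilizationPercent(32, 1, 1, 64) is 50, while (64, 1, 1), (8, 8, 1) and (4, 4, 4) all come out at 100 for both wave sizes.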
     
  4. Buttermilch
