
Check if a ComputeShader.Dispatch() command is completed on GPU before doing second kernel dispatch

Discussion in 'General Graphics' started by joergzdarsky, Nov 23, 2015.

  1. joergzdarsky

    joergzdarsky

    Joined:
    Sep 25, 2013
    Posts:
    56
    Hi,

    I am trying to move from my CPU-based procedural planet generation approach to a GPU-based one (when it comes to plane calculation and rendering).
    Being fairly new to shader programming, I am now at the stage where I can hand over generation constants in a buffer to a compute shader, precalculate the plane in the compute shader (vertex positions, normals, based on noise) and hand them over via the buffer for rendering (replacing the vertex positions of a prototype mesh with the ones from the buffer).



    But right now I haven't been able to render more than one plane at once. I guess this is due to some dependency I am not aware of (maybe the order of creating the buffers, materials, and DrawMesh calls), or maybe I need one GameObject per DrawMesh call after all?

    Any hint on what I might be doing wrong would be very helpful and appreciated.

    So right now my (not working) approach is: I moved the buffers into the **QuadtreeTerrain** class (a quadtree node), as well as the material (not sure if individual materials are necessary).

    Code (CSharp):

    class QuadtreeTerrain {
        // Quadtree classes
        public QuadtreeTerrain parentNode; // The parent quadtree node
        public QuadtreeTerrain childNode1; // A child quadtree node
        public QuadtreeTerrain childNode2; // A child quadtree node
        public QuadtreeTerrain childNode3; // A child quadtree node
        public QuadtreeTerrain childNode4; // A child quadtree node
        // Buffers
        public ComputeBuffer generationConstantsBuffer;
        public ComputeBuffer patchGeneratedDataBuffer;
        // Material
        public Material material;
        ....
    }

    In the **SpaceObjectProceduralPlanet** script, applied to a single game object, I then hold six instances of QuadtreeTerrain.

    Code (CSharp):

    public class SpaceObjectProceduralPlanet : MonoBehaviour {
        ....
        // QuadtreeTerrain
        private QuadtreeTerrain quadtreeTerrain1;
        private QuadtreeTerrain quadtreeTerrain2;
        private QuadtreeTerrain quadtreeTerrain3;
        private QuadtreeTerrain quadtreeTerrain4;
        private QuadtreeTerrain quadtreeTerrain5;
        private QuadtreeTerrain quadtreeTerrain6;

        // We initialize the buffers and the material used to draw.
        void Start()
        {
            ...
            // QuadtreeTerrain
            this.quadtreeTerrain1 = new QuadtreeTerrain(0, edgeVector1, edgeVector2, edgeVector3, edgeVector4, quadtreeTerrainParameter1);
            this.quadtreeTerrain2 = new QuadtreeTerrain(0, edgeVector2, edgeVector5, edgeVector4, edgeVector7, quadtreeTerrainParameter2);
            this.quadtreeTerrain3 = new QuadtreeTerrain(0, edgeVector5, edgeVector6, edgeVector7, edgeVector8, quadtreeTerrainParameter3);
            this.quadtreeTerrain4 = new QuadtreeTerrain(0, edgeVector6, edgeVector1, edgeVector8, edgeVector3, quadtreeTerrainParameter4);
            this.quadtreeTerrain5 = new QuadtreeTerrain(0, edgeVector6, edgeVector5, edgeVector1, edgeVector2, quadtreeTerrainParameter5);
            this.quadtreeTerrain6 = new QuadtreeTerrain(0, edgeVector3, edgeVector4, edgeVector8, edgeVector7, quadtreeTerrainParameter6);
            CreateBuffers(this.quadtreeTerrain1);
            CreateBuffers(this.quadtreeTerrain2);
            CreateBuffers(this.quadtreeTerrain3);
            CreateBuffers(this.quadtreeTerrain4);
            CreateBuffers(this.quadtreeTerrain5);
            CreateBuffers(this.quadtreeTerrain6);
            CreateMaterial(this.quadtreeTerrain1);
            CreateMaterial(this.quadtreeTerrain2);
            CreateMaterial(this.quadtreeTerrain3);
            CreateMaterial(this.quadtreeTerrain4);
            CreateMaterial(this.quadtreeTerrain5);
            CreateMaterial(this.quadtreeTerrain6);
            Dispatch(this.quadtreeTerrain1);
            Dispatch(this.quadtreeTerrain2);
            Dispatch(this.quadtreeTerrain3);
            Dispatch(this.quadtreeTerrain4);
            Dispatch(this.quadtreeTerrain5);
            Dispatch(this.quadtreeTerrain6);
        }

        // We create the buffers.
        void CreateBuffers(QuadtreeTerrain quadtreeTerrain)
        {
            .... preparing generation constants
            quadtreeTerrain.generationConstantsBuffer.SetData(generationConstants);
            // Buffer Output
            quadtreeTerrain.patchGeneratedDataBuffer = new ComputeBuffer(nVerts, 16 + 12 + 4 + 12);
        }

        // We create the material.
        void CreateMaterial(QuadtreeTerrain quadtreeTerrain)
        {
            Material material = new Material(shader);
            material.SetTexture("_MainTex", this.texture);
            material.SetFloat("_Metallic", 0);
            material.SetFloat("_Glossiness", 0);
            quadtreeTerrain.material = material;
        }

        // We dispatch threads of our CSMain1 and CSMain2 kernels.
        void Dispatch(QuadtreeTerrain quadtreeTerrain)
        {
            // Set Buffers
            computeShader.SetBuffer(_kernel, "generationConstantsBuffer", quadtreeTerrain.generationConstantsBuffer);
            computeShader.SetBuffer(_kernel, "patchGeneratedDataBuffer", quadtreeTerrain.patchGeneratedDataBuffer);
            // Dispatch first kernel
            _kernel = computeShader.FindKernel("CSMain1");
            computeShader.Dispatch(_kernel, THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);
            // Dispatch second kernel
            _kernel = computeShader.FindKernel("CSMain2");
            computeShader.Dispatch(_kernel, THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);
        }

        // We set the material before drawing and call DrawMesh in OnRenderObject.
        void OnRenderObject()
        {
            this.quadtreeTerrain1.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain1.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain1.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);

            this.quadtreeTerrain2.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain2.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain2.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);

            this.quadtreeTerrain3.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain3.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain3.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);

            this.quadtreeTerrain4.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain4.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain4.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);

            this.quadtreeTerrain5.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain5.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain5.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);

            this.quadtreeTerrain6.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain6.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain6.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);
        }

        // When this GameObject is disabled we must release the buffers.
        private void OnDisable()
        {
            ReleaseBuffer();
        }

        // Release buffers and destroy the materials when play has been stopped.
        void ReleaseBuffer()
        {
            // Destroy everything recursively in the quadtrees.
            this.quadtreeTerrain1.generationConstantsBuffer.Release();
            this.quadtreeTerrain1.patchGeneratedDataBuffer.Release();
            this.quadtreeTerrain2.generationConstantsBuffer.Release();
            this.quadtreeTerrain2.patchGeneratedDataBuffer.Release();
            this.quadtreeTerrain3.generationConstantsBuffer.Release();
            this.quadtreeTerrain3.patchGeneratedDataBuffer.Release();
            this.quadtreeTerrain4.generationConstantsBuffer.Release();
            this.quadtreeTerrain4.patchGeneratedDataBuffer.Release();
            this.quadtreeTerrain5.generationConstantsBuffer.Release();
            this.quadtreeTerrain5.patchGeneratedDataBuffer.Release();
            this.quadtreeTerrain6.generationConstantsBuffer.Release();
            this.quadtreeTerrain6.patchGeneratedDataBuffer.Release();
            DestroyImmediate(this.quadtreeTerrain1.material);
            DestroyImmediate(this.quadtreeTerrain2.material);
            DestroyImmediate(this.quadtreeTerrain3.material);
            DestroyImmediate(this.quadtreeTerrain4.material);
            DestroyImmediate(this.quadtreeTerrain5.material);
            DestroyImmediate(this.quadtreeTerrain6.material);
        }

        void Update() {
            // Do nothing
        }

    }

    Of course this is very brute-force, but it should work before I proceed, as I need to figure out how to handle the buffers and draw calls and where to put them.
     
    bb8_1 likes this.
  2. joergzdarsky

    joergzdarsky

    Joined:
    Sep 25, 2013
    Posts:
    56
    I tried to get closer to the problem. It seems to depend on which Dispatch() call I do first to precalculate a compute buffer in one compute shader before it is sent to the vertex buffer.
    It seems the first ComputeShader.Dispatch() call overrides all following ones, as only the results from the first call are drawn, although to my understanding I am using different buffers.
    Edit: To be more precise: both meshes are drawn, but they seem to share the same locations and probably the same buffer. I noticed that because the rendered triangles doubled with each Graphics.DrawMesh added.

    Each "QuadtreeTerrain" class has two compute buffer references.

    Code (CSharp):

    class QuadtreeTerrain {
        public ComputeBuffer generationConstantsBuffer;
        public ComputeBuffer patchGeneratedDataBuffer;
    }

    In SpaceObjectProceduralPlanet I initialize the buffers by calling CreateBuffers(QuadtreeTerrain) in Start(); inside that function the buffers of each object are created ("new ComputeBuffer()"). Afterwards I dispatch for each object by calling Dispatch(QuadtreeTerrain), also in Start().
    Then, in OnRenderObject(), I send the buffers to the renderer.
    For ease of reading I reduced the setup to two quadtree terrains and their buffers.
    Any hint why the first Dispatch() call overrides all others is very much appreciated.

    Code (CSharp):

    using UnityEngine;
    using System.Collections;
    using System.Threading;
    using System.Collections.Generic;

    [RequireComponent(typeof(GameObject))]
    public class SpaceObjectProceduralPlanet : MonoBehaviour {

        public int seed;
        public Position position;
        public string name;
        public float radius;
        public float diameter;
        public Transform m_Transform;
        private int LOD;
        // Primitive
        private AbstractPrimitive primitive;
        private enum PrimitiveState { IN_PRECALCULATION, PRECALCULATED, DONE };
        private PrimitiveState primitiveState;
        // QuadtreeTerrain
        private QuadtreeTerrain quadtreeTerrain1;
        private QuadtreeTerrain quadtreeTerrain2;
        // Plane Template
        public Mesh prototypeMesh;
        public Mesh prototypeMesh2;
        public Plane plane;
        public Texture2D texture;
        // ComputeShader
        public Shader shader;
        public ComputeShader computeShader;
        private ComputeBuffer generationConstantsBuffer;
        private ComputeBuffer patchGeneratedDataBuffer;
        private int _kernel;
        // Constants
        public static int nVertsPerEdge { get { return 224; } }     // Should be a multiple of 32
        public static int nVerts { get { return nVertsPerEdge * nVertsPerEdge; } }
        public int THREADS_PER_GROUP_X { get { return 32; } }
        public int THREADS_PER_GROUP_Y { get { return 32; } }
        public int THREADGROUP_SIZE_X { get { return nVertsPerEdge / THREADS_PER_GROUP_X; } }
        public int THREADGROUP_SIZE_Y { get { return nVertsPerEdge / THREADS_PER_GROUP_Y; } }
        public int THREADGROUP_SIZE_Z { get { return 1; } }

        struct PatchGenerationConstantsStruct
        {
            public int nVertsPerEdge;
            public float scale;
            public float spacing;
            public Vector3 patchCubeCenter;
            public Vector3 cubeFaceEastDirection;
            public Vector3 cubeFaceNorthDirection;
            public float planetRadius;
            public float terrainMaxHeight;
            public float noiseSeaLevel;
            public float noiseSnowLevel;
        }

        struct patchGeneratedDataStruct
        {
            public Vector4 position;
            public Vector3 normal;
            public float noise;
            public Vector3 patchCenter;
        }

        // Initial call. We set up the shaders and prototype meshes here.
        void Awake () {
            // Transform
            m_Transform = transform;

            // Mesh prototype
            this.prototypeMesh = MeshServiceProvider.setupNavyFishDummyMesh(nVertsPerEdge);
            this.prototypeMesh2 = MeshServiceProvider.setupNavyFishDummyMesh(nVertsPerEdge);
            // Plane Template (not used right now as we have the prototype mesh)
            this.plane = new Plane(nVertsPerEdge, Vector3.back);
            // Shader
            this.shader = Shader.Find("Custom/ProceduralPatch3");
            // ComputeShader
            this.computeShader = (ComputeShader)Resources.Load("Shaders/Space/Planet/Custom/ProceduralPatchCompute3");
            // Texture
            this.texture = (Texture2D)Resources.Load("Textures/space/planets/seamless/QuadtreeTerrainTexture.MugDry_1024") as Texture2D;
        }

        // We initialize the buffers and the material used to draw.
        void Start()
        {
            // Edge coordinates for initialization
            Vector3 edgeVector1 = new Vector3(-1, +1, -1);
            Vector3 edgeVector2 = new Vector3(+1, +1, -1);
            Vector3 edgeVector3 = new Vector3(-1, -1, -1);
            Vector3 edgeVector4 = new Vector3(+1, -1, -1);
            Vector3 edgeVector5 = new Vector3(+1, +1, +1);
            Vector3 edgeVector6 = new Vector3(-1, +1, +1);
            Vector3 edgeVector7 = new Vector3(+1, -1, +1);
            Vector3 edgeVector8 = new Vector3(-1, -1, +1);
            // Parameters
            QuadtreeTerrainParameter parameter = new QuadtreeTerrainParameter();
            parameter.nVertsPerEdge = nVertsPerEdge;
            parameter.scale = 2.0f / nVertsPerEdge;
            parameter.spacing = 2.0f / nVertsPerEdge;
            parameter.planetRadius = 6371.0f; // 6371000.0f; = earth
            parameter.terrainMaxHeight = 15.0f;
            parameter.noiseSeaLevel = 0.0f;
            parameter.noiseSnowLevel = 0.8f;
            QuadtreeTerrainParameter quadtreeTerrainParameter1 = parameter.clone();
            quadtreeTerrainParameter1.cubeFaceEastDirection = new Vector3(1, 0, 0);
            quadtreeTerrainParameter1.cubeFaceNorthDirection = new Vector3(0, 1, 0);
            QuadtreeTerrainParameter quadtreeTerrainParameter2 = parameter.clone();
            quadtreeTerrainParameter2.cubeFaceEastDirection = new Vector3(0, 0, 1);
            quadtreeTerrainParameter2.cubeFaceNorthDirection = new Vector3(0, 1, 0);
            // QuadtreeTerrain
            this.quadtreeTerrain1 = new QuadtreeTerrain(0, edgeVector1, edgeVector2, edgeVector3, edgeVector4, quadtreeTerrainParameter1);
            this.quadtreeTerrain2 = new QuadtreeTerrain(0, edgeVector2, edgeVector5, edgeVector4, edgeVector7, quadtreeTerrainParameter2);
            CreateBuffers(this.quadtreeTerrain1);
            CreateBuffers(this.quadtreeTerrain2);
            CreateMaterial(this.quadtreeTerrain1);
            CreateMaterial(this.quadtreeTerrain2);

            // Only the mesh of the first Dispatch(..) call is drawn. E.g. if the first call is commented out, the second mesh (quadtreeTerrain2) is drawn.
            //Dispatch(this.quadtreeTerrain1);
            Dispatch(this.quadtreeTerrain2);
        }

        void Update()
        {

        }

        // We create the buffers.
        void CreateBuffers(QuadtreeTerrain quadtreeTerrain)
        {
            // Buffer Patch Generation Constants
            quadtreeTerrain.generationConstantsBuffer = new ComputeBuffer(4, // 1x int (4 bytes) for one index, index = 0
                4 +     // nVertsPerEdge (int = 4 bytes),
                4 +     // scale (float = 4 bytes),
                4 +     // spacing (float = 4 bytes),
                12 +    // patchCubeCenter (float3 = 12 bytes),
                12 +    // cubeFaceEastDirection (float3 = 12 bytes),
                12 +    // cubeFaceNorthDirection (float3 = 12 bytes),
                4 +     // planetRadius (float = 4 bytes),
                4 +     // terrainMaxHeight (float = 4 bytes),
                4 +     // noiseSeaLevel (float = 4 bytes),
                4);     // noiseSnowLevel (float = 4 bytes)
            PatchGenerationConstantsStruct[] generationConstants = new PatchGenerationConstantsStruct[1];
            generationConstants[0].nVertsPerEdge = quadtreeTerrain.parameters.nVertsPerEdge;
            generationConstants[0].scale = quadtreeTerrain.parameters.scale;
            generationConstants[0].spacing = quadtreeTerrain.parameters.spacing;
            generationConstants[0].patchCubeCenter = quadtreeTerrain.centerVector;
            generationConstants[0].cubeFaceEastDirection = quadtreeTerrain.parameters.cubeFaceEastDirection;
            generationConstants[0].cubeFaceNorthDirection = quadtreeTerrain.parameters.cubeFaceNorthDirection;
            generationConstants[0].planetRadius = quadtreeTerrain.parameters.planetRadius;
            generationConstants[0].terrainMaxHeight = quadtreeTerrain.parameters.terrainMaxHeight;
            generationConstants[0].noiseSeaLevel = quadtreeTerrain.parameters.noiseSeaLevel;
            generationConstants[0].noiseSnowLevel = quadtreeTerrain.parameters.noiseSnowLevel;
            quadtreeTerrain.generationConstantsBuffer.SetData(generationConstants);
            // Buffer Output
            quadtreeTerrain.patchGeneratedDataBuffer = new ComputeBuffer(nVerts, 16 + 12 + 4 + 12); // Output buffer contains vertex position (float4 = 16 bytes),
                                                                                                    // normal (float3 = 12 bytes),
                                                                                                    // noise (float = 4 bytes),
                                                                                                    // patchCenter (float3 = 12 bytes)
        }

        // We create the material.
        void CreateMaterial(QuadtreeTerrain quadtreeTerrain)
        {
            quadtreeTerrain.material = new Material(shader);
            quadtreeTerrain.material.SetTexture("_MainTex", this.texture);
            quadtreeTerrain.material.SetFloat("_Metallic", 0);
            quadtreeTerrain.material.SetFloat("_Glossiness", 0);
        }

        // The meat of this script, it sets the buffers for the compute shader.
        // We then dispatch threads of our CSMain1 and CSMain2 kernels.
        void Dispatch(QuadtreeTerrain quadtreeTerrain)
        {
            // Set Buffers
            computeShader.SetBuffer(_kernel, "generationConstantsBuffer", quadtreeTerrain.generationConstantsBuffer);
            computeShader.SetBuffer(_kernel, "patchGeneratedDataBuffer", quadtreeTerrain.patchGeneratedDataBuffer);
            // Dispatch first kernel
            _kernel = computeShader.FindKernel("CSMain1");
            computeShader.Dispatch(_kernel, THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);
            // Dispatch second kernel
            _kernel = computeShader.FindKernel("CSMain2");
            computeShader.Dispatch(_kernel, THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);
        }

        // After all rendering is complete we dispatch the compute shader and then set the material before drawing.
        void OnRenderObject()
        {
            this.quadtreeTerrain1.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain1.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain1.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);
            this.quadtreeTerrain2.material.SetBuffer("patchGeneratedDataBuffer", this.quadtreeTerrain2.patchGeneratedDataBuffer);
            Graphics.DrawMesh(this.prototypeMesh, transform.localToWorldMatrix, this.quadtreeTerrain2.material, LayerMask.NameToLayer(GlobalVariablesManager.Instance.layerLocalSpaceName), null, 0, null, true, true);
        }

        // When this GameObject is disabled we must release the buffers.
        private void OnDisable()
        {
            ReleaseBuffer();
        }

        // Release buffers and destroy the materials when play has been stopped.
        void ReleaseBuffer()
        {
            // Destroy everything recursively in the quadtrees.
            this.quadtreeTerrain1.generationConstantsBuffer.Release();
            this.quadtreeTerrain1.patchGeneratedDataBuffer.Release();
            this.quadtreeTerrain2.generationConstantsBuffer.Release();
            this.quadtreeTerrain2.patchGeneratedDataBuffer.Release();
            DestroyImmediate(this.quadtreeTerrain1.material);
            DestroyImmediate(this.quadtreeTerrain2.material);
        }

    }
     
    Last edited: Nov 25, 2015
    bb8_1 likes this.
  3. joergzdarsky

    joergzdarsky

    Joined:
    Sep 25, 2013
    Posts:
    56
    The problem seems to center on the second Dispatch() call to the second kernel, which I do immediately after the first one.

    In CSMain1 I initially calculate the position of a vertex based on some noise.
    In CSMain2 I want to calculate the normals and some other things (terrain type etc.).

    My problem:
    I am not sure when I can do the second Dispatch() call to the second kernel.
    If I use the following lines of code, the planes (calculated in the first kernel, CSMain1) do not show up correctly.

    // Set Buffers CSMain1
    computeShader.SetBuffer(_kernel[0], "generationConstantsBuffer", quadtreeTerrain.generationConstantsBuffer);
    computeShader.SetBuffer(_kernel[0], "patchGeneratedDataBuffer", quadtreeTerrain.patchGeneratedDataBuffer);
    // Dispatch first kernel CSMain1
    computeShader.Dispatch(_kernel[0], THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);
    // Set Buffers CSMain2
    computeShader.SetBuffer(_kernel[1], "generationConstantsBuffer", quadtreeTerrain.generationConstantsBuffer);
    computeShader.SetBuffer(_kernel[1], "patchGeneratedDataBuffer", quadtreeTerrain.patchGeneratedDataBuffer);
    // Dispatch second kernel CSMain2
    computeShader.Dispatch(_kernel[1], THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);

    It works when I comment the second Dispatch() call out.

    // Set Buffers CSMain1
    computeShader.SetBuffer(_kernel[0], "generationConstantsBuffer", quadtreeTerrain.generationConstantsBuffer);
    computeShader.SetBuffer(_kernel[0], "patchGeneratedDataBuffer", quadtreeTerrain.patchGeneratedDataBuffer);
    // Dispatch first kernel CSMain1
    computeShader.Dispatch(_kernel[0], THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);
    // Set Buffers CSMain2
    //computeShader.SetBuffer(_kernel[1], "generationConstantsBuffer", quadtreeTerrain.generationConstantsBuffer);
    //computeShader.SetBuffer(_kernel[1], "patchGeneratedDataBuffer", quadtreeTerrain.patchGeneratedDataBuffer);
    // Dispatch second kernel CSMain2
    //computeShader.Dispatch(_kernel[1], THREADGROUP_SIZE_X, THREADGROUP_SIZE_Y, THREADGROUP_SIZE_Z);

    I guess the problem is that the second C# Dispatch() call to the second kernel is issued (although nothing happens in the second kernel at the moment) while the first kernel is still being executed.

    How do you determine and orchestrate the Dispatch() calls of two or more kernels on the CPU in C# code in Unity?
     
  4. joergzdarsky

    joergzdarsky

    Joined:
    Sep 25, 2013
    Posts:
    56
    Both kernels, and especially the second stage / second kernel, are now correctly invoked. For the next person who comes across this problem:
    The error was that in the compute shader I had both pragma definitions at the top, and both functions after them. Like:

    Code (CSharp):

    #pragma kernel CSMain1
    #pragma kernel CSMain2

    [numthreads(threadsPerGroup_X,threadsPerGroup_Y,1)]
    void CSMain1 (uint3 id : SV_DispatchThreadID)
    {
        // code
    }

    void CSMain2 (uint3 id : SV_DispatchThreadID)
    {
        // code
    }
    Things started to work when I put the code in a different order:

    Code (CSharp):

    #pragma kernel CSMain1

    [numthreads(threadsPerGroup_X,threadsPerGroup_Y,1)]
    void CSMain1 (uint3 id : SV_DispatchThreadID)
    {
        // code
    }

    #pragma kernel CSMain2

    [numthreads(threadsPerGroup_X,threadsPerGroup_Y,1)]
    void CSMain2 (uint3 id : SV_DispatchThreadID)
    {
        // code
    }
    There are a few (of the only few) compute shader tutorials around which describe my initial implementation above, which didn't work. So anyone who has the same problem as me might try changing the order of the code lines as shown above.
     
    bb8_1 likes this.
  5. sirshelley

    sirshelley

    Joined:
    Aug 14, 2014
    Posts:
    26
    Hello there! I have had similar issues, and here is my solution:
    Don't use GetData in real time! (It took me 2 months to find the reason, which I can explain another time if you like.)
    Instead:
    Make an array of compute buffers, a minimum of 2: one for read, one for write.
    Declare a RW structured buffer in the compute shader, filled with junk data that gets overwritten by the compute.
    On the next frame (this is important; I suggest yielding on WaitForEndOfFrame), copy the contents of the write buffer into the read buffer.
    Then use these cloned buffers for whatever you need; GetData does work on static buffers in this case.
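    A minimal C# sketch of that double-buffering idea (all names here are placeholders, not from this thread; it assumes a kernel with numthreads(64,1,1) that writes one float4 per element into "resultBuffer"). Instead of copying the write buffer into the read buffer each frame, this version simply swaps the two references, which gives the same separation between the buffer being written and the buffer being read:

    Code (CSharp):

    using UnityEngine;

    public class PingPongBufferExample : MonoBehaviour
    {
        public ComputeShader compute;   // assumed to have a "CSMain" kernel writing to "resultBuffer"
        public Material material;       // assumed to read "resultBuffer"

        const int count = 1024;
        ComputeBuffer readBuffer, writeBuffer;
        int kernel;

        void OnEnable()
        {
            readBuffer  = new ComputeBuffer(count, sizeof(float) * 4);
            writeBuffer = new ComputeBuffer(count, sizeof(float) * 4);
            kernel = compute.FindKernel("CSMain");
        }

        void Update()
        {
            // The GPU writes into the "write" buffer this frame...
            compute.SetBuffer(kernel, "resultBuffer", writeBuffer);
            compute.Dispatch(kernel, count / 64, 1, 1);   // assumes numthreads(64,1,1)

            // ...while rendering (or a debug GetData) only ever touches last frame's results.
            material.SetBuffer("resultBuffer", readBuffer);

            // Swap roles for the next frame instead of copying the data across.
            var tmp = readBuffer;
            readBuffer = writeBuffer;
            writeBuffer = tmp;
        }

        void OnDisable()
        {
            readBuffer.Release();
            writeBuffer.Release();
        }
    }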
     
  6. ModLunar

    ModLunar

    Joined:
    Oct 16, 2016
    Posts:
    374
    What do you mean, don't use GetData(...) in realtime? Is there a way to get the data without calling GetData(...) that we should use instead?
     
  7. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    246
    What happens if the ComputeShader takes more than 2 frames...?
     
  8. Cambesa

    Cambesa

    Joined:
    Jun 6, 2011
    Posts:
    119
    Can today be another time? I'm also trying to run a shader multiple times after each other, but I cannot find a way to check whether a shader is done. Does GetData() wait for the shader to complete? I'm also trying to run this in the editor, so I cannot use yield return new WaitForEndOfFrame().
     
  9. sirshelley

    sirshelley

    Joined:
    Aug 14, 2014
    Posts:
    26
    GetData simply overwrites the CPU buffer with whatever is available. The best way to ensure you have completion is a loop and an extra variable for a checksum, then continue.
     
  10. Darren-R

    Darren-R

    Joined:
    Feb 11, 2014
    Posts:
    66
    Hey sirshelley, I am trying to figure this out but I'm really new to compute shaders. How can I make the checksum variable and extract it?

    Cheers!
     
  11. ModLunar

    ModLunar

    Joined:
    Oct 16, 2016
    Posts:
    374
    'Makes you really wonder where people find this information out...
     
  12. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,285
    Yes it does.

    No it doesn't.

    Let's try and dispel some myths here.... :)

    If you dispatch a ComputeShader, then call ComputeBuffer.GetData on a buffer written by the shader, you should see the data that the ComputeShader wrote. There is no "it's still in progress" or anything like that. It should be the data as it is after the ComputeShader ran. No exceptions. Anything different is a bug, which should be reported.

    FWIW, you shouldn't use ComputeBuffer.GetData for anything other than debugging purposes, because reading data back from the GPU is slow, and doing it immediately after dispatching the ComputeShader is making the CPU wait for the GPU to finish doing something. Graphics pipelines are not designed to send data back to the CPU quickly. GPUs like to be told what to do by CPUs, and then left to get on with it. The exception to this guideline is if you use the AsyncGPUReadback API. This API lets you ask the GPU to send something back, without waiting for it to happen. Then, it's up to you to ask if it's done yet, and not block the CPU if the data isn't ready yet. The GPU will usually send it back in something like 1-3 frames. If you're in a situation where you feel like you need the data back immediately, it may be time to rethink what you're doing and do some redesigning of your algorithm.
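    A rough sketch of that AsyncGPUReadback pattern (the buffer, kernel and property names are made up for illustration, and it assumes a kernel with numthreads(64,1,1)): dispatch the compute shader, request the readback without blocking, then poll the request on later frames instead of calling GetData.

    Code (CSharp):

    using Unity.Collections;
    using UnityEngine;
    using UnityEngine.Rendering;

    public class AsyncReadbackExample : MonoBehaviour
    {
        public ComputeShader compute;   // assumed to have a "CSMain" kernel writing to "resultBuffer"

        ComputeBuffer resultBuffer;
        AsyncGPUReadbackRequest request;
        bool requestPending;

        void Start()
        {
            resultBuffer = new ComputeBuffer(1024, sizeof(float));
            int kernel = compute.FindKernel("CSMain");
            compute.SetBuffer(kernel, "resultBuffer", resultBuffer);
            compute.Dispatch(kernel, 1024 / 64, 1, 1);   // assumes numthreads(64,1,1)

            // Ask for the data without blocking; the GPU delivers it a frame or a few later.
            request = AsyncGPUReadback.Request(resultBuffer);
            requestPending = true;
        }

        void Update()
        {
            if (requestPending && request.done)
            {
                requestPending = false;
                if (!request.hasError)
                {
                    NativeArray<float> data = request.GetData<float>();
                    Debug.Log("Readback arrived, first value: " + data[0]);
                }
            }
        }

        void OnDestroy()
        {
            resultBuffer.Release();
        }
    }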
     
  13. ModLunar

    ModLunar

    Joined:
    Oct 16, 2016
    Posts:
    374
    @richardkettlewell Apologies if that came off sounding rude, thanks so much for your explanation, that helps a lot! :)

    This would be really great info for the docs page on ComputeBuffer.GetData if possible, I'll leave a suggestion on that page as well.
     
    Opeth001 and richardkettlewell like this.
  14. Darren-R

    Darren-R

    Joined:
    Feb 11, 2014
    Posts:
    66
    Thanks for the reply @richardkettlewell, I've found it extremely difficult to find info on compute shaders, it's like researching the holy grail :), so thank you Richard!
     
    richardkettlewell and ModLunar like this.
  15. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,285
    I'm sorting out getting this added to the docs :)
     
    sstrong and ModLunar like this.
  16. talofen

    talofen

    Joined:
    Jan 1, 2019
    Posts:
    40
    Is there a way to know whether the ComputeShader kernel has completed, without issuing a GetData() call?
    I'm asking because I have a compute shader that needs to be called thousands of times per second, so I'm calling Dispatch() in FixedUpdate(), inside a loop:

    Code (CSharp):

    for (int i = 0; i < numberOfStepsPerFixedUpdate; i++)
    {
        shader.Dispatch(khUpdateSimulation, Mathf.CeilToInt((float)vertices.Length / THREADGROUP_SIZE), 1, 1);
    }
    The problem is that I want to run as many loop iterations as possible by changing numberOfStepsPerFixedUpdate dynamically, but I need to check that I'm not issuing more iterations than the GPU is able to process. I tried many workarounds, but none looks perfect:

    1. Adding a dummy.GetData() call to retrieve a dummy buffer I don't actually need, and timing how long the loop has run for. If too long, reduce the iterations; if too little, increase them.
    2. Monitoring the frame rate, and lowering the loop count if the frame rate goes down.

    But idea number 1 adds a very costly (performance-wise) GetData() call that I really don't need... not a good solution.
    Idea number 2 does not really work well, because if the frame rate goes down for whatever reason (CPU?) I get an unwanted reduction of the loop count.

    I would need something that tells me "okay, now your dispatch calls have completed" without a performance hit...
    Any ideas?
    Thank you

    I forgot: at this time, the target platform is Windows and the API is DirectX11.
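    One possible workaround, sketched here purely as an illustration (it is not from this thread, and the kernel and buffer names below are placeholders): after queuing a batch of dispatches, request an async readback of just the first 4 bytes of a buffer the kernel writes. Since GPU commands complete in order, once that tiny readback reports done, the dispatches queued before it have finished too, and only 4 bytes ever travel back to the CPU.

    Code (CSharp):

    using UnityEngine;
    using UnityEngine.Rendering;

    public class DispatchThrottleExample : MonoBehaviour
    {
        public ComputeShader shader;              // assumed to contain an "UpdateSimulation" kernel
        public int stepsPerFixedUpdate = 8;

        ComputeBuffer simulationBuffer;            // placeholder for whatever the kernel writes
        AsyncGPUReadbackRequest fenceRequest;
        bool waiting;
        int kernel;

        void OnEnable()
        {
            simulationBuffer = new ComputeBuffer(1024, sizeof(float));
            kernel = shader.FindKernel("UpdateSimulation");
            shader.SetBuffer(kernel, "simulationBuffer", simulationBuffer);
        }

        void FixedUpdate()
        {
            // If the previous batch has not been reported as finished yet,
            // skip this step instead of piling even more work onto the GPU queue.
            if (waiting && !fenceRequest.done)
                return;
            waiting = false;

            for (int i = 0; i < stepsPerFixedUpdate; i++)
                shader.Dispatch(kernel, 1024 / 64, 1, 1);   // assumes numthreads(64,1,1)

            // Read back only the first 4 bytes, purely as a completion marker:
            // when this request is done, the dispatches queued before it are done too.
            fenceRequest = AsyncGPUReadback.Request(simulationBuffer, 4, 0);
            waiting = true;
        }

        void OnDisable()
        {
            simulationBuffer.Release();
        }
    }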
     
    Last edited: Sep 2, 2020
  17. methusalah999

    methusalah999

    Joined:
    May 22, 2017
    Posts:
    643
    I don't know of any way to know whether a dispatch has been completed on the GPU. GetData won't let you know that either, because it is slow.

    I don't know exactly what you are trying to achieve, but if you try to maximize the dispatch iterations while keeping a target fps, you can run a growing number of iterations and stop growing it when the frame rate gets to the target, including a margin. It will only work if your compute shader threads have little execution divergence, that is. Also, your graphics card will melt your computer ^^
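    A minimal sketch of that grow-until-the-target idea (the kernel name, group counts and frame budget below are made up): increase the per-frame iteration count while the frame time stays under budget, and back off when it goes over.

    Code (CSharp):

    using UnityEngine;

    public class AdaptiveDispatchCount : MonoBehaviour
    {
        public ComputeShader shader;          // assumed to contain a "CSMain" kernel
        public float frameBudgetMs = 20f;     // target frame time, including a margin

        int iterations = 1;
        int kernel;

        void Start()
        {
            kernel = shader.FindKernel("CSMain");
        }

        void Update()
        {
            // Grow the batch while we are under budget, back off when we are over it.
            float lastFrameMs = Time.unscaledDeltaTime * 1000f;
            if (lastFrameMs < frameBudgetMs)
                iterations++;
            else
                iterations = Mathf.Max(1, iterations - 1);

            for (int i = 0; i < iterations; i++)
                shader.Dispatch(kernel, 64, 1, 1);
        }
    }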
     
  18. kadd11

    kadd11

    Joined:
    Mar 11, 2018
    Posts:
    33
    Sorry to necro, but a somewhat related question:

    @richardkettlewell , would you mind answering/confirming a couple of GPU-side ordering questions/hypotheses?
    1. If I call ComputeShader.Dispatch and use a buffer written to by that compute shader in a standard rendering shader (invoked by Graphics.DrawMeshX), that compute is guaranteed to finish before the draw call happens, is that correct? (Basing this on the GraphicsFence docs, specifically "GPUFences do not need to be used to synchronise a GPU task writing to a resource that will be read as an input by another".)
    2. If I dispatch a compute shader kernel several times in a row via CommandBuffer.Dispatch, and all of the dispatches write to the same AppendStructuredBuffer, is each dispatch guaranteed to finish before the next one runs?
     
    ModLunar likes this.
  19. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,285
    1. Yes
    2. Yes

    :)
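    For illustration, a small sketch of the second scenario (kernel, buffer and property names are placeholders, not from this thread): several dispatches recorded into one CommandBuffer, followed by a draw that consumes the append buffer they all write to. Because the GPU executes the recorded commands in order, the draw sees the output of every dispatch.

    Code (CSharp):

    using UnityEngine;
    using UnityEngine.Rendering;

    public class OrderedDispatchExample : MonoBehaviour
    {
        public ComputeShader compute;   // assumed to have a "CSMain" kernel appending to "appendBuffer"
        public Material material;       // assumed to read "appendBuffer" in its shader
        public Mesh mesh;

        ComputeBuffer appendBuffer;

        void Start()
        {
            appendBuffer = new ComputeBuffer(4096, sizeof(float) * 3, ComputeBufferType.Append);
            appendBuffer.SetCounterValue(0);
            int kernel = compute.FindKernel("CSMain");
            material.SetBuffer("appendBuffer", appendBuffer);

            var cmd = new CommandBuffer { name = "Ordered dispatches" };
            for (int i = 0; i < 4; i++)
            {
                cmd.SetComputeBufferParam(compute, kernel, "appendBuffer", appendBuffer);
                cmd.DispatchCompute(compute, kernel, 8, 8, 1);   // each dispatch completes before the next starts
            }
            cmd.DrawMesh(mesh, Matrix4x4.identity, material);     // sees everything the dispatches appended
            Graphics.ExecuteCommandBuffer(cmd);
        }

        void OnDestroy()
        {
            appendBuffer.Release();
        }
    }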
     
    Subcreation, ModLunar and kadd11 like this.
  20. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,285
    ModLunar likes this.
  21. kadd11

    kadd11

    Joined:
    Mar 11, 2018
    Posts:
    33
    Short and sweet, thanks!
     
    ModLunar and richardkettlewell like this.
  22. kadd11

    kadd11

    Joined:
    Mar 11, 2018
    Posts:
    33
    Actually, one quick follow-up @richardkettlewell related to render and compute shader order: Is it kosher to schedule a compute shader mid-render-pipeline (for example using a command buffer with CameraEvent.BeforeForwardAlpha)? And does that answer change for tile-based GPUs?

    Asking out of general curiosity because I'd like to better understand the relationship between the rendering and compute queue/hardware, but also admittedly I've had a day full of fun debugging that exact scenario on a Quest 2. Specifically, scheduling a compute shader in that way causes issues with the content rendered after the camera event, even if the following draw calls have no dependency on the compute shader output (i.e., a simple unlit color shader). Either a) the following draw calls don't render at all (which seems to happen in single pass) or b) the tiles flicker in and out randomly (which happens in multi pass).

    If I take a capture with RenderDoc (even Oculus' fork of it), everything looks good in the capture, and it also works fine if I use Oculus Link. But running on the Quest directly has these issues, and I'm not sure if interleaving draw calls and compute shaders just isn't safe to do but happens to be handled well on the few other platforms I've tested on, or if it's a bug somewhere down the stack.
     
  23. DominiqueSandoz

    DominiqueSandoz

    Joined:
    Aug 29, 2017
    Posts:
    25
    I found this simple answer clarifying as hell for my own question. To further clarify: with 2. you are saying that the order of dispatches is guaranteed. Is this only guaranteed when using CommandBuffers, or also when simply issuing separate Dispatch calls? Because it seems so, although I am not sure why.

    Consider the following code:
    Code (csharp):

    void Run(Texture3D inTex, RenderTexture outTex, ComputeShader compute)
    {
        var kernelA = compute.FindKernel("Reset");
        compute.SetTexture(kernelA, "ReadTexture", inTex);
        compute.SetTexture(kernelA, "WriteTexture", outTex);
        compute.Dispatch(kernelA, 64, 64, 64);
        Graphics.CopyTexture(outTex, inTex);

        var kernelB = compute.FindKernel("Iterative");

        for (int i = 0; i < 10; i++)
        {
            compute.SetTexture(kernelB, "ReadTexture", inTex);
            compute.SetTexture(kernelB, "WriteTexture", outTex);
            compute.Dispatch(kernelB, 64, 64, 64);
            Graphics.CopyTexture(outTex, inTex);
        }
    }
    It "seems" to run fine - can I rely in this situation that all Invocations work with the results from the previous invocation? If yes, why?
     
  24. methusalah999

    methusalah999

    Joined:
    May 22, 2017
    Posts:
    643
    Basically, everything that is asked of the GPU is added to a queue and the GPU will execute the queue in order. That is true for any get, set, dispatch, copy, render, etc. I don't know of any GPU racing situation with the Unity API (which is great).
     
    Subcreation likes this.
  25. kadd11

    kadd11

    Joined:
    Mar 11, 2018
    Posts:
    33
    I want to add clarification as well, because the statement that Unity handles the dependencies and everything runs in the order specified tripped me up for a while on tile based GPUs. I don't think it's Unity's fault, I think it's just a result of how tile GPUs work (but I could be wrong). My scenario was:

    - Draw call which writes to an AppendStructuredBuffer
    - Compute dispatch: consumes the AppendStructuredBuffer from the previous draw call and outputs some data into another structured buffer
    - Draw call consumes the structured buffer from the compute shader

    While this worked on my discrete GPU, it did not work on the tile-based GPUs that I tried. I had misunderstood how tile-based GPUs work. I thought they went through each draw call and rendered it tile by tile. Instead, they go through each tile and render all draw calls for that tile before moving on to the next tile. Which means, without explicitly adding a break somehow (like saving off the results of the first draw call to a render texture, running the compute, and then continuing the subsequent draw calls), I don't think there's a way to interleave draw calls and compute dispatches on tile-based GPUs.

    Maybe this is obvious to those who know, but wanted to mention it in case anyone else hits a similar issue. Also, feel free to correct me if I'm wrong.
     
    KrabbyQ likes this.
  26. DominiqueSandoz

    DominiqueSandoz

    Joined:
    Aug 29, 2017
    Posts:
    25
    Thank you so much. If this is the case, it is quite awesome. From what I understand now, there is one queue on the GPU where all commands get added and are executed in order.

    Does that mean then:

    1. There is exactly one command at a time running on the GPU, and the next only starts once this one has finished completely?
    2. If that is true, why does a long-running compute shader not interrupt the scene rendering?

    For 2., I get the feeling from my tests that running compute shaders is somehow "free", as it doesn't seem to impact my scene FPS at all and completes silently after a few frames (using AsyncGPUReadback). Is this me hallucinating?
     
  27. methusalah999

    methusalah999

    Joined:
    May 22, 2017
    Posts:
    643
    I wouldn't say that the GPU executes everything in order on its side, because I simply don't know. But as far as I know, you just can't produce a race condition. Data modified by your compute shader can't be read by the rendering shader before the compute shader finishes execution, if you dispatch it before the rendering.

    The rendering will "wait" for your dispatch to finish. It is not exactly waiting, because the rendering and the compute shader are using the same computational resources, so it is more accurate to say that the rendering will be delayed by your compute shader execution. To my knowledge, there is no way to run a compute shader "asynchronously". Unintuitively, a GPU does not have multiple pipelines of execution like a CPU and is not multi-threaded in the same way. A GPU executes a batch of the same computation at once and waits for it to finish before getting to the next batch.

    So, when you observe that the result of your compute shader is available only a few frames after its dispatch, there are multiple things to consider:
    • if the compute shader is dispatched during frame "n" (in Update, LateUpdate, PreCull, etc.), the rendering of frame "n" won't occur before it finishes.
    • if your GPU runs at 30 fps because of long rendering or compute shaders, and the CPU runs at 60 fps, then the CPU code will wait for the GPU to finish. In the Profiler, you will see a line "GFXWaitForRenderThread" that lowers your CPU fps and makes it wait for the GPU.
    • when the CPU waits for the GPU, it is one logic frame ahead. The CPU will only wait for the GPU to finish the rendering of the previous image, so it can run the logic of frame "n+1" while the GPU is rendering and drawing the image of frame "n" on the screen. If you think about it, the CPU "could" compute multiple logic frames ahead in this situation, but many computations need to know the duration of the previous frame (Time.deltaTime), so it has to wait.
    • when you use AsyncGPUReadback, you ask the GPU to send data back to CPU memory after the compute shader is finished. This is useful to avoid waiting for the GPU in the middle of your logic, and to keep the CPU one logic frame ahead if it is fast enough. So the data of frame "n" will most likely be computed by the GPU at some point during the "n+1" CPU logic frame.
    • most importantly with AsyncGPUReadback, the buffer will be written asynchronously on the CPU side. The hardware data bus between the CPU and the GPU is very fast for downloading (CPU => GPU) and very slow for uploading (GPU => CPU). This is due to the fact that, by design, a GPU only consumes data from the game logic and sends the result straight to the screen. GPUs are now often used for computations whose results need to go back to the logic (GPGPU), but the download direction is still the more important one in most situations. So, depending on the size of the data, the transfer may take several frames.
    All of this is only my personal knowledge and should be taken as such ^^
     
    bb8_1 and Subcreation like this.
  28. DominiqueSandoz

    DominiqueSandoz

    Joined:
    Aug 29, 2017
    Posts:
    25
    Holy. I had _no_ idea... and it also means that GPUs are freakishly fast (at suitable tasks), but also that we're effectively dealing with a single core in terms of parallelism of tasks when looking at the GPU. Really unintuitive.

    I thank you very much for these insights, extremely valuable. Following your logic, I would also conclude that compute shaders, no matter how big, are guaranteed to be done by the next frame, while transferring the data to the CPU can span several frames?
     
  29. methusalah999

    methusalah999

    Joined:
    May 22, 2017
    Posts:
    643
    Compute shader kernels that are dispatched during frame "n" are guaranteed to be executed before the rendering of frame "n", which is fortunate because the frame to render generally depends on the compute shader result. Of course, at that moment, the result of the kernels is only available in buffers stored in GPU memory, not on the CPU side.

    If your kernels are too slow, then of course you will delay the rendering and impact your FPS.

    As for the transfer of data from GPU memory to CPU memory: if you ask for it asynchronously, then it can be ready any number of frames in the future, depending on the volume of data to transfer. If you ask for the data synchronously, with buffer.GetData for example, then you will suffer a double delay. First, your CPU code will halt and wait until the GPU has executed its whole queue up to this point. Second, the CPU code will have to wait again for the data to be uploaded from GPU memory, which is quite slow.

    If you must do that, you should at least do it as late as possible in the CPU frame logic (LateUpdate, PreCull and such, not in Update), so there are as few tasks as possible remaining in the GPU queue to wait for.
     
  30. JJRivers

    JJRivers

    Joined:
    Oct 16, 2018
    Posts:
    137
    That's one way to think of it. They are SIMD processors, which is an entirely different beast from a modern CPU architecture; if you want a deeper understanding, googling something like "CPU vs SIMD" can help you a lot.

    So yes, it's essentially a hugely wide single core, but only in terms of the wavefront, which is processed in hardware-vendor-specific sizes (commonly 32 on NVIDIA and 64 on AMD). Generally, when you're not doing wildly random access of memory, it's best to keep numthreads as even groupings of those sizes where possible (not just multiples, as is often implied).

    The parallelism comes from the fact that there are thousands of blocks of these wavefronts (also known as warps), and the GPU tries to schedule them in a manner that maintains maximum occupancy.

    Example with hypothetical numbers: the GPU has 128 warps available. You could schedule 64 of those for one compute shader, and at the same time it could schedule 64 other compute shaders of one warp each, provided they do not rely on the first, 64-warp one as a dependency. (At the low-level API level you could disobey that, but that's really bad mojo, and it's why Unity not only guarantees but enforces that Dispatch() calls for two dependent shaders are executed in order.)

    I'm fairly junior with compute shaders too, but I believe at least most of what I said here is correct. If you do spot an error, please correct me soonest!
     
    DominiqueSandoz likes this.
  31. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    6,364
    Thanks for that!
    Is this still gospel with all the new graphics APIs of 2021+?
     
  32. BoltScripts

    BoltScripts

    Joined:
    Feb 12, 2015
    Posts:
    20
    One thing that slightly confuses me is, in the situation of dispatching a shader several times in a loop, how does changing a variable in the shader work with that?
    Is something like SetInt simply queued the same way as a Dispatch call?
     
  33. methusalah999

    methusalah999

    Joined:
    May 22, 2017
    Posts:
    643
    Everything is queued altogether indeed, the setting and the dispatching.
     
    Subcreation likes this.
  34. BoltScripts

    BoltScripts

    Joined:
    Feb 12, 2015
    Posts:
    20
    Alright, sick. Everything makes sense and is as it should be.
    And with that, I think this is the definitive thread to answer all the questions I had about how gpu compute scheduling works. :)
     
    Subcreation and methusalah999 like this.
  35. BoltScripts

    BoltScripts

    Joined:
    Feb 12, 2015
    Posts:
    20
    Hate to be back here, but I do seem to have encountered an issue; it looks like a bug.
    Code (CSharp):

    ComputeBuffer countBuff = new ComputeBuffer(gMesh.chunks.Length, sizeof(int));

    for (int i = 0; i < gMesh.chunks.Length; i++) {
        var chunk = gMesh.chunks[i];
        int dispatchCount = chunk.terrainTriCount * gMesh.instanceCount;

        compute.SetVector(_chunkPosID, chunk.chunkPos);
        compute.SetInt(chunkID, i);
        compute.SetInt(dispatchCountID, dispatchCount);

        posKernel.DispatchByCount(dispatchCount);

        ComputeBuffer.CopyCount(posBuffer, countBuff, i * sizeof(int));
    }
    In this I just need to get the actual count of appended instances per chunk. This code works fine on my PC, but if I try it on my phone, it seems to be massively out of sync or something and produces results like this:
    [Attached screenshot of the logged per-chunk counts]

    It works fine on mobile as well if I use GetData after every dispatch, but that obviously is not ideal.
    Am I missing something big here?
    One weird thing that makes me feel like it's my fault somehow is that the values are consistent between runs: it's always 157496 and it always switches at index 74.
    For now I've just solved this by doing an interlocked add to get the counts, which works without needing CopyCount, but it still seems like this was a bug.
     
    Last edited: Feb 24, 2023