
Bug DrawProceduralIndirect incorrect stride on AMD GPUs for SV_VertexID indexed StructuredBuffers

Discussion in 'General Graphics' started by benoneal, Nov 17, 2022.

  1. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    I developed an app on an Nvidia GPU which generates a procedural mesh and renders it to a custom shader using Graphics.DrawProceduralIndirect. The code for that call is simple:

    Code (CSharp):
    Graphics.DrawProceduralIndirect(
      terrainMaterial,
      bounds,
      MeshTopology.Triangles, meshArgsBuffer, 0,
      null, null,
      ShadowCastingMode.On, true, gameObject.layer
    );
    However, users of my app with an AMD GPU (seems to be all models at this point) all report the same bug: the mesh is completely glitched out. I tried debugging this for a few days but without being able to reproduce it on my machine, I was stabbing in the dark. So I bought an AMD RX580, and immediately was able to reproduce the bug.

    On my Nvidia GPU, it looks like this:
    upload_2022-11-17_22-0-31.png

    But on my AMD GPU (same system, I just swapped out the graphics card), it looks like this:
    upload_2022-11-17_22-2-35.png

    Using Renderdoc, I eventually traced down the issue, which confirmed what it looked like: for some reason, AMD GPUs are putting together triangles from the wrong vertices. Specifically, instead of grabbing vertices from the StructuredBuffer in sequential groups of 3 ([0,1,2], [3,4,5] etc), it's grabbing every third vertex ([0,3,6], [9,12,15], and so on). Here's the mesh buffer vertex data from Renderdoc:

    upload_2022-11-17_22-6-49.png

    And here's the Vertex Shader indexed order:

    upload_2022-11-17_22-7-52.png

    As you can see, it's stitching together every third indexed element from my input StructuredBuffer. And since the mesh is populated as an AppendStructuredBuffer, order is inconsistent every frame. Originally I was thinking of refactoring my compute shaders to not use Append(), but that actually wouldn't make a difference if the SV_VertexID was wrong for AMD cards anyway.

    My guess is that the "stride" is set incorrectly somewhere, but TBH I can't find any information about this, and every example I can find looks exactly like my code (which is probably why it works flawlessly on my Nvidia GPU).
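    One plausible model of this failure mode (purely my assumption for illustration, not confirmed driver behaviour): if the vertex fetch offset is computed from the C#-side buffer stride (one 144-byte Triangle) while the shader declares a 48-byte Vertex, each SV_VertexID advances by a whole triangle, which reproduces the observed pattern exactly:

```python
# Hypothetical model of the observed indexing bug (an assumption for
# illustration, not confirmed driver behaviour): the vertex fetch uses the
# C#-side buffer stride (one Triangle = 3 verts * 3 float4s * 16 bytes = 144)
# while the shader expects a 48-byte Vertex, so each SV_VertexID advances
# by a whole triangle.

VERT_SIZE = 4 * 4 * 3        # three float4s = 48 bytes
TRI_SIZE = VERT_SIZE * 3     # 144 bytes

def fetched_vertex(vertex_id, buffer_stride, vert_size=VERT_SIZE):
    """Which vertex slot a given SV_VertexID actually reads, if the
    fetch offset is computed as id * buffer_stride."""
    return (vertex_id * buffer_stride) // vert_size

# Correct stride (48): triangles come out as [0,1,2], [3,4,5], ...
correct = [fetched_vertex(i, VERT_SIZE) for i in range(6)]
# Triangle-sized stride (144): every third vertex, [0,3,6], [9,12,15], ...
buggy = [fetched_vertex(i, TRI_SIZE) for i in range(6)]

print(correct)  # [0, 1, 2, 3, 4, 5]
print(buggy)    # [0, 3, 6, 9, 12, 15]
```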

    My custom vert/frag shader looks like this:

    Code (HLSL):
    Shader "Custom/Planet" {
      SubShader {
        Tags { "RenderType"="Opaque" "LightMode"="ForwardBase" }
        Cull back
        ZWrite On

        Pass {
          CGPROGRAM
          #pragma vertex vert
          #pragma fragment frag
          #pragma target 4.5

          #include "UnityCG.cginc"
          #include "Includes/Constants.hlsl"
          #include "Includes/Utils.hlsl"
          #include "Includes/Noise.hlsl"
          #include "Visualise.hlsl"

          struct Vertex {
            float4 pos;
            float4 hmec;
            float4 crfs;
          };

          struct v2f {
            float4 vertex: SV_POSITION;
            float4 world: TEXCOORD0;
            float4 hmec: TEXCOORD1;
            float4 crfs: TEXCOORD2;
          };

          StructuredBuffer<Vertex> mesh;

          float3 sun_dir;
          float4 sun_col;
          float3 view;
          float equidistant;
          int render_ocean;
          int visualisation;

          v2f vert (uint id : SV_VertexID) {
            v2f o;
            o.world = mesh[id].pos;
            o.vertex = UnityObjectToClipPos(o.world);
            o.hmec = mesh[id].hmec;
            o.crfs = mesh[id].crfs;
            return o;
          }

          float4 frag (v2f i) : SV_TARGET {
            // omitted as irrelevant
          }
          ENDCG
        }
      }
      Fallback "Diffuse"
    }
     
  2. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,578
    How are you creating the structured buffer "mesh" in C#? I doubt the vertex ID is incorrect; it seems it could be a stride problem when reading the data itself.
     
  3. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    I'm creating the mesh in a compute shader, hence DrawProceduralIndirect. The shaders to create the mesh are a bit too complicated to paste here and split over several kernels (as there are many non-mesh related steps to build up the data required), but the types are:
    Code (HLSL):
    struct Vert {
      float4 pos;
      float4 hmec; // data for frag shader
      float4 crfs; // data for frag shader
    };

    struct Triangle {
      Vert a;
      Vert b;
      Vert c;
    };

    AppendStructuredBuffer<Triangle> mesh;
    RWStructuredBuffer<uint> indirect_mesh_args;

    // buffer filled via
    mesh.Append(triangle_a);
    mesh.Append(triangle_b);
    InterlockedAdd(indirect_mesh_args[0], 6);
    As a stab in the dark to test the stride theory, I changed my buffer allocation in C# from:
    Code (CSharp):
    meshBuffer = new ComputeBuffer(verts, sizeof(float)*4*3*3, ComputeBufferType.Append);
    to:
    Code (CSharp):
    meshBuffer = new ComputeBuffer(verts, sizeof(float)*4*3, ComputeBufferType.Append);
    but it appeared to make no difference.
     
    Last edited: Nov 17, 2022
  4. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,578
    Does it change anything if you do it like this?

    Code (CSharp):
    meshBuffer = new ComputeBuffer(verts * 4 * 3, sizeof(float), ComputeBufferType.Append);
     
  5. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    Yeah, if I change it like that it completely screws the buffer and the output is:
    verts x 4x3x3 render.jpg

    From renderdoc, I grabbed the result of my earlier attempt to fudge the stride, and the results are below:
    verts x 3 buffer.jpg verts x 3 VS.jpg

    As you can see here, the first triangle is correct, in that the vertex ordering is [0,1,2] from the StructuredBuffer, but from then on, it's broken because it's "striping" previous vertices like [0,1,2], [1,2,3], [2,3,4], [3,4,5] and so on. But when it reads each vertex in the shader, it appears to be correctly aliasing the full byte width to get all three float4s per vertex.

    This is so frustrating. It seems like the stride is being set to 3x what it should be at some level under the hood.
     
  6. kyriew

    kyriew

    Joined:
    Sep 10, 2019
    Posts:
    7
    When I use this API, some meshes will have errors while others will not. I have no idea what happened.
     
  7. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    My understanding of RenderDoc is that it intercepts all draw calls and uses its own software renderer to play back the capture. This means, if you take a capture on the NVidia card and one on the AMD card, there should be a difference somewhere. If you open the AMD capture on a NVidia card, the capture result should be buggy as well. It's not a driver bug if you see it in RenderDoc, as far as I am aware.

    Find out where the difference is - maybe in the DrawIndexedInstancedIndirect direct or indirect parameters. It is also possible that you forgot to set some value in the command list which is being inherited from whatever happens to be rendered before the DrawProceduralIndirect call.
     
  8. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    Not sure I understand. I've been using the same AMD card to troubleshoot since I've received it. All captures from Renderdoc have been made from the AMD card, and the Renderdoc diagnosis aligns and explains why the triangles are all over the place in unity.

    Summarised, the mesh building and rendering methods look like this:
    Code (CSharp):
    void GenerateMesh() {
      int kernel = terrainShader.FindKernel("build_mesh");

      meshBuffer.SetCounterValue(0);
      command.SetComputeBufferParam(terrainShader, kernel, Shader.PropertyToID("tiles"), tileBuffer);
      command.SetComputeBufferParam(terrainShader, kernel, Shader.PropertyToID("mesh"), meshBuffer);
      indirectMeshArgsBuffer.SetData(new int[]{0,1,0,0});
      command.SetComputeBufferParam(terrainShader, kernel, Shader.PropertyToID("indirect_mesh_args"), indirectMeshArgsBuffer);

      command.DispatchCompute(terrainShader, kernel, tileArgsBuffer, 0);
      Graphics.ExecuteCommandBuffer(command);
      command.Clear();
    }

    void RenderMesh() {
      // irrelevant shader vars omitted for brevity
      terrainMaterial.SetBuffer("mesh", meshBuffer);

      bounds = new Bounds(transform.position, Vector3.one * radius);

      Graphics.DrawProceduralIndirect(
        terrainMaterial,
        bounds,
        MeshTopology.Triangles, indirectMeshArgsBuffer, 0,
        null, null,
        ShadowCastingMode.On, true, gameObject.layer
      );
    }
    As above, the command buffer is cleared immediately before the RenderMesh method, though I doubt it would interfere as the rendering call doesn't use the command buffer (I could never get that to actually work on any card).
     
  9. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    What I am trying to say is: my hypothesis is that it is not an AMD driver bug. If you save the capture as a file and open the file on a system with an NVidia card, you should also see the bug. This means there must be a command in the capture that is causing this - could be a wrong stride, as you are assuming.

    By comparing a capture from an NVidia card and one from the AMD card, you should be able to find out exactly where the difference is (which command or parameter is different). It might not be your fault (could be a Unity bug), but at least you'd know exactly which parameter is wrong.

    Just because a command buffer is empty, it doesn't mean that all the state is reset to default values. Some state is inherited from whatever other command buffers were executed before yours (for example CommandBuffer.SetInvertCulling). I can't think of any state that would cause this particular issue, though.
     
    Last edited: Nov 18, 2022
  10. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    Thanks for clarifying c0d3_m0nk3y. I think I get what you mean now. So I saved a capture on the AMD card of the buggy behaviour, and again confirmed the same issue present in the vertex shader (stepping through indices of the StructuredBuffer in patterns of [0,3,6], [9,12,15], [18,21,24] etc.).

    Then I switched out the AMD card for my NVIDIA one, and re-opened that saved AMD capture in Renderdoc, and wouldn't you know it, it renders exactly as it should in Renderdoc:
    AMD buffer on NVIDIA.jpg
    upload_2022-11-20_14-8-51.png

    I'm far from an expert in any of this, but to my eyes, if playing an AMD capture on NVIDIA results in the correct non-buggy output, then this looks like an AMD driver issue (despite the affected cards being DirectX 11 compliant, and despite my using the latest drivers for a 5 1/2 year old card), or some way in which Unity is failing to accommodate AMD-specific nuances.

    I'm really not sure what more I can do at this point.
     
  11. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    Wow, did not expect this. Looks like an AMD driver issue indeed.

    There is still a small chance that you can fix it: I've had a case where the NVidia driver silently tolerated a wrong stride (when it could figure things out itself) and AMD cards didn't. I think it was the stride that you pass to the ComputeBuffer constructor, IIRC.

    Try playing with the ComputeBuffer stride value when you have the AMD card installed. Also keep in mind that there might be padding. Buffer elements are aligned/padded according to what OpenGL calls std140 rules:
    https://www.oreilly.com/library/view/opengl-programming-guide/9780132748445/app09lev1sec2.html
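    For what it's worth, here is a quick back-of-envelope check (my own sketch, not Unity code) suggesting the structs in this thread need no padding under an "align the struct to its biggest member" rule, since every member is a 16-byte float4:

```python
# Illustrative padding check for the Vert/Triangle structs in this thread,
# assuming the "align struct to its biggest member" rule discussed above.
FLOAT4 = 16  # bytes

def padded_size(member_sizes, alignment):
    """Size of a struct whose members are tightly packed, rounded up to
    the struct's alignment."""
    raw = sum(member_sizes)
    return (raw + alignment - 1) // alignment * alignment

vert = padded_size([FLOAT4] * 3, FLOAT4)    # pos, hmec, crfs
triangle = padded_size([vert] * 3, FLOAT4)  # a, b, c

print(vert)      # 48 -> already a multiple of 16, so no padding
print(triangle)  # 144
```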
     
    Last edited: Nov 20, 2022
  12. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,578
    RenderDoc records the commands sent to the GPU plus the contents of buffers uploaded from CPU to GPU, it doesn't store the result of those commands. Everything is run on the GPU you are using to replay the capture.

    AFAIK "software" emulation is limited to shader debugging, where the execution of a shader thread is simulated by RenderDoc itself, which is why it can be very limited in certain scenarios like compute shaders using groupshared memory.

    I still think something is amiss with stride. You can output the vertexId itself to confirm.
     
  13. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    If you mean this one:
    Code (CSharp):
    meshBuffer = new ComputeBuffer(verts, sizeof(float)*4*3*3, ComputeBufferType.Append);
    Then yeah, I've already tried a few variants of that to fudge the stride while maintaining total buffer size, with results earlier in this thread. While it did impact the VS output, the "best" outcome was that with a one-third stride of float*4*3 (instead of float*4*3*3), it rendered the first triangle correctly (indexing [0,1,2]), but all subsequent triangles were only offset by 1 from the initial index (so the next triangle was [1,2,3], followed by [2,3,4]). Meanwhile, each vertex in the vertex shader was correctly aliasing the full 3*float4 width defined in the buffer type, which I guess makes sense because the type is defined in the shader.

    Yeah I think that was c0d3_m0nk3y's point: to confirm that the commands themselves are not incorrectly being set or different somehow when running on an AMD card. So we've ruled that out, which includes any command to set the stride from C#-land which may have been different when run on the AMD.

    So given that the commands are identical and the results are different, it's gotta be something lower down in the driver level or something.

    I'll swap back over to the AMD card and stick the SV_VertexID in a varying or something and post results tomorrow. What are the scenarios we're expecting though? If the ID is sequential but the indexing into the buffer is off, then it's a stride thing I guess, but if the IDs are not sequential for some reason, and are instead actually jumping in threes, then it's some weird other bug? What would that even mean?
     
    Last edited: Nov 20, 2022
  14. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,578
    There are a few possibilities:

    - Unity is creating the buffer in a way that isn't correct but the Nvidia drivers shrug it off.
    - There is an issue with Unity and AMD drivers
    - There's an issue with AMD drivers.
     
  15. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    (Still haven't had a chance to output the SV_VertexID for testing, but just been thinking...)

    The thing that makes no sense is that there are clearly 2 different "strides". One is the vertex stride, and the second is the triangle stride. There has to be, because when buffer stride in C# land is set to float4*3*3 the AMD card has a vertex stride 3x too wide, and a triangle stride that is sequential, giving triangles like this: [0,3,6],[9,12,15],[18,21,24].

    But when C# stride is set to float4*3, one-third what it really *should* be, vertex stride is correct on the AMD, but triangle stride is one-third too short, giving triangles like this: [0,1,2],[1,2,3],[2,3,4].

    This doesn't really help me narrow the problem down at all but it's really weird.
     
  16. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    The std140 rules say "Structure alignment will be the alignment of the biggest structure member".

    I don't think there is any padding in your case but you could try making a buffer of Verts instead of a buffer of Triangles:
    Code (HLSL):
    struct Vert {
      float4 pos;
      float4 hmec; // data for frag shader
      float4 crfs; // data for frag shader
    };

    struct Triangle {
      Vert a;
      Vert b;
      Vert c;
    };

    AppendStructuredBuffer<Vert> mesh; // Vert instead of Triangle

    RWStructuredBuffer<uint> indirect_mesh_args;

    // buffer filled via
    mesh.Append(triangle_a.a);
    mesh.Append(triangle_a.b);
    mesh.Append(triangle_a.c);
    mesh.Append(triangle_b.a);
    mesh.Append(triangle_b.b);
    mesh.Append(triangle_b.c);
    InterlockedAdd(indirect_mesh_args[0], 6);
    along with
    Code (CSharp):
    meshBuffer = new ComputeBuffer(verts, sizeof(float)*4*3, ComputeBufferType.Append);
    Update: Actually, that's probably not safe because there is no guarantee that the shader doesn't get interrupted between Append calls.

    Alternatively, you could try inlining the Vert structs just to be sure that there is no padding:
    Code (HLSL):
    struct Triangle {
      float4 pos0;
      float4 hmec0;
      float4 crfs0;
      float4 pos1;
      float4 hmec1;
      float4 crfs1;
      float4 pos2;
      float4 hmec2;
      float4 crfs2;
    };
     
    Last edited: Nov 20, 2022
  17. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    Yeah I had an earlier debugging attempt to append Verts instead of Triangles and immediately ran into the inevitable race-conditioned out-of-order Verts and suddenly remembered why I was appending full triangles in the first place lol.

    I'll give the inlined verts a try when I get home tonight.
     
  18. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    If anyone finds this thread in future, here's the update/resolution: Unity+AMD == not a good time for GPU-based procedural generation.

    Small note: there was nothing interesting in outputting the ids; they were sequential as expected. And inlining the verts also didn't seem to make any difference, so it likely wasn't a padding issue either.

    The first issue is the stride one. In order to generate dynamic procedural meshes (like for tessellation or LOD), you don't really know precisely how many vertices you'll have. So the obvious solution is to use an AppendStructuredBuffer, otherwise you'll have a sparse mesh which leaves performance on the floor processing malformed triangles (or worse, old irrelevant triangles that haven't been removed from the buffer). But with Append, you have to deal with race conditions, so you set the stride to three verts worth and Append them three at a time like buffer.Append(Triangle). Then when you render them you tell the vertex shader that the buffer contains verts, like StructuredBuffer<Vert>, and now they're all in order and tightly packed.

    Well, forget that with Unity+AMD, because it ignores the data type in the vert shader and trusts the CPU-side stride instead. If you shrink the stride in Unity-land everything goes weird, and you can't change the stride or count of an existing buffer, so you can't adjust them for rendering. You could probably create some compute shaders to reduce a sparse buffer down to a tightly packed one, or copy a <Triangle> buffer into a <Vert>-strided buffer, but that's gross.

    My solution here was to engineer a tightly packed buffer which necessitated a large refactor of how I was generating the mesh and leaves me with NFI how I'm going to implement variable LOD based tessellation down the track.

    The second issue is that all my testing leads me to the conclusion that Graphics.DrawProceduralIndirect just flat out does not work on AMD cards. I developed my solution in a new project just to isolate the variables, and even with the simplest mesh, Graphics.DrawProcedural worked, while Graphics.DrawProceduralIndirect never did, even when filling the args_buffer on the CPU side with a known number of vertices. It's just broken, like all the CommandBuffer.DrawX methods seem to be.

    The third is that in compute/vert HLSL, Buffers are more performant than StructuredBuffers (perhaps negligibly, depending on workload), but much to my surprise, I discovered that the AMD card has NFI what to do with Buffers. Just didn't work at all. But changing them to StructuredBuffers did the trick. Amazing. I later found a thread from 2020 reporting the same issue: Compute shader Buffer/RWBuffer vs StructuredBuffer/RWStructuredBuffer - Unity Forum

    It's worth noting that my NVidia card had NONE of these issues. It knew what to do with Buffers, it could execute indirect drawing, and it aliased into data according to the types specified in the shaders. It just worked precisely how you'd expect it to. In all of my testing, every RenderDoc capture taken on the AMD card ran flawlessly on the NVidia.

    Take-home lessons
    Only use Graphics.DrawProcedural(); avoid AppendStructuredBuffers unless you have some scenario where stride is identical across kernels/shaders; and never use Buffers, only StructuredBuffers. And Unity really needs to invest dev time into procedural rendering to at least ensure that all the APIs actually work a) at all, and b) across major hardware.
     
  19. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,095
    Based on your conclusions, here is an alternative:

    * Change AppendStructuredBuffer to StructuredBuffer, and create it in script using the Counter flag instead of Append
    * Always declare your buffer stride as the size of a vertex (in shader + script)
    * To generate, call buffer.IncrementCounter once, then write 3 verts to the buffer using (index*3+0, index*3+1, index*3+2)
    * Now you have N verts in the buffer, but the counter is N/3
    * Next, you need to build your indirect args buffer to use with DrawProceduralIndirect
    * So use ComputeBuffer.CopyCount to copy the counter into index 0 of your args buffer
    * Then dispatch a compute shader that loads index 0 of the args buffer, and stores it multiplied by 3 ( args[0] *= 3; )
    * Then call DrawProceduralIndirect with the args buffer

    (I think it's index 0 - double check this tho)
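    The steps above can be sketched as a small CPU-side simulation (illustrative Python, not Unity API; emit_triangle and the args layout are my own stand-ins):

```python
# CPU-side simulation (illustrative, not Unity API) of the counter-based
# scheme above: each generator "thread" claims one triangle slot via an
# atomic-style counter, writes 3 verts at index*3+{0,1,2}, then the counter
# is copied into the indirect args and multiplied by 3.

counter = 0
verts = {}

def increment_counter():
    """Stand-in for HLSL IncrementCounter(): returns the old value."""
    global counter
    old = counter
    counter += 1
    return old

def emit_triangle(tri_verts):
    """Claim one triangle slot and write its 3 verts tightly packed."""
    index = increment_counter()
    for j in range(3):
        verts[index * 3 + j] = tri_verts[j]

# Generate 4 triangles (in any order, as threads would on the GPU).
for t in range(4):
    emit_triangle((f"t{t}a", f"t{t}b", f"t{t}c"))

# CopyCount step: counter -> args[0], then a tiny "dispatch" multiplies by 3.
args = [counter, 1, 0, 0]
args[0] *= 3

print(args[0])    # 12 vertices to draw
print(len(verts)) # 12 verts written, tightly packed at slots 0..11
```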
     
  20. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    I appreciate your taking the time to respond with a solution.

    The issue for my use-case is that your suggestion only works with the assumption that index == (N)triangles where N is known and constant. Those are the trivial cases where I don't even need to use indirect rendering because I know the number of indexes when I dispatch the kernel. The difficult cases where AppendStructuredBuffer becomes essential for dense buffers are things like marching cubes surfaces, where each index can push up to four triangles, or even minecraft style boxels when you don't mesh internal faces, but in both cases, almost all indexes will push zero triangles. Any 3D voxel to mesh solution will produce incredibly sparse buffers.

    Voxel meshing can (and should) be simple, efficient, and fast, by leveraging the buffer types DX11 gives us. I'm going to have to take a performance hit to post-process sparse meshes into dense meshes, not to mention the wasted memory overhead.

    It really should be a priority for Unity to ensure API parity between different GPU vendors on the same target platform.
     
  21. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    Why? I don't see anything in Richard's suggestion that would require that N is constant.
     
  22. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    This part:
    If we're writing into a buffer by index*3, that's giving us 3 vertex slots per index for one triangle. If we want N triangles, we'd do index*3*N. This implementation isn't necessary for the simple case of say, a 2D terrain heightmap where N=2 triangles per quad, because we definitely already know the vertex count CPU-side so don't need indirect rendering. We need indirect rendering when we don't know the vertex count, and in those cases, N can't be assumed to be constant.
     
    Last edited: Nov 28, 2022
  23. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,095
    I see. So the key here is that one kernel may write more than 1 triangle. Marching cubes is a good example to clarify this, thanks for that.

    I'll mention this use-case to the graphics team. See what they think.

    Thanks!
     
  24. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    Sorry, I still don't get it. Apologies if I'm being stupid, but isn't the counter value just the triangle count? I don't see why you wouldn't be able to add a variable number of triangles per compute shader call:
    Code (HLSL):
    void cs()
    {
        uint trianglesToGenerate = CalculateNumberOfTris();   // variable number
        for (uint i = 0; i < trianglesToGenerate; ++i)
        {
            uint firstTriangleIndex = buffer.IncrementCounter();
            uint firstVertexIndex = firstTriangleIndex * 3;
            buffer[firstVertexIndex] = CalculateVertex(i, 0);
            buffer[firstVertexIndex + 1] = CalculateVertex(i, 1);
            buffer[firstVertexIndex + 2] = CalculateVertex(i, 2);
        }
    }
    And then when you are done with all of that, make a second dispatch call to update the indirect args buffer:
    Code (HLSL):
    [numthreads(1,1,1)]
    void cs()
    {
        uint numTriangles, stride;
        buffer.GetDimensions(numTriangles, stride);
        indirect_mesh_args[0] = numTriangles * 3;
    }
    It is worth mentioning that you'd need a UAV barrier between those two calls in some graphics APIs (OpenGL and DX12, but not DX11). Not sure if Unity handles that internally.
     
    Last edited: Nov 28, 2022
  25. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,095
    Hmm, it may be a good point. You wouldn't be guaranteeing that the 3 tris (or however many) were tightly packed in the array, as another shader kernel might atomically insert one in between, but there is nothing to stop you calling IncrementCounter a variable number of times in 1 shader. You'd still get an atomically safe list of triangles at the end of the shader.

    Thanks for pointing this out! And thanks for fleshing out my suggestion with some real code :)

    (Though I would still use CopyCount, rather than hardcoding the draw count to be the entire buffer. No point drawing triangles beyond the end of the valid data set; plus they might be full of garbage data.)
     
  26. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    257
    Right, the triangles might be in random order, but that doesn't matter as long as the 3 vertices are in the right order.

    I'm not sure if GetDimensions gives you the maximum buffer size or the current count. If it's the maximum buffer size, you can do
    Code (CSharp):
    indirect_mesh_args[0] = buffer.IncrementCounter() * 3;
    instead. Doesn't matter that you increment the buffer count as a side effect at this point.
     
  27. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,095
    GetDimensions will give you the max size. Calling IncrementCounter one more time is an interesting trick! I've always copied the counter out with CopyCount, but your way is probably more efficient.
     
  28. benoneal

    benoneal

    Joined:
    Apr 15, 2016
    Posts:
    31
    You're not being stupid at all of course. And yeah, your code example would work, as you're implementing AppendStructuredBuffer functionality yourself, since they use the same hidden counter under the hood. If I return to tessellation, I'll likely take on board your suggestion here to maintain a dense mesh buffer, thanks.

    I still think it's a major issue that this would even be required though. Unity *should* correctly cover the DX11 API for supported cards (and AMD always natively supported DX11). Cross-vendor/platform support is the single biggest selling point, at least for me. The last thing I want to deal with is cases like this, where I have to find workarounds and hand-roll patches to fill the gaps... ESPECIALLY when the underlying hardware actually supports exactly what I'm trying to do, and Unity is dropping the ball in the middle.
     
  29. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,095
    Ok. If you want to submit a bug report, the relevant part of the dev team can figure out whether we should fix something.

    IMO though, it's unusual to use a StructuredBuffer with a stride that is different to the stride used in the shader, and therefore reasonable to think that this could result in problematic behaviour.

    Do you know that this use case is defined in the DX11 spec? If not, it seems possible that we are just getting (partially) lucky because NVidia decided to handle it in their driver code, once it sees the resource bound vs. the shader type declaration. (speculation!)

    I'd just use the alternative suggestion, it seems safer and more conformant all round, but of course it's up to you :)
     
    Last edited: Dec 9, 2022