
Question: DrawMeshInstanced much faster than DrawMeshInstancedIndirect

Discussion in 'General Graphics' started by Desoxi, Jul 22, 2021.

  1. Desoxi

    Hey everyone,

    I am currently trying to make things work with DrawMeshInstancedIndirect instead of DrawMeshInstanced, and it sort of works (on the Oculus Quest). The only catch is that it is extremely slow. Profiling it, I can see a huge amount of vertex shading going on.
    The thing is, I am using the exact same shader; to make the indirect call work, I had to add the following lines, as usual:

    Code (HLSL):
    #pragma instancing_options procedural:setup

    // Per-instance data uploaded from the CPU via a ComputeBuffer.
    struct Properties
    {
        float4x4 objectToWorldMatrix;
        float4x4 worldToObjectMatrix;
    };

    CBUFFER_START(UnityPerMaterial)
        float4 _BaseMap_ST;
        half4 _BaseColor;
        half4 _SpecColor;
        half4 _EmissionColor;
        half _Cutoff;
        half _Surface;
    CBUFFER_END

    // A StructuredBuffer is a resource, not a constant, so it is declared
    // outside the UnityPerMaterial CBUFFER.
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    StructuredBuffer<Properties> _MatrixProperties;
    #endif
    22.  
    with the setup function looking like this:

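    Code (HLSL):
    // Hooked in by "#pragma instancing_options procedural:setup"; it is called
    // through UNITY_SETUP_INSTANCE_ID, i.e. in every vertex shader invocation.
    void setup()
    {
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        // Fetch this instance's precomputed matrices.
        Properties data = _MatrixProperties[unity_InstanceID];
        unity_ObjectToWorld = data.objectToWorldMatrix;
        unity_WorldToObject = data.worldToObjectMatrix;
    #endif
    }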
    I calculate the objectToWorld and worldToObject matrices on the CPU and send them via the ComputeBuffer so the GPU doesn't have to compute them for each instance. Unfortunately, that doesn't really help.
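    For reference, here is a minimal Python sketch (not the code from this post) of what precomputing both matrices on the CPU amounts to. It assumes each object has only a position and a uniform scale (no rotation), and object_to_world, world_to_object and matmul are hypothetical helpers; in Unity C# one would more likely use Matrix4x4.TRS and its .inverse property. The point is that worldToObject can be written down analytically, so the shader never has to invert anything.

```python
# Hypothetical helpers. Assumption: translation + uniform scale only (no rotation).
# Matrices are row-major lists of lists, using the column-vector convention.


def object_to_world(pos, scale):
    """Uniform scale followed by translation."""
    x, y, z = pos
    s = scale
    return [
        [s, 0.0, 0.0, x],
        [0.0, s, 0.0, y],
        [0.0, 0.0, s, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def world_to_object(pos, scale):
    """Analytic inverse: undo the translation, then undo the scale."""
    x, y, z = pos
    inv = 1.0 / scale
    return [
        [inv, 0.0, 0.0, -x * inv],
        [0.0, inv, 0.0, -y * inv],
        [0.0, 0.0, inv, -z * inv],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Plain 4x4 matrix product, used here only as a sanity check."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

    Multiplying the two matrices in either order gives the identity, which is a cheap sanity check before uploading them to the buffer.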

    What I don't understand is why this has such an impact on the vertex stage when the DrawMeshInstanced (non-indirect) version works perfectly.

    I wanted to use the indirect version because I have more than 1023 objects to render, but unfortunately it looks like several non-indirect calls that together render everything are faster than one indirect call. To my knowledge, DrawMeshInstanced is just a wrapper around DrawMeshInstancedIndirect. Maybe I am using it in a completely wrong way at the moment?
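    For the multi-call workaround, the instances have to be split into chunks of at most 1023, the per-call limit Unity documents for Graphics.DrawMeshInstanced. A minimal Python sketch of that chunking arithmetic; batch_ranges is a hypothetical helper, and in Unity each (start, count) pair would correspond to one DrawMeshInstanced call over that slice of the matrix array.

```python
# Hypothetical helper: split N instances into DrawMeshInstanced-sized batches.
MAX_INSTANCES_PER_CALL = 1023  # per-call limit of Graphics.DrawMeshInstanced


def batch_ranges(total, max_per_call=MAX_INSTANCES_PER_CALL):
    """Return (start, count) pairs covering range(total), each count <= max_per_call."""
    if total < 0 or max_per_call <= 0:
        raise ValueError("total must be >= 0 and max_per_call > 0")
    return [(start, min(max_per_call, total - start))
            for start in range(0, total, max_per_call)]
```

    For example, batch_ranges(2500) gives [(0, 1023), (1023, 1023), (2046, 454)], i.e. three DrawMeshInstanced calls.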

    Any help and insight are much appreciated!

    P.S.: Even rendering far fewer objects with DrawMeshInstancedIndirect (count 200) performs worse than DrawMeshInstanced (count 1023).
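    Since the P.S. compares instance counts, it may help to recall where the count lives in the indirect path: it is not a method parameter but the second of five integers in the arguments buffer. Per Unity's documentation for Graphics.DrawMeshInstancedIndirect, those five values are the index count per instance, instance count, start index location, base vertex location and start instance location. The Python sketch below only packs them into the 20-byte layout as an illustration; make_args is a hypothetical helper, and in C# the first values would typically come from mesh.GetIndexCount(submesh), mesh.GetIndexStart(submesh) and mesh.GetBaseVertex(submesh).

```python
import struct


# Hypothetical helper mirroring the five-uint argument layout that
# Graphics.DrawMeshInstancedIndirect reads from its bufferWithArgs.
def make_args(index_count_per_instance, instance_count,
              start_index=0, base_vertex=0, start_instance=0):
    """Pack the indirect draw arguments as five little-endian 32-bit uints (20 bytes)."""
    return struct.pack("<5I", index_count_per_instance, instance_count,
                       start_index, base_vertex, start_instance)
```

    For a mesh with 36 indices (such as Unity's built-in cube) and 200 instances, make_args(36, 200) yields the values (36, 200, 0, 0, 0).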
     
    Last edited: Jul 22, 2021
  2. Desoxi

    To keep you up to date: I tried lowering the required bandwidth a bit by using a StructuredBuffer<float3> instead of the StructuredBuffer<Properties>, whose Properties struct contains two float4x4s.
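    To put numbers on that change, here is a small Python sketch of the per-element stride and total buffer size before and after. It assumes 4-byte floats and tightly packed structs, which matches the stride one would pass to new ComputeBuffer(count, stride); the 2000-instance count is only an example, and buffer_bytes is a hypothetical helper.

```python
# Assumption: 4-byte floats, tightly packed structs (no padding).
FLOAT_BYTES = 4
FLOAT3_BYTES = 3 * FLOAT_BYTES          # StructuredBuffer<float3>: 12 bytes
FLOAT4X4_BYTES = 16 * FLOAT_BYTES       # one float4x4: 64 bytes
PROPERTIES_BYTES = 2 * FLOAT4X4_BYTES   # struct Properties: 128 bytes


def buffer_bytes(instance_count, stride):
    """Total size of a structured buffer holding one element per instance."""
    return instance_count * stride


if __name__ == "__main__":
    n = 2000  # example instance count
    print(buffer_bytes(n, PROPERTIES_BYTES))  # 256000
    print(buffer_bytes(n, FLOAT3_BYTES))      # 24000
```

    That is roughly a 10.7x reduction per element (128 vs. 12 bytes). As far as I know, setup() runs in every vertex shader invocation, so each vertex loads one element and the saving applies per vertex, not just per instance.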

    This improved performance, though it is still not on par with the non-indirect version.
    It could be related to the fact that DrawMeshInstanced uses constant buffers (constant registers), which seem to be blazing fast but are limited in size,
    whereas DrawMeshInstancedIndirect, as we can see, uses structured buffers, for which the first read is slow but all subsequent reads are very fast.

    Correct me if I am wrong here though!
     
  3. Desoxi

    So after testing around a bit, I am now able to render over 2000 objects with DrawMeshInstancedIndirect at a steady 72 fps with CPU/GPU levels at 2/2. The reason it struggled so much before was that there were simply too many vertices for the GPU to handle. I used an icosphere with around 300 vertices, which seems to be too much for the Quest at the moment with this many objects. With cubes, or any mesh with far fewer vertices, it works like a charm.
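    The vertex-count explanation is easy to put into numbers. A rough Python sketch, assuming about one vertex-shader invocation per vertex per instance (the real count depends on vertex reuse on the GPU) and using Unity's built-in cube (24 vertices) as the low-poly comparison; the function names are hypothetical.

```python
# Rough estimate only: assumes about one vertex-shader invocation per mesh
# vertex per instance; real counts depend on the GPU's vertex reuse.
FRAME_RATE = 72  # Quest frame rate mentioned in the post


def vertex_invocations_per_frame(instances, verts_per_mesh):
    return instances * verts_per_mesh


def vertex_invocations_per_second(instances, verts_per_mesh, fps=FRAME_RATE):
    return vertex_invocations_per_frame(instances, verts_per_mesh) * fps


if __name__ == "__main__":
    # ~300-vertex icosphere vs. Unity's built-in cube (24 vertices), 2000 instances
    print(vertex_invocations_per_second(2000, 300))  # 43200000
    print(vertex_invocations_per_second(2000, 24))   # 3456000
```

    Going from the icosphere to a cube cuts the per-frame vertex work by a factor of 12.5 (600,000 vs. 48,000 invocations), before counting the per-vertex buffer fetch in setup().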

    I compared drawing the same number of objects with multiple DrawMeshInstanced() calls against one DrawMeshInstancedIndirect() call, and in the end both hold a stable 72 fps.
    The render time does not change much either. Interestingly, though, the fill rate with the indirect call was much better, meaning it produced less overdraw than the DrawMeshInstanced version.