Question Can GPUs process multiple draw calls concurrently?

Discussion in 'General Graphics' started by VictorKs, Feb 21, 2023.

  1. VictorKs

    VictorKs

    Joined:
    Jun 2, 2013
    Posts:
    242
    So let's say I have a draw call for a mesh with 10 triangles/30 verts which only covers a small portion of the scene.
    That means 30 vertex shader invocations, and let's say it covers 8x8 fragments, thus 64 pixel shader invocations. Assuming a warp size of 32 threads (Nvidia GPU), the above will only occupy 3 warps: 1 for the vertices and 2 for the pixels. So do other draw calls get processed concurrently, or does the rest of the GPU simply stall?
    I know that compute kernels can be dispatched concurrently, but I am wondering what the limitations are for normal rendering.
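The warp estimate above can be checked with a few lines of arithmetic. This is only a sketch under the post's own assumptions (32 threads per warp, 30 vertex and 64 pixel invocations, no vertex reuse, no partially filled pixel quads); `warps_needed` is a hypothetical helper name, and real hardware may pack work differently.

```python
import math

WARP_SIZE = 32  # threads per warp, as assumed in the post for Nvidia GPUs


def warps_needed(invocations: int, warp_size: int = WARP_SIZE) -> int:
    """Minimum number of warps needed to hold the given shader invocations."""
    return math.ceil(invocations / warp_size)


vertex_invocations = 30  # 10 triangles, 3 verts each, no vertex reuse assumed
pixel_invocations = 64   # pixel shader invocations, as stated in the post

vertex_warps = warps_needed(vertex_invocations)
pixel_warps = warps_needed(pixel_invocations)
total_warps = vertex_warps + pixel_warps

print(vertex_warps, pixel_warps, total_warps)  # 1 2 3
```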
     
  2. VictorKs

    VictorKs

    Joined:
    Jun 2, 2013
    Posts:
    242
    I am 99.9% sure there is concurrent execution; I'm just not sure about the fixed-function pipeline stages, meaning the rasterizer and the blending in the ROPs. So to rephrase my question: "Can the fixed-function stages process different draw calls with different states?"

    Maybe this is a stupid or overly technical question, but I am asking because I fear that draw calls with few verts/frags will occupy an entire Streaming Multiprocessor.
     
    Last edited: Feb 21, 2023
  3. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    551
    I can't give you a full answer but here are my 2 cents:

    Yes, draw calls are usually processed concurrently. You can see this in PIX, for example. The dark blue line is the selected draw call, the bright blue line is the draw call that I am hovering over with the mouse:
    upload_2023-2-21_19-15-11.png

    You can also see that CS (compute shader) warps, PS (pixel shader) warps and VTG (vertex, tessellation, geometry) warps are executed concurrently. There are some unallocated warps between render passes:
    upload_2023-2-21_19-16-39.png

    There is a guarantee that the outcome must be as if the draw calls had been executed serially, but in most cases the GPU can give that guarantee without actually executing them serially. You'll notice that the order is random when you write to the same address of a UAV from a pixel shader. You have to use ROVs (rasterizer ordered views) to get an ordering guarantee there.

    Having said that, there are some cases where the GPU cannot fully overlap draw calls. If blending is enabled, the output merger stage must blend the results in order (the pixel shaders can still run concurrently; the GPU just has to buffer the results). If you use a render target or compute buffer as input, you have to make sure the GPU is done rendering to it first by adding a resource barrier. In DX11 this is done automatically; in DX12 it's your responsibility.

    How does the GPU make sure the result is "in order" despite executing the draw calls concurrently? Not sure, but I have two ideas:
    - It could rely on the depth buffer if there is one and blending is disabled. That would explain z-fighting if you have multiple overlapping triangles
    - Otherwise, it could pause the warp at the end until it is its turn. This means the warp would still occupy resources.
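The "shade concurrently, merge in order" idea can be illustrated with a toy simulation. This is purely a sketch of the concept and not a description of how any real GPU implements it: the shuffled completion order, the `pending` buffer, and all function names are assumptions made for illustration. Shader results finish in arbitrary order, but blending only retires them in submission order, so the final color matches strictly serial execution.

```python
import random


def blend(dst: float, src: float, alpha: float) -> float:
    """Standard 'over' alpha blend of a source value onto a destination value."""
    return src * alpha + dst * (1.0 - alpha)


def serial_result(draws, background=0.0):
    """Reference: shade and blend each draw strictly in submission order."""
    color = background
    for src, alpha in draws:
        color = blend(color, src, alpha)
    return color


def overlapped_result(draws, background=0.0, seed=0):
    """Shading completes in a shuffled order; blending retires in submission order."""
    finish_order = list(range(len(draws)))
    random.Random(seed).shuffle(finish_order)  # pretend shaders finish out of order

    pending = {}        # finished shader outputs waiting for their turn to blend
    next_to_retire = 0  # index of the next draw that is allowed to blend
    color = background
    for index in finish_order:
        pending[index] = draws[index]
        # Retire every buffered result whose predecessors have all been blended.
        while next_to_retire in pending:
            src, alpha = pending.pop(next_to_retire)
            color = blend(color, src, alpha)
            next_to_retire += 1
    return color


# Four overlapping draws covering the same pixel: (source value, alpha).
draws = [(1.0, 0.5), (0.2, 0.25), (0.8, 0.75), (0.4, 0.5)]
print(serial_result(draws) == overlapped_result(draws, seed=1))  # True
```

Note that the buffered results occupy storage until they retire, which matches the point above that waiting work still holds on to resources.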

    My guess is (don't know for sure) that you will get a stall if you change the render state (PSO) so only draw calls with the same render state will overlap.
     
    Last edited: Feb 22, 2023
  4. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,751
    They do. If you use a vendor-specific GPU debugger/profiler like Nvidia Nsight, you can see this in the timeline view, where many draw calls actually overlap with each other.

    This makes fine-grained GPU optimization far from straightforward, since what can and cannot be parallelized, and by how much, varies a lot between GPU models and vendors.

    Reducing draw calls is ultimately a CPU optimization, not a GPU one.
     
  5. VictorKs

    VictorKs

    Joined:
    Jun 2, 2013
    Posts:
    242
    This is what I expect too. Maybe it is not a complete stall and only certain units are occupied, but all this is too low level for my understanding. I understand the process very well up until rasterization; that is where things get fuzzy for me. Nvidia whitepapers say that the fragment warps are transferred to the appropriate Processing Cluster depending on screen-space position, so I guess they split fragment shader work spatially. In compute shaders, all warps of a thread group are placed on a single Streaming Multiprocessor so they can use shared memory and communicate with each other. I wouldn't be surprised if they used similar logic for certain parts of the pipeline, like not mixing warps from different shaders in the same SM. So maybe there are some warp batching rules and limitations (or maybe not!) :D
    Thankfully it works though!

    I haven't used Nsight yet, but I will take a look. I use RenderDoc because of its easy Unity integration. It is not that good for GPU performance analysis, but it works very well for shader debugging and inspecting API calls. But you are right, reducing draw calls is a CPU optimization.
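As a rough picture of what "splitting fragment work spatially" could look like, here is a sketch that maps screen tiles to processing clusters. Everything specific in it is an assumption for illustration: the 16-pixel tile size, the count of 4 clusters, the diagonal interleave, and the function names are hypothetical and not taken from any Nvidia documentation.

```python
TILE_SIZE = 16  # hypothetical screen tile edge, in pixels
NUM_GPCS = 4    # hypothetical number of processing clusters


def gpc_for_pixel(x: int, y: int) -> int:
    """Pick a cluster from the pixel's screen tile (assumed diagonal interleave)."""
    tile_x = x // TILE_SIZE
    tile_y = y // TILE_SIZE
    return (tile_x + tile_y) % NUM_GPCS


def gpcs_touched(x0: int, y0: int, width: int, height: int) -> set:
    """Set of clusters that would receive fragments from a screen rectangle."""
    return {
        gpc_for_pixel(x, y)
        for y in range(y0, y0 + height)
        for x in range(x0, x0 + width)
    }


# A small 8x8 pixel draw lands inside a single tile, so it feeds one cluster.
print(gpcs_touched(100, 100, 8, 8))  # {0}
# A 64x64 pixel draw spans many tiles and reaches every cluster.
print(sorted(gpcs_touched(0, 0, 64, 64)))  # [0, 1, 2, 3]
```

Under this kind of scheme, a tiny draw call would only keep one cluster busy, which is why overlapping it with other draw calls matters for utilization.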
     
  6. c0d3_m0nk3y

    c0d3_m0nk3y

    Joined:
    Oct 21, 2021
    Posts:
    551
    Yes, a fragment shader warp is a block (rectangle) of neighboring pixels on the screen.
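To make the "block of neighboring pixels" idea concrete, the sketch below counts how many warp-sized screen blocks a small rectangle touches. The 8x4 pixel footprint (32 lanes) is an assumed shape chosen for illustration; the actual footprint is vendor-specific and not documented here, and `warps_touched` is a hypothetical helper.

```python
WARP_W, WARP_H = 8, 4  # assumed warp footprint: 8x4 pixels = 32 lanes


def warps_touched(x0: int, y0: int, width: int, height: int) -> int:
    """Number of warp footprints overlapped by a screen-space rectangle."""
    first_col, last_col = x0 // WARP_W, (x0 + width - 1) // WARP_W
    first_row, last_row = y0 // WARP_H, (y0 + height - 1) // WARP_H
    return (last_col - first_col + 1) * (last_row - first_row + 1)


# A 4x4 pixel block aligned to the grid fits in one footprint (16 of 32 lanes used).
print(warps_touched(0, 0, 4, 4))  # 1
# The same block straddling footprint boundaries touches four footprints.
print(warps_touched(6, 2, 4, 4))  # 4
```

If warps really are screen-aligned blocks, the number of warps a small draw call needs depends on where it lands on screen, not only on how many pixels it covers.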

    It's hard to find good information on such low-level details because they are part of each vendor's secret sauce and can differ from vendor to vendor.

    A few years ago, somebody discovered that Nvidia now uses tile-based rasterization even on desktop GPUs:
    https://www.anandtech.com/show/10536/nvidia-maxwell-tile-rasterization-analysis