
Question ComputeShader.Dispatch from off-thread or Job System?

Discussion in 'Shaders' started by funkyCoty, Oct 28, 2020.

  1. funkyCoty

    funkyCoty

    Joined:
    May 22, 2018
    Posts:
    727
    Is it possible to Dispatch compute shaders from off the main thread, or from the Job System?

    While porting to consoles, I've noticed that our many, many, many Dispatch calls are a serious performance concern. On PC they're pretty cheap (~1ms total), but on certain consoles it's much higher. Profiling shows that both the dispatches themselves and the variable assignments (texture sets, float sets, etc.) are eating up much of our frame.
     
  2. holdingjason

    holdingjason

    Joined:
    Nov 14, 2012
    Posts:
    135
    Was wondering the same thing. Ever find out?
     
  3. funkyCoty

    funkyCoty

    Joined:
    May 22, 2018
    Posts:
    727
    Tried it out, turns out you cannot. I wish you could!

    For now, I'm building a command buffer and just dispatching the command buffer over and over, rather than rebuilding the commands every frame.
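    The "build once, replay every frame" approach described above might look roughly like the sketch below. The component name, the kernel name "CSMain", the property names "_Result" and "_Scale", and the 8x8 thread group size are all placeholders rather than anything from the original project. Execution still happens on the main thread.

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component: records the compute commands once and replays them.
public class CachedComputeCommands : MonoBehaviour
{
    public ComputeShader shader;   // assumed to contain a kernel named "CSMain"
    public RenderTexture target;   // assumed to be created with enableRandomWrite = true

    CommandBuffer cachedCommands;

    void Start()
    {
        int kernel = shader.FindKernel("CSMain");

        // Record the parameter sets and the dispatch a single time.
        cachedCommands = new CommandBuffer { name = "Cached compute" };
        cachedCommands.SetComputeTextureParam(shader, kernel, "_Result", target);
        cachedCommands.SetComputeFloatParam(shader, "_Scale", 1.0f);

        // Assumes the kernel is declared with [numthreads(8, 8, 1)].
        cachedCommands.DispatchCompute(
            shader, kernel,
            Mathf.CeilToInt(target.width / 8f),
            Mathf.CeilToInt(target.height / 8f),
            1);
    }

    void Update()
    {
        // Replay the recorded commands; no C#-side rebuilding per frame.
        Graphics.ExecuteCommandBuffer(cachedCommands);
    }

    void OnDestroy()
    {
        cachedCommands?.Release();
    }
}
```

    This only pays off when the recorded parameters really are constant from frame to frame; anything that changes would need its own pre-recorded buffer or a rebuild.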
     
  4. tvirolai

    tvirolai

    Unity Technologies

    Joined:
    Jan 13, 2020
    Posts:
    79
    Hopefully someday. But for now it is indeed single-threaded. A major reason for this is dependency tracking. Our API works as DX11/OpenGL would, so if you use a resource in a Dispatch call, all the barriers etc. must be issued correctly on platforms that require them.

    Let's use an example. I have two threads. One uses some image as a UAV and the other uses it as an SRV in a compute shader, so there needs to be a barrier between them. The two threads issue a dispatch at the same time. If we just push them to the GPU (in APIs that allow it, like Vulkan), that's an error, as there is no barrier. So we'd basically need an intermediate layer that waits for both threads and then issues the calls internally from a single thread after looking at the dependencies, which brings us back to single-threaded performance.

    For real multithreaded dispatch one needs some way of determining dependencies ahead of time so that the cases where there is a need for barriers etc are serialized but independent cases are not. We don't have such a thing yet unfortunately.
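    To make the dependency problem concrete, here is a minimal sketch of the kind of check such a system would need. It is purely illustrative and not a Unity API: `DispatchDesc` and `HazardCheck` are hypothetical names, and resources are identified by plain strings for simplicity.

```csharp
using System.Collections.Generic;

// Hypothetical description of what a single dispatch touches.
public sealed class DispatchDesc
{
    public HashSet<string> Reads = new HashSet<string>();   // e.g. resources bound as SRV
    public HashSet<string> Writes = new HashSet<string>();  // e.g. resources bound as UAV
}

public static class HazardCheck
{
    // True if the two dispatches touch a common resource in a way that
    // requires a barrier, and therefore a fixed order, between them.
    public static bool NeedsBarrier(DispatchDesc a, DispatchDesc b)
    {
        return a.Writes.Overlaps(b.Reads)    // read-after-write
            || a.Reads.Overlaps(b.Writes)    // write-after-read
            || a.Writes.Overlaps(b.Writes);  // write-after-write
    }
}
```

    In the two-thread example above, one dispatch writes the image and the other reads it, so `NeedsBarrier` returns true and the pair has to be serialized. Two dispatches that only read the same image, or touch unrelated resources, could in principle be recorded in parallel.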
     
  5. holdingjason

    holdingjason

    Joined:
    Nov 14, 2012
    Posts:
    135
    Thanks yeah that makes sense.
     
  6. holdingjason

    holdingjason

    Joined:
    Nov 14, 2012
    Posts:
    135
    That makes sense. So what about the setup before the dispatch, i.e. creation of the buffer that is passed in? For example, the setup below gets fairly costly if you do it often enough. Right now I believe you have to do this on the main thread. I'm not sure if there are any recommendations on how to do this sort of thing faster before actually dispatching. I suppose, as @funkyCoty was saying, one could cache it and just change what has changed instead of rebuilding it from scratch (not sure you can do that), then dispatch again.

    cameraComputeShader.SetBuffer(instanceVisibilityComputeKernelId, GPUInstancerConstants.VisibilityKernelPoperties.INSTANCE_DATA_BUFFER, runtimeData.transformationMatrixVisibilityBuffer);

    cameraComputeShader.SetFloats(GPUInstancerConstants.VisibilityKernelPoperties.BUFFER_PARAMETER_MVP_MATRIX, cameraData.mvpMatrixFloats);
     
  7. funkyCoty

    funkyCoty

    Joined:
    May 22, 2018
    Posts:
    727
    I REALLY wish there was a way to change an existing command buffer, instead of having to rebuild from scratch. I have a project where I'm double-buffering textures (swapping the reference after each write). My solution for this was to just have two command buffers, and swap the command buffer used rather than rebuilding it every frame.
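    A minimal sketch of the two-command-buffer approach for double-buffered textures, assuming a kernel named "CSMain" with "_Source" and "_Destination" texture properties and an 8x8 thread group size (all placeholder names, not taken from the original project):

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component: one pre-recorded command buffer per ping-pong direction.
public class PingPongComputeCommands : MonoBehaviour
{
    public ComputeShader shader;      // assumed to contain a kernel named "CSMain"
    public RenderTexture texA, texB;  // assumed same size, enableRandomWrite = true

    readonly CommandBuffer[] commands = new CommandBuffer[2];
    int current;

    void Start()
    {
        int kernel = shader.FindKernel("CSMain");
        commands[0] = Record(kernel, texA, texB, "Compute A->B");
        commands[1] = Record(kernel, texB, texA, "Compute B->A");
    }

    CommandBuffer Record(int kernel, RenderTexture source, RenderTexture destination, string label)
    {
        var cmd = new CommandBuffer { name = label };
        cmd.SetComputeTextureParam(shader, kernel, "_Source", source);
        cmd.SetComputeTextureParam(shader, kernel, "_Destination", destination);

        // Assumes the kernel is declared with [numthreads(8, 8, 1)].
        cmd.DispatchCompute(
            shader, kernel,
            Mathf.CeilToInt(destination.width / 8f),
            Mathf.CeilToInt(destination.height / 8f),
            1);
        return cmd;
    }

    void Update()
    {
        // Swap which pre-recorded buffer runs instead of swapping texture references.
        Graphics.ExecuteCommandBuffer(commands[current]);
        current ^= 1;
    }

    void OnDestroy()
    {
        foreach (var cmd in commands) cmd?.Release();
    }
}
```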
     
  8. tvirolai

    tvirolai

    Unity Technologies

    Joined:
    Jan 13, 2020
    Posts:
    79
    Most of those calls are actually relatively light. The final building of the state currently happens at the moment of dispatch.

    To get some of the overhead out, you could try using cbuffers directly and keeping them cached. Without an explicit cbuffer, SetFloats etc. basically write into an OpenGL-style uniform state, which we then copy into a temporary cbuffer at the moment of dispatch.

    But they are tricky due to alignment and padding differences between APIs, which we don't have a good solution for just yet, other than manually padding in HLSL shaders to make them match std140 (the easy way is to not use arrays of anything other than float4, and to not use float3 at all).
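    As a sketch of the "float4 only" rule, the C# side of such a cbuffer could be a struct made up entirely of Vector4 fields, so that each field occupies exactly one 16-byte slot. The struct, field and cbuffer names below are hypothetical.

```csharp
using System.Runtime.InteropServices;
using UnityEngine;

// Assumed matching HLSL declaration (names are hypothetical):
//
//   cbuffer CullingParams
//   {
//       float4 _CameraPosAndTime;  // xyz = camera position, w = time (no float3)
//       float4 _LodDistances;      // four scalars packed into one float4
//       float4 _FrustumPlanes[6];  // arrays only ever of float4
//   };

[StructLayout(LayoutKind.Sequential)]
public struct CullingParams
{
    public Vector4 cameraPosAndTime;
    public Vector4 lodDistances;

    // The six array elements are spelled out as fields to keep the struct blittable.
    public Vector4 plane0, plane1, plane2, plane3, plane4, plane5;

    // Size in bytes of the whole block: 8 float4 values of 16 bytes each.
    public static int SizeInBytes
    {
        get { return Marshal.SizeOf(typeof(CullingParams)); }
    }
}
```

    Packing a float3 and a scalar into one float4, as in `cameraPosAndTime`, is one way to follow the advice above without wasting the fourth component.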

    The commandbuffer is purely a software construct on the Unity side, so the actual API-level command buffers are fully rebuilt every time one submits it. That means it would be possible to change an existing commandbuffer, but it also means doing so is mostly useless, as it's replayed in full every time one submits it. Just having two is fine, as they don't take much memory.

    If we manage to change our CommandBuffers in a way that makes them suitable to be actual API-level command buffers, they couldn't be changed regardless, so one would still have to have two.

    And let's say there is some esoteric platform, like some console or whatnot, that would actually allow such a change. If we exposed that, we'd have to emulate it everywhere. You can imagine what kind of caching problem it would be for us to internally maintain N command buffers per actual commandbuffer and then figure out which to evict and which to keep: if the user does simple double-buffering we want that to be fast, but we also don't want to keep thousands of buffers in cache. The joys of trying to expose things that can be efficiently implemented everywhere :)


    As for the original post in this thread, can you file a bug about it so that the team handling the console backend is aware of it? Assuming the issue is still relevant. With luck they can optimize away a lot of the overhead and get it closer to PC.
     
  9. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,751
    Not only dispatches, in my experience. It seems every GPU call Unity makes takes significantly more CPU time on Switch compared to PC. While the Switch CPU is much weaker than your average Intel, its native graphics API should have less overhead than immediate-mode desktop APIs like DX11.

    (But then again, Unity has notorious performance issues with low level APIs like DX12 and Vulkan).
     
  10. funkyCoty

    funkyCoty

    Joined:
    May 22, 2018
    Posts:
    727
    I wish this was true! The functions to build the unity-side command buffer really do add up, and quickly. It's easy to get into a situation where you are spending several ms simply building the unity side of the command buffer, before dispatching. This is why I mentioned previously that I am just building the command buffer once or twice and reusing/swapping where necessary. If these functions are meant to be "free" then I think Unity needs to take a second look at them sometime.
     
  11. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,751
    I see the same here: building command buffers is far from "free". A good example is PPv2. Even after forking and optimizing a lot of it, it still takes around 2.5 ms of precious main thread time on Switch, mostly building command buffers from scratch every frame.

    I really wish Unity would actually dogfood their engine with actual published games on every platform, because when people complained about PPv2 taking 1ms per frame just to needlessly interpolate volume properties, the response was that 1ms was "acceptable".
     
    Last edited: Jan 9, 2021
  12. tvirolai

    tvirolai

    Unity Technologies

    Joined:
    Jan 13, 2020
    Posts:
    79
    Relatively light doesn't necessarily mean light in the absolute sense :D. It's just light compared to the final setting of the state (especially on modern APIs, where we need to emulate OpenGL-style global uniform state, which is part of the reason they are not too efficient for us).

    If you have a good test project or info on which specific commands are slow, don't hesitate to file a bug about performance, because there is no doubt that we can do better. And the more practical projects we have, the better.

    But rest assured we are working on improving the performance.
     
  13. funkyCoty

    funkyCoty

    Joined:
    May 22, 2018
    Posts:
    727
    It's not just one... it's everything. I know that sucks to hear, but I'm not sure how else to say it. Every single SetTexture or SetFloat has a significant impact, so if you are doing high hundreds to low thousands per frame it adds up to several ms, even on high-end hardware.

    I could direct you to a Google Drive link which has our recent project (for unrelated bug reports), which heavily uses compute shaders and command buffers, but honestly I'm not sure how much it would help, considering it's these functions themselves that are slow.
     
  14. tvirolai

    tvirolai

    Unity Technologies

    Joined:
    Jan 13, 2020
    Posts:
    79
    Have you tried https://docs.unity3d.com/2020.2/Documentation/ScriptReference/ComputeShader.SetConstantBuffer.html ?
    That basically allows you to bypass the whole property system and just use a fixed cbuffer in a shader: one binding call instead of multiple calls to set individual values. The ideal use is to manually precreate all the variants you use, though. If one uses it in a pattern like Dispatch(); Buffer.SetData(); Dispatch(); it's not that good. Especially with a SubData-style partial update, as that prevents internal versioning of the buffer and can cause GPU stalls depending on the backend.

    That won't help with SetTexture; however, there are usually way more individual values in a shader than textures.

    HDRP moved to use the cbuffer API precisely because the property system is so slow. It's something that's hard to make any faster than it is, which is unfortunate as it's still intuitive and easy.
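    A minimal sketch of the SetConstantBuffer route, under stated assumptions: the shader is assumed to declare `cbuffer Params { float4 _Tint; float4 _ScaleAndOffset; };` and a kernel named "CSMain", and the component, struct and thread group count are placeholders. The buffer is filled once up front rather than between dispatches, in line with the "precreate" advice above.

```csharp
using System.Runtime.InteropServices;
using UnityEngine;

// Hypothetical component: binds one precreated constant buffer instead of
// calling SetFloat / SetVector for each individual value.
public class ConstantBufferDispatch : MonoBehaviour
{
    // float4-only layout so C# and HLSL agree on offsets.
    [StructLayout(LayoutKind.Sequential)]
    struct Params
    {
        public Vector4 tint;
        public Vector4 scaleAndOffset;
    }

    public ComputeShader shader;  // assumed to declare cbuffer "Params" and kernel "CSMain"

    static readonly int ParamsId = Shader.PropertyToID("Params");

    ComputeBuffer constants;
    int kernel;
    int sizeInBytes;

    void Start()
    {
        kernel = shader.FindKernel("CSMain");
        sizeInBytes = Marshal.SizeOf(typeof(Params));

        // Create and fill the constant buffer once.
        constants = new ComputeBuffer(1, sizeInBytes, ComputeBufferType.Constant);
        constants.SetData(new[]
        {
            new Params { tint = Color.white, scaleAndOffset = new Vector4(1f, 1f, 0f, 0f) }
        });
    }

    void Update()
    {
        // One binding call covers every value in the cbuffer.
        shader.SetConstantBuffer(ParamsId, constants, 0, sizeInBytes);
        shader.Dispatch(kernel, 64, 1, 1);  // 64 groups is an arbitrary placeholder
    }

    void OnDestroy()
    {
        constants?.Release();
    }
}
```

    If several parameter sets are needed, one option is to precreate one such buffer per set and switch which one is bound, rather than calling SetData between dispatches.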
     
  15. edeguine

    edeguine

    Joined:
    Jul 2, 2015
    Posts:
    4
    Is there any update on "running a compute shader off the main thread"?
    I am doing mesh deformation / topology in a compute shader, and it's just slow enough that I would prefer to not do it on the render thread.

    I tried a hacky solution where I send the compute shader action from the background thread to a MonoBehaviour that processes it in its Update() function on the main thread and then hands the result back. But it costs a full frame every time, when the job itself only takes 1/5 of a frame, so it's not great, especially because I need to look at the results of the compute shader and issue one or two more calls with different input to complete the job.

    Any solution would be great. I am developing for the Quest 2/3 (so Android + Vulkan)
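    For readers unfamiliar with the workaround described above, a minimal sketch of such a main-thread queue could look like this. `MainThreadComputeQueue` is a hypothetical name, and the sketch has the same limitation mentioned in the post: queued work only runs on the next Update().

```csharp
using System;
using System.Collections.Concurrent;
using UnityEngine;

// Hypothetical helper: background threads enqueue work, the main thread runs it.
public class MainThreadComputeQueue : MonoBehaviour
{
    static readonly ConcurrentQueue<Action> pending = new ConcurrentQueue<Action>();

    // Safe to call from any thread.
    public static void Enqueue(Action work)
    {
        pending.Enqueue(work);
    }

    void Update()
    {
        // Drain everything queued since the last frame. Dispatch calls made
        // inside these actions run on the main thread, as Unity requires.
        while (pending.TryDequeue(out Action work))
        {
            work();
        }
    }
}
```

    A background thread would call something like `MainThreadComputeQueue.Enqueue(() => shader.Dispatch(kernel, groups, 1, 1));`, where `shader`, `kernel` and `groups` are the caller's own variables.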
     
  16. aleksandrk

    aleksandrk

    Unity Technologies

    Joined:
    Jul 3, 2017
    Posts:
    3,028