
Why is AsyncGPUReadback limited to the main thread?

Discussion in 'General Graphics' started by LightStriker, Jul 3, 2019.

  1. LightStriker

    LightStriker

    Joined:
    Aug 3, 2013
    Posts:
    2,717
    I know it's an odd question, but the purpose of AsyncGPUReadback is to offload something to the GPU.

    But WaitForCompletion can only be called on the main thread, and it may hang up the thread for a while.

    Would be nice to have the main thread free while we wait for the GPU.

    WaitForCompletion is a very odd beast: when you call it, it can be quick or take a long while. It depends on the GPU queue and other things I'm not too sure about.
     
  2. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    2,285
  3. LightStriker

    I need it later in the same frame as when it was called. Any clue how to do that without blocking the main thread?

    Since it returns NativeArray<>, it would have been nice to offload it to a different thread.
     
    Last edited: Jul 5, 2019
  4. richardkettlewell

    If you need the result on the same frame, ultimately the main thread is going to have to risk waiting for it with a call to WaitForCompletion. All you can do is call it as late as possible for your use case.

    The high-performance use case is to never need GPU data on the CPU in the same frame, but rather for it to be OK for the data to arrive a small number of frames later, so you can check the “done” property and wait another frame if it is not ready.
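
    A minimal sketch of the "request as early as possible, wait as late as possible" pattern, assuming a compute shader has already written its results into a ComputeBuffer. The class name, the resultBuffer field and the choice of Update/LateUpdate as the "early" and "late" points are assumptions made for illustration, not something taken from this thread.

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component: issues the readback early in the frame and only
// blocks on it at the last moment the result is needed.
public class SameFrameReadback : MonoBehaviour
{
    // Assumed to be created and filled elsewhere (e.g. by a compute dispatch).
    public ComputeBuffer resultBuffer;

    AsyncGPUReadbackRequest request;
    bool issued;

    void Update()
    {
        if (resultBuffer == null) return;

        // As early as possible: queue the readback right after the GPU work.
        request = AsyncGPUReadback.Request(resultBuffer);
        issued = true;
    }

    void LateUpdate()
    {
        if (!issued) return;
        issued = false;

        // As late as possible: this can still stall the main thread
        // if the data has not arrived yet.
        request.WaitForCompletion();
        if (request.hasError) return;

        NativeArray<float> data = request.GetData<float>();
        // Use 'data' here, within the same frame.
    }
}
```

    Whatever runs between Update and LateUpdate overlaps with the readback, but the wait itself remains a main-thread sync point.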
     
  5. LightStriker

    Sadly, no can do. We use it for depth culling.

    We tried to dispatch as early as we can, and WaitForCompletion as late as possible... but that turned out to be rather bad, because how long it's going to wait is highly dependent on what's happening with the CPU/GPU at that moment.

    Using PIX, we found out that our process only takes 0.22ms to complete on Xbox One. Calling Wait right away makes it stall the CPU for 1-1.2 ms. If we call it later - say, after processing physics - we sometimes stall for MUCH longer, and the GPU is busy doing something else, such as updating skinned objects.

    Again... Why is this main thread only? It returns a NativeArray which, since it's used in Jobs, I would assume is thread safe. Or maybe have an AsyncGPUReadback variant that forces the readback as soon as possible.
     
  6. richardkettlewell

    I’m not sure why this question matters. If you need it on the same frame, then, if you could run it on a thread, you’d still need to sync (join) that thread at some point? Which is the same as calling WaitForCompletion at that point, isn’t it?

    Anyway, all Unity scripting API that interacts with graphics must currently run on the main thread.
     
  7. richardkettlewell

    There is no way to ask the GPU to do it any sooner. It is done as soon as possible already.
     
    Last edited: Jul 6, 2019
  8. LightStriker

    Because it wouldn't lock up the main thread. I could run physics, AI, etc. while waiting to get the culling result back. I have literally 4ms of stuff to do between dispatching and the point where I start rendering.

    In some way, WaitForCompletion has an issue. PIX shows me my dispatch is handled right away and takes 0.22ms to complete. But for some reason, if I put WaitForCompletion 4 ms later, it still locks up the main thread for 0.5-1.8ms.

    I see here that WaitForCompletion does more than just waiting for "done" to be true. It calls Gfx.UpdateAsyncReadbackData. Would it be possible to have that Gfx.UpdateAsyncReadbackData call done right away without locking up the main thread?
     
  9. richardkettlewell

    That sounds like how long it takes for the Compute Shader (or whatever) to run on the GPU. After that, the results must be transferred back across the PCIe bus to the CPU. This is super slow because that whole data interface is designed for quickly sending data from the CPU to the GPU, not the other way around. Is PIX also able to measure the transfer time?

    I still don’t understand why you can’t run your physics/ai/whatever and call WaitForCompletion afterwards? I.e. Request the data as soon as possible, and wait for it to complete as late as possible.
     
  10. LightStriker

    I'm sorry... But what? That interface (PCIe) is just as fast in both directions. That's why we have 3 GB/s SSDs in M.2 slots. The data involved here is 3000 floats. It's 12 KB of data. It's nothing. Even at 60 frames per second, it's only 720 KB per second.

    That's exactly what I do. It still hangs the CPU for a while. On Xbox One, I get up to 9-10ms of the main thread waiting... and that's with the GPU having finished the task 10-12 ms earlier. Currently on Xbox, the GPU is busy 15-20% of the time.

    I checked with a friend to make sure I wasn't crazy - he's a 3D programmer on Assassin's Creed - and he concurs: that readback should take 0.1ms tops on Xbox One, not 10 ms. Something's fishy.
     
    Last edited: Jul 13, 2019
  11. richardkettlewell

    At this point I don’t think we are going to make any more progress on agreeing what this feature is capable of, performance-wise, on Xbox.

    If you think the Xbox can perform your async readback faster than it currently is, set up a minimal repro project and submit a bug report, along with your performance expectations, for our Xbox team to look at.

    Regarding the speed of PCIe and the whole 0.1ms thing - I oversimplified the problem. The issue is not one of bandwidth but rather latency. It would have been more accurate for me not to refer to it as a limitation of PCIe. I recommend doing some googling on the topic, e.g. I found this very quickly, and from a skim-read it appears to cover the topic well: https://community.khronos.org/t/why-is-gpu-cpu-transfer-slow/58708

    Best of luck - I hope you find a way to make it as fast as you need.
     
    Last edited: Jul 13, 2019
  12. LightStriker

    I know all this, and it would make perfect sense if the GPU already had tasks queued. But that's not the case. I have a starving GPU that the CPU can't keep fed. Even more, when we Dispatch, there's nothing else in the GPU queue at that moment. It's processed right away in 0.22ms. And then the next task happens 3-4 ms later, with updating skinned meshes after the physics update.

    But let's get back to the main point of this thread: why is there no "get result right away" method that does NOT block the main thread?

    Am I being unclear here? Something similar to WaitForCompletion, but would be WaitForCompletionAsync.

    I want the results as soon as possible - similar to WaitForCompletion - but I don't want to block the main thread while waiting. Is it dumb? If so, why?
     
  13. richardkettlewell

    No you aren’t being unclear - I think I answered that point some time ago..

    We can take this idea into consideration for when we are able to offer a renderer that can communicate with multiple script threads. Thanks for suggesting it.
     
  14. LightStriker

    Hmm.. I think I missed that bit, sorry.
     
  15. funkyCoty

    funkyCoty

    Joined:
    May 22, 2018
    Posts:
    727
    I feel like there was a fundamental misunderstanding here, and the topic was never really resolved.

    It sounds like @LightStriker wants to do the following: [Dispatch some work] [game code runs] [Get the result], like @richardkettlewell suggests. However, the issue here is that the [Get the result] step is taking way longer than it should. Richard mentioned latency, and I think that is the main factor that's wrong here. No matter how long you wait to [Get the result], the GPU is going to be busy doing something at the time you actually call it. You need to stall, wait for it to finish, and then actually get the result. Because the GPU is always going to be busy doing something, you're going to have that latency.

    I think what LightStriker is suggesting is an off-thread alternative so that we can get the data (introduce that sync point) from the GPU. Yeah, this is still going to stall the GPU because it needs to stop what it's doing and send data to the CPU. But the point here is that the main thread will not be blocked during this latency.

    There is the AsyncGPUReadbackRequest callback, but it seems to still have the same issue. At some point on the main thread, it asks for data back from the GPU, and in doing so it seems the GPU has to finish up whatever it's currently scheduled to do first, so the request's GetData can be randomly pretty slow. Depending on the game, you may get lucky and request the data when the GPU isn't actually busy, and in those scenarios it's as fast as it should be.
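
    For reference, a minimal sketch of the callback form mentioned above. The class name, the resultBuffer field, the RequestReadback method and the latest array are hypothetical. As far as I know, the callback is invoked on the main thread once the request has completed, and the NativeArray it exposes should be treated as temporary, hence the copy.

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component using the callback overload of AsyncGPUReadback.Request.
public class CallbackReadback : MonoBehaviour
{
    // Assumed to be created and filled elsewhere.
    public ComputeBuffer resultBuffer;

    // Managed copy of the most recent successful readback.
    float[] latest;

    public void RequestReadback()
    {
        // Returns immediately; OnReadback runs later, once the data is back.
        AsyncGPUReadback.Request(resultBuffer, OnReadback);
    }

    void OnReadback(AsyncGPUReadbackRequest request)
    {
        if (request.hasError) return;

        // Copy out of the temporary NativeArray so the data can outlive the callback.
        latest = request.GetData<float>().ToArray();
    }
}
```

    This avoids an explicit wait, but the result still arrives on Unity's schedule rather than at a point you choose within the frame.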
     
  16. richardkettlewell

    Unfortunately, Unity doesn't really support using its API off the main thread. There are a small number of exceptions, but most stuff, especially anything graphics related, must be called from the main scripting thread.

    No, it doesn't, because this new API breaks GetData up into multiple steps, giving you greater control. The steps performed by GetData are:

    1. Request the data
    2. Wait for the GPU to send the data back to the CPU
    3. Return the data

    With AsyncGPUReadbackRequest, you can issue step 1 but, instead of waiting in step 2, carry on with your app and periodically ask Unity "Hey, did you get my data back from the GPU yet?". If the answer is yes, you can get the data (step 3) with no delay. If it's not ready, you should wait a bit before asking again (e.g. on the next frame).
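
    The three steps can be sketched roughly like this, as an illustration under assumptions rather than a definitive implementation. PolledReadback, resultBuffer and the pending flag are made-up names, and the buffer is assumed to be created and filled elsewhere.

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical component: request once, then poll 'done' each frame
// instead of blocking in WaitForCompletion.
public class PolledReadback : MonoBehaviour
{
    // Assumed to be created and filled elsewhere.
    public ComputeBuffer resultBuffer;

    AsyncGPUReadbackRequest request;
    bool pending;

    void Update()
    {
        if (!pending)
        {
            if (resultBuffer == null) return;

            // Step 1: request the data. This call does not block.
            request = AsyncGPUReadback.Request(resultBuffer);
            pending = true;
            return;
        }

        // Step 2: ask whether the data is back yet; if not, try again next frame.
        if (!request.done) return;
        pending = false;

        if (request.hasError) return;

        // Step 3: the data is already on the CPU, so this returns without waiting.
        NativeArray<float> data = request.GetData<float>();
        // Use or copy 'data' here; it may be one or more frames old.
    }
}
```

    The trade-off is the one discussed throughout this thread: the main thread never stalls, but the result is not guaranteed to be available in the frame that requested it.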

    I thought I addressed that by asking for a bug report? :)
     
    Last edited: Oct 29, 2020
  17. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,751
    AFAIK, even on consoles, getting GPU data to the CPU at the same frame is going to have a cost. Even if you could wait in another thread, there's no guarantee the data will be available before the point in the frame where the main thread needs the data (if you're doing culling, this is before preparing the next frame for rendering) unless you force a GPU stall.

    If you look at games that use GPU occlusion queries, most of them use data from the previous frame, which is why sometimes a fast camera turn causes objects to pop in. The only way to use the data on the same frame reliably is when your rendering is also managed by the GPU, using indirect rendering or DX12/Vulkan indirect execution to pipeline the data.
     