Why is AsyncGPUReadback limited to the main thread?

LightStriker · Jul 3, 2019

I know, odd question, but the purpose of AsyncGPUReadback is to offload something to the GPU.

But WaitForCompletion can only be called on the main thread, and it may hang the thread for a while.

Would be nice to have the main thread free while we wait for the GPU.

WaitForCompletion is a very odd beast: when you call it, it can be quick or take a long while. It depends on the GPU queue and other things I'm not too sure about.

richardkettlewell · Jul 5, 2019

You should try to avoid using the Wait method. You should check if it’s completed instead, with the non-blocking “done” property.

https://docs.unity3d.com/ScriptReference/Rendering.AsyncGPUReadbackRequest-done.html

LightStriker · Jul 5, 2019

richardkettlewell said:

> You should try to avoid using the Wait method. You should check if it’s completed instead, with the non-blocking “done” property.
>
> https://docs.unity3d.com/ScriptReference/Rendering.AsyncGPUReadbackRequest-done.html

I need it later in the same frame as when it was called. Any clue how to do that without blocking the main thread?

Since it returns NativeArray<>, it would have been nice to offload it to a different thread.

richardkettlewell · Jul 5, 2019

If you need the result on the same frame, ultimately the main thread is going to have to risk waiting for it with a call to WaitForCompletion. All you can do is call it as late as possible for your use case.

The high performance use case is never to need gpu data on the cpu on the same frame, but rather for it to be ok for the data to arrive a small number of frames later, so you can use the “done” property and wait another frame if not ready.

LightStriker · Jul 5, 2019

richardkettlewell said:

> If you need the result on the same frame, ultimately the main thread is going to have to risk waiting for it with a call to WaitForCompletion. All you can do is call it as late as possible for your use case.
>
> The high performance use case is never to need gpu data on the cpu on the same frame, but rather for it to be ok for the data to arrive a small number of frames later, so you can use the “done” property and wait another frame if not ready.

Sadly, no can do. We use it for depth culling.

We tried to dispatch as early as we can, and WaitForCompletion as late as possible... but that turned out to be rather bad, because how long it's going to wait is highly dependent on what's happening with the CPU/GPU at that moment.

Using PIX, we found out that our process only takes 0.22ms to complete on Xbox One. Calling Wait right away makes it stall the CPU for 1-1.2 ms. If we call it later - say, after processing physics - we sometimes stall for MUCH longer, and the GPU is busy doing something else, such as updating skinned objects.

Again... Why is this main thread only? It returns a NativeArray, which, since it's used in Jobs, I would assume is thread safe. Or maybe have an AsyncGPUReadback that forces readback as soon as possible.

richardkettlewell · Jul 6, 2019

LightStriker said:

> Again... Why is this main thread only?

I’m not sure why this question matters. If you need it on the same frame, then, if you could run it on a thread, you’d still need to sync (join) that thread at some point? Which is the same as calling WaitForCompletion at that point, isn’t it?

Anyway, all Unity script API that interacts with graphics must currently run on the main thread.

richardkettlewell · Jul 6, 2019

LightStriker said:

> Or maybe have an AsyncGPUReadback that forces readback as soon as possible.

There is no way to ask the GPU to do it any sooner. It is done as soon as possible already.

LightStriker · Jul 6, 2019

richardkettlewell said:

> I’m not sure why this question matters. If you need it on the same frame, then, if you could run it on a thread, you’d still need to sync (join) that thread at some point? Which is the same as calling WaitForCompletion at that point, isn’t it?
>
> Anyway, all Unity script API that interacts with graphics must currently run on the main thread.

Because it wouldn't lock up the main thread. I could run physics, AI, etc. while waiting to get back the culling result. I have literally 4ms of stuff to do between dispatching and before I start rendering stuff.

In some way, WaitForCompletion has an issue. PIX shows me my dispatch is handled right away, and takes 0.22ms to complete. But for some reason, if I put WaitForCompletion 4 ms later, it still locks up the main thread for 0.5-1.8ms.

I see here that WaitForCompletion does more than just waiting for "done" to be true. It calls Gfx.UpdateAsyncReadbackData. Would it be possible to have that Gfx.UpdateAsyncReadbackData call done right away without locking up the main thread?

richardkettlewell · Jul 6, 2019

LightStriker said:

> PIX shows me my dispatch is handled right away, and takes 0.22ms to complete.

That sounds like how long it takes for the Compute Shader (or whatever) to run on the GPU. After that, the results must be transferred back across the PCIe bus to the CPU. This is super slow because that whole data interface is designed for quickly sending data from the CPU to the GPU, not the other way around. Is PIX also able to measure the transfer time?

LightStriker said:

> Because it wouldn't lock up the main thread. I could run physics, AI, etc. while waiting to get back the culling result.

I still don’t understand why you can’t run your physics/ai/whatever and call WaitForCompletion afterwards? I.e. Request the data as soon as possible, and wait for it to complete as late as possible.

LightStriker · Jul 13, 2019

richardkettlewell said:

> That sounds like how long it takes for the Compute Shader (or whatever) to run on the GPU. After that, the results must be transferred back across the PCIe bus to the CPU. This is super slow because that whole data interface is designed for quickly sending data from the CPU to the GPU, not the other way around. Is PIX also able to measure the transfer time?

I'm sorry... But what? That interface (PCIe) is as fast in both directions. That's why we have 3 GB/s SSDs in M.2 slots. The data involved here is 3000 floats. It's 12 KB of data. It's nothing. Even at 60 frames per second, it's only 720 KB per second.
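A quick sketch of that arithmetic, for anyone who wants to check it. It assumes 4-byte (32-bit) floats, and it borrows the 3 GB/s SSD figure purely as an assumed stand-in for link bandwidth; it is not a measured number for any console. All it shows is that the ideal transfer time for this payload is in the microsecond range, so raw bandwidth alone would not explain a stall of several milliseconds.

```python
# Back-of-the-envelope check of the readback payload size.
FLOAT_SIZE_BYTES = 4          # assumption: 32-bit floats
FLOAT_COUNT = 3000            # from the post
FRAMES_PER_SECOND = 60        # from the post
ASSUMED_BANDWIDTH_BPS = 3e9   # assumption: 3 GB/s, borrowed from the SSD figure

# Size of one readback and the sustained traffic if it happens every frame.
payload_bytes = FLOAT_COUNT * FLOAT_SIZE_BYTES
bytes_per_second = payload_bytes * FRAMES_PER_SECOND

# Ideal transfer time if bandwidth were the only cost (it ignores latency).
transfer_time_us = payload_bytes / ASSUMED_BANDWIDTH_BPS * 1e6

print(f"payload per readback: {payload_bytes} bytes")
print(f"traffic at 60 fps:    {bytes_per_second} bytes/s")
print(f"ideal transfer time:  {transfer_time_us:.1f} microseconds")
```

This deliberately leaves out synchronization and queueing latency, which is a separate cost from bandwidth.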

richardkettlewell said:

> I still don’t understand why you can’t run your physics/ai/whatever and call WaitForCompletion afterwards? I.e. Request the data as soon as possible, and wait for it to complete as late as possible.

That's exactly what I do. It still hangs the CPU for a while. On Xbox One, I get up to 9-10ms of the main thread waiting... and that's with the GPU having finished the task 10-12 ms earlier. Currently on Xbox, the GPU is busy 15-20% of the time.

I checked with a friend to make sure I wasn't crazy - a 3D programmer on Assassin's Creed - and he concurs: that readback should take 0.1ms tops on Xbox One, not 10 ms. Something's fishy.

richardkettlewell · Jul 13, 2019

At this point I don’t think we are going to make any more progress on agreeing what this feature is capable of performance-wise on Xbox.

If you think the Xbox can perform your async readback faster than it currently is, set up a minimal repro project and submit a bug report, along with your performance expectations, for our Xbox team to look at.

Regarding the speed of PCIe and the whole 0.1ms thing - I oversimplified the problem. The issue is not one of bandwidth but rather latency. It would have been more accurate for me to not refer to it as a limitation of PCIe. I recommend doing some googling about the topic, e.g. I found this very quickly, and from a skim-read it appears to cover the topic well: https://community.khronos.org/t/why-is-gpu-cpu-transfer-slow/58708

Best of luck - I hope you find a way to make it as fast as you need.

LightStriker · Jul 13, 2019

richardkettlewell said:

> At this point I don’t think we are going to make any more progress on agreeing what this feature is capable of performance-wise on Xbox.
>
> If you think the Xbox can perform your async readback faster than it currently is, set up a minimal repro project and submit a bug report, along with your performance expectations, for our Xbox team to look at.
>
> Regarding the speed of PCIe and the whole 0.1ms thing - I oversimplified the problem. The issue is not one of bandwidth but rather latency. It would have been more accurate for me to not refer to it as a limitation of PCIe. I recommend doing some googling about the topic, e.g. I found this very quickly, and from a skim-read it appears to cover the topic well: https://community.khronos.org/t/why-is-gpu-cpu-transfer-slow/58708
>
> Best of luck - I hope you find a way to make it as fast as you need.

> A GPU -> CPU readback introduces a “sync point” where the CPU must wait for the GPU to complete its calculations. During this time, the CPU stops feeding the GPU with data, causing it to stall.
>
> Now, remember that a modern GPU is designed in a highly parallel manner, with thousand threads in flight at any given moment. The sync point must wait for all those threads to finish processing, before it can readback the result of their calculations. Once the readback is complete, all those threads must restart execution from zero… bad!

I know all this, and it would make perfect sense if the GPU already had tasks queued. But that's not the case. I have a starving GPU that the CPU can't keep fed. What's more, when we Dispatch, there's nothing else in the GPU queue at that moment. It's processed right away in 0.22ms. And then the next task happens 3-4 ms later, with updating skinned meshes - post physics update.

But let's get back to the main point of this thread: why is there no "get result right away" method that is NOT main thread blocking?

Am I being unclear here? Something similar to WaitForCompletion, but would be WaitForCompletionAsync.

I want the results as soon as possible - similar to WaitForCompletion - but I don't want to block the main thread while waiting. Is it dumb? If so, why?

richardkettlewell · Jul 13, 2019

LightStriker said:

> But let's get back to the main point of this thread: why is there no "get result right away" method that is NOT main thread blocking?
>
> Am I being unclear here? Something similar to WaitForCompletion, but would be WaitForCompletionAsync.

No you aren’t being unclear - I think I answered that point some time ago..

richardkettlewell said:

> Anyway, all Unity script API that interacts with graphics must currently run on the main thread.

We can take this idea into consideration for when we are able to offer a renderer that can communicate with multiple script threads. Thanks for suggesting it.

LightStriker · Jul 13, 2019

richardkettlewell said:

> No you aren’t being unclear - I think I answered that point some time ago..
>
> We can take this idea into consideration for when we are able to offer a renderer that can communicate with multiple script threads. Thanks for suggesting it.

Hmm.. I think I missed that bit, sorry.

funkyCoty · Oct 28, 2020

I feel like there was a fundamental misunderstanding here, and the topic was never really resolved.

It sounds like @LightStriker wants to do the following: [Dispatch some work] [game code runs] [Get the result], like @richardkettlewell suggests. However, the issue here is that the [Get the result] step is taking way longer than it should. Richard mentioned latency, and I think that is the main factor that's wrong here. No matter how long you wait to [Get the result], the GPU is going to be busy doing something at the time you actually call it. You need to stall, wait for it to finish, and then actually get the result. Because the GPU is always going to be busy doing something, you're going to have that latency.

I think what LightStriker is suggesting is an off-thread alternative so that we can get the data (introduce that sync point) from the GPU. Yeah, this is still going to stall the GPU because it needs to stop what it's doing and send data to the CPU. But the point here is that the main thread will not be blocked during this latency.

There is the AsyncGPUReadbackRequest callback, but it seems to still have the same issue. At some point on the main thread, it asks for data back from the GPU, and in doing so it seems the GPU has to finish up whatever it's currently scheduled to do first, so the request's GetData can be randomly pretty slow. Depending on the game, you may get lucky and request the data when the GPU isn't actually busy, and in those scenarios it's as fast as it should be.

richardkettlewell · Oct 29, 2020

funkyCoty said:

> an off-thread alternative

Unfortunately, Unity doesn't really support using its API off the main thread. There are a small number of exceptions, but most stuff, especially anything graphics related, must be called from the main scripting thread.

funkyCoty said:

> There is the AsyncGPUReadbackRequest callback, but it seems to still have the same issue.

No, it doesn't, because this new API breaks GetData up into multiple steps, giving you greater control. The steps performed by GetData are:

1. Request the data
2. Wait for the GPU to send the data back to the CPU
3. Return the data

With AsyncGPUReadbackRequest, you can issue step 1, but, instead of waiting in step 2, carry on with your app, and periodically ask Unity "Hey, did you get my data back from the GPU yet?". If the answer is yes, you can get the data (step 3) with no delay. If it's not ready, you should wait a bit before asking again (e.g. on the next frame).
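The three steps and the per-frame polling can be modelled with a small sketch. This is not Unity code: `FakeReadbackRequest`, `tick`, `get_data` and `run_frames` are hypothetical names made up for illustration, with `done` standing in for the real "done" property and `get_data` for the request's GetData. The number of frames until the data arrives is an arbitrary assumption.

```python
class FakeReadbackRequest:
    """Hypothetical stand-in for an async readback request (not Unity's API)."""

    def __init__(self, data, frames_until_ready):
        self._data = data
        self._frames_left = frames_until_ready

    def tick(self):
        # Simulates one frame passing while the GPU finishes and the copy lands.
        if self._frames_left > 0:
            self._frames_left -= 1

    @property
    def done(self):
        return self._frames_left == 0

    def get_data(self):
        if not self.done:
            raise RuntimeError("data requested before the readback finished")
        return self._data


def run_frames(request, max_frames):
    """Poll once per frame; return (frame_index, data) when ready, else None."""
    for frame in range(max_frames):
        # ... the rest of the frame's work (physics, AI, rendering) runs here ...
        if request.done:
            return frame, request.get_data()  # step 3: no waiting involved
        request.tick()                         # step 2 skipped: ask again next frame
    return None


# Step 1: issue the request. Here the data is assumed to arrive two frames later.
request = FakeReadbackRequest(data=[0.5, 1.0, 0.25], frames_until_ready=2)
result = run_frames(request, max_frames=10)
print(result)  # (2, [0.5, 1.0, 0.25])
```

The trade-off is visible in the result: the main thread never blocks, but the data is consumed on a later frame than the one that requested it.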

funkyCoty said:

> However, the issue here is that the [Get the result] step is taking way longer than it should.

I thought I addressed that by asking for a bug report?

Neto_Kokku · Oct 29, 2020

AFAIK, even on consoles, getting GPU data to the CPU at the same frame is going to have a cost. Even if you could wait in another thread, there's no guarantee the data will be available before the point in the frame where the main thread needs the data (if you're doing culling, this is before preparing the next frame for rendering) unless you force a GPU stall.

If you look at games that use GPU occlusion queries, most of them use data from the previous frame, which is why sometimes a fast camera turn causes objects to pop in. The only way to use the data on the same frame reliably is when your rendering is also managed by the GPU, using indirect rendering or DX12/Vulkan indirect execution to pipeline the data.
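To illustrate the pop-in effect, here is a toy model. It is only a sketch under loud assumptions: a simple field-of-view test stands in for real occlusion query results, the 60 degree field of view and the object angles are invented, and `visible` and `simulate` are hypothetical helpers. Each frame draws with the visibility set computed for the previous frame's camera.

```python
HALF_FOV_DEG = 30.0  # assumption: 60 degree field of view


def visible(object_angles, camera_yaw):
    """Names of objects inside the field of view (no angle wrap-around handling)."""
    return {name for name, angle in object_angles.items()
            if abs(angle - camera_yaw) <= HALF_FOV_DEG}


def simulate(object_angles, yaw_per_frame):
    """Per frame, list objects that should be drawn but are missing because the
    visibility set in use was computed for the previous frame's camera."""
    missing_per_frame = []
    previous_result = visible(object_angles, yaw_per_frame[0])  # warm start
    for yaw in yaw_per_frame:
        drawn = previous_result                      # one-frame-old result
        truly_visible = visible(object_angles, yaw)  # what this frame needs
        missing_per_frame.append(sorted(truly_visible - drawn))
        previous_result = truly_visible              # becomes usable next frame
    return missing_per_frame


objects = {"crate": 0.0, "tower": 80.0}  # angles in degrees, invented values
slow_turn = [0.0, 1.0, 2.0, 3.0]
fast_turn = [0.0, 60.0, 60.0]

print(simulate(objects, slow_turn))  # [[], [], [], []]
print(simulate(objects, fast_turn))  # [[], ['tower'], []]
```

With a slow turn the stale data is always good enough. With the fast turn, the tower is missing for exactly one frame, which is the pop-in.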
