
Is there any research being done by UT on Burst to GPU assembly generation?

Discussion in 'Burst' started by Arowx, Apr 4, 2018.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    GPUs are coming on in leaps and bounds, with massive parallelism and teraflops of processing power that is mainly used for rendering. However, there is a window between frames when the GPU is probably sitting idle, waiting for the next set of scene data to process.

    What if Unity's super fast, lightweight, multi-threaded ECS/Burst system could also target the GPU?

    Currently, games generate transform updates on the CPU that are then passed to the GPU. What if some of those calculations stayed on the GPU?

    E.g. particle systems have already been shown to have massive performance improvements when running on the GPU.

    Entire gravitational galaxy simulations have been running on GPUs for some time. So allowing developers to write ECS code that could be targeted at the CPU or the GPU would be amazing IMHO.

    Is there any research being done by UT on Burst to GPU assembly generation?
     
    Last edited: Apr 4, 2018
    Mr-Mechanical and tatoforever like this.
  2. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    I know what you mean, but at this stage I'd rather keep them separated and let each system be optimised for the type of computing resources it's really designed for, especially since you can get performance issues in areas involving data being sent between CPU and GPU land.

    On the GPU side right now, I'm very happy with Unity's use of compute shaders. And there is a graph/node-based GPU VFX system coming to Unity which we should hear more about this summer.

    It would be great if future Unity graph programming systems included ECS one day, done in a manner that helps people get their heads round entity stuff and parallel programming. And if all goes well then maybe there is potential to merge certain aspects one day, but I'm not sure I really expect that. After all, one of the things OpenCL had going for it was the ability to target both CPU and GPU resources, but that didn't exactly set the world on fire.
     
  3. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Also, I'm often far more likely to have run out of spare GPU capacity than CPU capacity, and the continual evolution of what can be done and what's expected in terms of rendering seems to keep it that way.

    I do have the luxury of making arty-type projects where I can target particularly powerful PC hardware, so I'm able to go bonkers with compute-shader-based effects and simulations. It's hard to make the same systems truly game-ready, though, when they are eating such a large chunk of the overall pie.

    Also, a cautionary tale: some of the fancier GameWorks modules have not always been well received in reviews of games that use them, especially framerate-obsessed reviews, where the additional load of something like Flow or FleX on the GPU is noticed and not necessarily appreciated :(
     
  4. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Plus, although I'm not knowledgeable enough to accurately convey the detail to other people, there are enough architectural differences between computing on the GPU and the CPU that I'm not sure it's really worth trying; the sacrifices may easily outweigh any gains. Latency differences, memory differences, differences in the number of parallel 'threads' available: these things add up and tend to make CPUs and GPUs suited to different tasks, with some overlap, but not as much as some probably think.
     
    pvloon likes this.
  5. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    In a game development context, the GPU is usually used to offload calculations that will remain on the GPU. For example, per-triangle packing and culling is done with a compute shader in Frostbite. Things like drawing and culling vegetation, or particle systems, are also real CPU-and-GPU-saving things.

    Most modern game engine designs identify what problems can be shifted to compute. This alone is the CPU saving you are looking for. Mixing both would only be a win for a very specific design.

    ECS is a general purpose design that will work on all hardware regardless of the GPU. Consider the scenario where you want ECS to do more work because the GPU is already maxed out for calculations.

    ALU limits on the GPU are quite generous, but memory bandwidth is still far too low, and we want to minimise communication between GPU and CPU on current GPU architecture and hardware. That means offloading whole problems to the GPU with the idea that the results are really only needed by the GPU. So mixing it with ECS would probably force both to be weaker at present.

    Obviously these are just thoughts, as I've not done any research at all, but from what I know, the current approach of offloading complete problems to compute that don't need to be read back is the best use of resources for most things.

    ECS is a genius design coupled with jobs and networking. Imagine networking waiting on a GPU, which will stall if you request immediate results?
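
    To make that concrete, here's a rough sketch of the "offload the whole problem, never read it back" pattern in Unity C#; the shader asset, kernel name and buffer layout are hypothetical:

    Code (CSharp):
    using UnityEngine;

    // Minimal sketch of the pattern described above: the simulation runs in a
    // compute shader and the results are only ever consumed by other GPU work
    // (here, a material), so nothing is copied back to the CPU.
    public class GpuOnlySimulation : MonoBehaviour
    {
        public ComputeShader simulationShader;   // assumed to contain a "Step" kernel with numthreads(64,1,1)
        public Material renderMaterial;          // assumed to read _Particles in its vertex shader
        const int ParticleCount = 65536;

        ComputeBuffer particles;
        int stepKernel;

        void Start()
        {
            // One float4 (e.g. position + age) per particle, living entirely in GPU memory.
            particles = new ComputeBuffer(ParticleCount, sizeof(float) * 4);
            stepKernel = simulationShader.FindKernel("Step");
            simulationShader.SetBuffer(stepKernel, "_Particles", particles);
            renderMaterial.SetBuffer("_Particles", particles);
        }

        void Update()
        {
            // Advance the simulation on the GPU. Note there is no GetData() call:
            // the CPU never sees the results, so no PCIe round trip is needed.
            simulationShader.Dispatch(stepKernel, ParticleCount / 64, 1, 1);
        }

        void OnDestroy() => particles.Release();
    }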
     
    Ryiah and elbows like this.
  6. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Perhaps the term GPGPU is a little misleading. Sure, far more general purpose than GPUs were once upon a time, but still not really the sensible option for all general purposes. Everything in its right place.
     
    hippocoder likes this.
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    People mention the CPU-GPU bandwidth bottleneck, but how slow is it really?

    On modern PC hardware it's PCIe x16, which has a potential of about 8 GB/s on PCI Express v2 (2007); newer versions have higher bandwidth (https://en.wikipedia.org/wiki/PCI_Express):

    PCIe x16
    v2 (2007): 8.0 GB/s, or ~130 MB per frame @ 60 Hz
    v3 (2010): 15.8 GB/s, or ~260 MB per frame @ 60 Hz
    v4 (2017): 31.5 GB/s, or ~520 MB per frame @ 60 Hz
    v5 (~2019): 63 GB/s, or ~1.05 GB per frame @ 60 Hz

    Wow, it's really quite low per frame, and if you wanted 'intra-frame' compute this could be halved again as you're sharing it with the rendering pipeline.
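
    A quick sanity check on those per-frame numbers, just dividing the peak x16 bandwidth by the frame rate and ignoring protocol overhead or anything else sharing the bus:

    Code (CSharp):
    // Back-of-the-envelope check of the per-frame figures above.
    class PcieFrameBudget
    {
        static void Main()
        {
            (string gen, double gbPerSec)[] gens =
            {
                ("v2", 8.0), ("v3", 15.8), ("v4", 31.5), ("v5", 63.0)
            };
            const double fps = 60.0;

            foreach (var (gen, gbPerSec) in gens)
            {
                // e.g. 8.0 GB/s -> ~133 MB available per 60 fps frame
                double mbPerFrame = gbPerSec * 1000.0 / fps;
                System.Console.WriteLine(
                    $"PCIe x16 {gen}: {gbPerSec} GB/s = ~{mbPerFrame:F0} MB per frame at {fps} fps");
            }
        }
    }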

    What about APUs, where the CPU, GPU and RAM are embedded on the same socket/chip, allowing the graphics RAM to be shared with the CPU?
     
  8. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Wait a minute, to put this into context: 30 GB/s is a good DRAM speed, yet this is classed as a bottleneck in throughput???

    Or is it that GPUs don't have memory caches the way CPUs do to speed up throughput?
     
  9. LennartJohansen

    LennartJohansen

    Joined:
    Dec 1, 2014
    Posts:
    2,394
    Memory bandwidth on the GPU itself is fast. This is about transferring between the CPU and the GPU. 30 GB/s is 500 MB per frame (at 60 fps), so you can see it will take time to move data. Ideally you compute the data that will stay on the GPU on the GPU.
     
    hippocoder likes this.
  10. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    My point is that this is about the same speed as CPU to/from RAM, so why is it classed as slow?
     
  11. PlayingPolicy

    PlayingPolicy

    Joined:
    Dec 20, 2013
    Posts:
    15
    It's not. Modern CPU <-> GPU communication throughput is rarely a bottleneck worth considering; certainly not in this thread's context. Maybe it comes up in AAA games streaming in massive textures and the like.

    The real problem isn't throughput, but latency. When you read data on the CPU or GPU that was prepared by the other unit, it's around 1-3 frames old. So to split your game simulation between the two, you have an enormous engineering challenge to make it work correctly. The most advanced example I know of is the latest (experimental) incarnation of NVIDIA PhysX, where they split different aspects of the rigid body simulation between CPU and GPU (there are slides for that thing somewhere). Note the mention in those slides that it took NVIDIA, a company with bottomless pockets for R&D, three attempts to get it right. Probably not the sort of thing you can afford to delve into if you're an indie.

    So the only practical alternative is to run the entire game simulation on the GPU. That's doable, but in all my tests I've found the hit to graphics headroom intolerable (as elbows mentions, game reviewers find the same). The best option really is just to run the game on the CPU, hence the broad lack of interest in pure GPU solutions.

    Another factor is that even low-cost PCs and laptops these days have decent CPUs.


    Today, there's no real qualitative difference in the kind of computations suited to CPU vs GPU. On either unit, you need lots of parallelizable work to attain reasonable utilization of the hardware resources. Just how much work is the chief difference. On CPU you can get away with fewer, "chunkier" units of work. On GPU, you need an enormous number of fine-grained work units in flight at one time to saturate the hardware. Obviously the details of programming for either are substantially different, too.
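
    To make the latency point concrete, here's a minimal Unity sketch (assuming some ComputeBuffer that GPU work fills each frame): logging the frame a readback was requested on against the frame its data arrives on will typically show a gap of a frame or more.

    Code (CSharp):
    using Unity.Collections;
    using UnityEngine;
    using UnityEngine.Rendering;

    // Illustration of the latency point: data read back from the GPU arrives a
    // frame or more after it was requested. "resultsBuffer" is assumed to be a
    // ComputeBuffer that some GPU work writes to each frame.
    public class ReadbackLatencyProbe : MonoBehaviour
    {
        public ComputeBuffer resultsBuffer;

        void Update()
        {
            int requestedOnFrame = Time.frameCount;
            AsyncGPUReadback.Request(resultsBuffer, request =>
            {
                if (request.hasError) return;
                NativeArray<float> data = request.GetData<float>();
                // On typical hardware/drivers this prints a gap of one or more frames.
                Debug.Log($"Requested on frame {requestedOnFrame}, " +
                          $"received on frame {Time.frameCount}, {data.Length} floats");
            });
        }
    }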
     
    elbows likes this.
  12. korzen303

    korzen303

    Joined:
    Oct 2, 2012
    Posts:
    223
    elbows likes this.
  13. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    I'd forgotten how nice and clear that one was! And it certainly touches on a number of the issues we've mentioned in this thread. It's especially handy that they frame the whole thing in terms of the limited GPU budget that VFX are likely to be given in most projects, and of using async compute to get more bang for their buck.

    It doesn't change my opinion about the folly of trying to use the GPU for broader ECS and Burst things in Unity, though. The detail in documents like these shows just how many decisions need to be taken with care at every stage to get the balance right in terms of performance, functionality and optimal GPU use, and all that despite the fact that they've chosen very carefully what system it even makes sense to consider using the GPU for (in this case particle-based physics). I never say never about a lot of things though, so I'm not ruling possibilities on this front out forever, but I wouldn't like to predict when it might seem more sensible or doable, especially as plenty keeps arriving to put pressure on GPUs, be it VR, 4K, new expectations about realtime raytracing one day, etc.

    In the meantime, I'm just eager to see the modern Unity systems mature and other ones arrive (such as the GPU VFX via node graphs thing). And I say all this as someone who does have the luxury of spending a huge chunk of GPU budget on effects, simulations and physics for arty stuff, on hardware I can spec myself. So I love to use absurd amounts of GPU power for single, fanciful tasks. But I'm more than happy to build such systems with the GPU in mind specifically; hybrids might be interesting one day, but right now it'd just feel like muddying the waters to me. Maybe in the future of Unity, when more of these systems have node-graph ways of writing for them, some forms of overlap will be easier to ponder and consolidate, on the tools side if not in the underlying code realities.
     
  14. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Hi,
    We definitely have a plan behind `burst` to project IL to the GPU (either for GPGPU or directly to shaders) and provide seamless integration, but it will take time to get there. We even had a prototype of Burst being projected to HLSL shaders last year during a Unity HackWeek...

    The main interest in this scenario would be to rely on the integrated GPUs that are often part of existing processors but are actually not used. They would provide significant computing power in addition to the CPU, and with proper shared memory in place between the integrated GPU and the CPU, this could handle part of the ECS workload very efficiently.

    We haven't really started on this, though, so you will have to be patient! ;)
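
    For readers who haven't used Burst yet, this is roughly the kind of data-parallel kernel such a projection would start from; a plain Burst job as it exists today, nothing GPU-specific (the job and its fields are just illustrative):

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    // A plain Burst-compiled parallel job: a data-parallel loop over blittable
    // data, the general shape a future IL-to-GPU projection could target.
    [BurstCompile]
    public struct IntegrateVelocitiesJob : IJobParallelFor
    {
        [ReadOnly] public NativeArray<float3> velocities;
        public NativeArray<float3> positions;
        public float deltaTime;

        public void Execute(int index)
        {
            positions[index] += velocities[index] * deltaTime;
        }
    }

    // Scheduling it from managed code, e.g. inside a system's update:
    // var handle = new IntegrateVelocitiesJob
    // {
    //     velocities = velocities,
    //     positions = positions,
    //     deltaTime = dt
    // }.Schedule(positions.Length, 64);
    // handle.Complete();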
     
    MineLucky, TerraUnity, Orimay and 9 others like this.
  15. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Ah that scenario makes sense, thanks for the info!
     
  16. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,982
    So, a layman's rule of thumb for any newer members (this won't be correct 100% of the time, but it's basic enough to understand):

    Basically, use compute shaders to optimize anything that is currently being sent back and forth to the GPU, where it's easy enough / makes sense to run in parallel (ideally so the data can just stay on the GPU).

    For everything else, there are jobs and ECS, which essentially just let you leverage threaded access and threaded modification of data where it couldn't be leveraged before.
     
  17. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Well, if you request data immediately, the GPU has to stop all of its jobs in flight and service your request, which is a huge performance loss; but you can instead request the information back when the GPU is ready, to play nice and avoid the stall. This is what other people in this thread refer to as latency, and it will differ depending on GPU and driver. So the data is going to be at least one frame out of sync, and trying to get that to work with ECS is challenging, as ECS has similar issues of its own in this case.
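
    In current Unity API terms, that's roughly the difference between an immediate ComputeBuffer.GetData call and an AsyncGPUReadback request; a sketch, with the buffer assumed to be filled by earlier GPU work:

    Code (CSharp):
    using UnityEngine;
    using UnityEngine.Rendering;

    // Sketch of the two readback styles described above. "buffer" is assumed to
    // be filled by earlier GPU work (a compute dispatch, for example) and to
    // hold 1024 floats.
    public class ReadbackStyles : MonoBehaviour
    {
        public ComputeBuffer buffer;
        float[] cpuCopy = new float[1024];

        // 1) Immediate readback: blocks the CPU until the GPU has finished the
        //    work in flight and copied the data back. This is the stall.
        void ReadImmediately()
        {
            buffer.GetData(cpuCopy);
        }

        // 2) Deferred readback: the request is serviced when the GPU is ready,
        //    and the data arrives via the callback one or more frames later.
        void ReadWhenReady()
        {
            AsyncGPUReadback.Request(buffer, request =>
            {
                if (!request.hasError)
                    Debug.Log($"Readback ready: {request.GetData<float>().Length} floats");
            });
        }
    }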

    So I think both can work together on the same problem, provided the problem can be divided and you do not require specific sync points or need the data immediately.

    Like you, I'm just a layman in this stuff. I defer to brighter minds on this forum :)
     
    Orimay likes this.