GPU Instancing

Discussion in '5.4 Beta' started by djarcas, Dec 29, 2015.

  1. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Have you tried switching the new Graphics Jobs thing on now that it is apparently present in this beta version?

    Because unless I'm really confused, it's one of the things shown off at the recent VR/AR conference that looks set to improve Unity performance notably. So I'd love to know what it does to your numbers, or whether it's already on for the numbers you gave.
     
  2. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    Didn't even see that option! Turning Graphics Jobs (experimental) on gives nearly identical results. Perhaps a tad cheaper, but not to the point where I could safely say it was making any difference.
     
  3. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Yeah I got round to trying it myself last night and it didn't make much difference to a simple scene with thousands of instantiated objects. Instancing seems to do well though :)

    Some numbers:

    20,000 Unity cubes with colliders removed and shadow casting off. One directional light. Standard shader (and the instanced version of the Standard shader given in this thread).

    No batching or instancing: CPU 29ms, render thread 16.8ms
    Dynamic batching: CPU 21ms, render thread 8.5ms
    Instancing: CPU 17.1ms, render thread 2.3ms

    Yay, nice work team Unity :)
     
    hippocoder likes this.
  4. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    Unfortunately, 18ms to draw 20,000 cubes isn't particularly brilliant. Is there going to be a solution to push lots and lots of things to the GPU without them necessarily being Unity objects? My game knows where all my conveyor belts are; in XNA, you'd just maintain a huge list of where they were, and push that to the GPU with next to no CPU overhead.
     
    elias_t and hippocoder like this.
  5. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    There are ways if you are prepared to do some of the heavy lifting yourself.

    For example I have played with Kvant Spray mesh particle system that someone from Unity Japan wrote, and it has its own way of combining objects into single instances.

    https://github.com/keijiro/KvantSpray

    They have some other systems that may be even more relevant to you but I haven't played with those as much. e.g. https://github.com/keijiro/KvantWall
     
    Last edited: Feb 18, 2016
  6. sinems

    sinems

    Joined:
    Feb 18, 2016
    Posts:
    1
    thank you, it is true
     
  7. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    I think what would be optimal is passing an array of InstancedMesh to DrawInstancedMeshes or such. Build a fat list and pass it for rendering. Any plans for something like this, Unity? Cut out all the chaff, so it becomes properly viable for things like grass.

    I did have high hopes for using it for grass and lots of little things but as it stands, it's not really the superstar it could be. It's nice, it helps, but it could go a lot further.

    Edit:
    AFAIK since this is shader based, DrawMesh should work fine for individual elements without the gameobject/transform/hierarchy overhead. Anyone tried?
     
    Last edited: Feb 18, 2016
    elias_t likes this.
  8. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    Not yet, because I didn't have much time the other day and just needed to try something very quickly to see it working in a basic way. The numbers I posted were really just to prove it was working in a noticeable way and what performance could be gained on that front alone, not taking into account other bottlenecks.
     
  9. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    SpeedTree stuff is what interests me most personally for this feature, any idea how far away we are from seeing some support for that?
     
    AdamGoodrich likes this.
  10. alexandre-fiset

    alexandre-fiset

    Joined:
    Mar 19, 2012
    Posts:
    717
    +1 for SpeedTree GPU instancing. This would be a massive gain, as SpeedTree accounts for about 60% of our draw calls. Streaming textures would also be really helpful.
     
    AdamGoodrich likes this.
  11. mgooding

    mgooding

    Joined:
    Mar 6, 2014
    Posts:
    10
    I tried this a few days ago. Unity unfortunately crashes with an error about insufficient RenderNodeQueue memory with ~5000 Graphics.DrawMesh calls.

    In my experience, the current limitation for the conventional path is that dynamic batching occurs before instancing, so you can actually draw more 900-vertex objects than cubes in a frame.
     
    hippocoder likes this.
  12. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    We found the source of the issue, it was a regression introduced by the multithreaded rendering refactoring work in b5 or b6. It will be fixed in b7. Thank you for your patience.
     
    Last edited: Feb 20, 2016
  13. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    Is this likely to be the reason why most of the CPU time (when using Graphics.DrawMesh) is spent in JobAlloc.Overflow?
     
  14. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    @mgooding - is there a trick to this I'm missing?

    for (int i = 0; i < maPositions.Length; i++)
    {
        // One draw submission per entry (assumes maPositions is a Vector3[])
        Graphics.DrawMesh(lMesh, maPositions[i], Quaternion.identity, lMat, 0);
    }

    Doing that loop 4096 times takes 7.26ms - I can't really see how I could optimise it further than this.

    I'll expand this a little further:

    (The meshes I have been using have 88 verts in them, but I have shadows on, so I won't be 'suffering' from any Dynamic Batching)

    4096 DrawMeshes: CPU 26ms, GPU 5ms (instanced shader)
    4096 DrawMeshes: CPU 27ms, GPU 9.25ms (non-instanced shader)
    4096 GameObjects: CPU 9ms, GPU 4ms (instanced shader)
    4096 GameObjects: CPU 17ms, GPU 8ms (non-instanced shader)
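
    For anyone wanting to reproduce the main-thread cost of a loop like the one above, here is a minimal timing sketch. The class and field names (DrawMeshTiming, mesh, material, count) and the grid layout are illustrative placeholders, not taken from the project. It only measures how long the submission loop itself takes on the main thread, not render-thread or GPU time.

    ```csharp
    using System.Diagnostics;
    using UnityEngine;

    // Hypothetical test component: submits `count` Graphics.DrawMesh calls per frame
    // and logs how long the submission loop took on the main thread.
    public class DrawMeshTiming : MonoBehaviour
    {
        public Mesh mesh;
        public Material material;
        public int count = 4096;

        Vector3[] positions;
        readonly Stopwatch watch = new Stopwatch();

        void Start()
        {
            // Lay the instances out on a 64-wide grid, 2 units apart (arbitrary layout).
            positions = new Vector3[count];
            for (int i = 0; i < count; i++)
            {
                positions[i] = new Vector3(i % 64, 0f, i / 64) * 2f;
            }
        }

        void Update()
        {
            watch.Reset();
            watch.Start();
            for (int i = 0; i < positions.Length; i++)
            {
                Graphics.DrawMesh(mesh, positions[i], Quaternion.identity, material, 0);
            }
            watch.Stop();

            // Log roughly once a second at 60 fps to avoid flooding the console.
            if (Time.frameCount % 60 == 0)
            {
                UnityEngine.Debug.Log("DrawMesh loop: " + watch.Elapsed.TotalMilliseconds + " ms for " + count + " calls");
            }
        }
    }
    ```

    The profiler's CPU Usage hierarchy remains the better tool for seeing where the rest of the frame goes; the stopwatch only isolates the cost of the calls themselves.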
     
    Last edited: Feb 21, 2016
    hippocoder likes this.
  15. joe1016zw

    joe1016zw

    Joined:
    Apr 11, 2015
    Posts:
    3
    Last edited: Feb 25, 2016
  16. Lars-Steenhoff

    Lars-Steenhoff

    Joined:
    Aug 7, 2007
    Posts:
    3,549
    Will instancing support come to the new Apple TV?
     
    MrEsquire likes this.
  17. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Is there not an API call that sends an array or list of meshes to draw?
     
  18. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    Without using up a S***ton of CPU time with CombineMesh()? Not that I'm aware of. I hope I'm wrong tho.
     
  19. elbows

    elbows

    Joined:
    Nov 28, 2009
    Posts:
    2,502
    I don't suppose there is any news as to when to expect an instanced SpeedTree shader? Is it still supposed to happen during the beta phase?

    Many thanks.
     
  20. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    Sounds like an API that allows you to draw instanced meshes with an array of matrices would be ideal. How would you like it to work if you want more instanced data for each instance, e.g. color?
     
  21. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    With the latest b9, instancing is decoupled from the "dynamic batching" option and is always enabled on capable platforms. Instancing also takes priority over static/dynamic batching if the shader allows.
     
    mgooding likes this.
  22. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    And to answer generally: in 5.4, instancing works only on desktops (Windows, Mac and Linux with D3D11/D3D12/GL4.1).
    Apple Metal support is being worked on for 5.5.
    As for SpeedTree instancing, we are fully aware it might be the most useful case for instancing. It works and performance improves in demos, but it requires some changes to the instancing implementation. We are working on that right now. Stay tuned.
     
  23. Reanimate_L

    Reanimate_L

    Joined:
    Oct 10, 2009
    Posts:
    2,788
    Don't forget terrain detail objects too. Unless you guys already have a plan for a new terrain system :D
     
  24. Hyp-X

    Hyp-X

    Joined:
    Jun 24, 2015
    Posts:
    439
    I'm not the original poster but it would be definitely useful.
    As for additional parameters, 5.4 already supports arrays in MaterialPropertyBlock, so we can index into those from the shader with the instance ID.
    Support for passing the SpeedTreeWind CBUFFER to the shader using this DrawMesh API would also be good because it would make it possible to render (procedural) vegetation.
     
    mgooding likes this.
  25. Sebioff

    Sebioff

    Joined:
    Dec 22, 2013
    Posts:
    218
    Hey!
    Currently testing instancing on Mac (beta 8) by simply spawning a couple thousand spheres, and rendering takes ~50% longer than in 5.3. Profiler says this:


    A comment in the instancing Word doc says that it's not faster in all cases. Is this such a case, or is it some other problem?
     
  26. yuanxing_cai

    yuanxing_cai

    Unity Technologies

    Joined:
    Sep 26, 2014
    Posts:
    335
    Did you make changes to your shaders to enable instancing? Simply instantiating multiple identical objects with default shaders will not work.
     
  27. kite3h

    kite3h

    Joined:
    Aug 27, 2012
    Posts:
    197
    GPU instancing is not supported for objects using blended light probes. WIP or never supported?
     
  28. Sebioff

    Sebioff

    Joined:
    Dec 22, 2013
    Posts:
    218
    Yes, I created an instanced standard shader from the Create -> Shader -> Standard Surface Shader (Instanced) menu and used that (also made sure the source contained the #pragma instruction). All draw calls were listed as batched using Instancing in the Rendering profiler.
    Tested on a 2014 Mac Mini with Intel Iris GPU.
     
  29. yuanxing_cai

    yuanxing_cai

    Unity Technologies

    Joined:
    Sep 26, 2014
    Posts:
    335
    Hmm. That's peculiar. Can you send me your project and let us test it on our hardware?
     
  30. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    Yeah please send us your project and we will take a look asap as we are doing performance tests and improvements on instancing these days. Actually any instancing-heavy scenes are more than welcome!
     
  31. Sebioff

    Sebioff

    Joined:
    Dec 22, 2013
    Posts:
    218
    Sure! Submitted as case #776897. Let me know if there's any other data or information you might need and I'll provide it asap.
     
  32. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    I have tried your scene on my machine. Instancing runs 3x faster on b8 than on 5.3.3. I find it hard to believe that not instancing these spheres renders faster than instancing them... your scene will be very much bound by CPU performance on modern desktop hardware, and in the non-instanced path Unity actually has to submit all 10 thousand draw calls, since the sphere has more than 300 verts and can't even be dynamically batched.

    Can you do a comparison on b8, with and without instancing? And could you give the readings for main thread time, render thread time, and Gfx.WaitForPresent (from Profiler/CPU Usage/Hierarchy view), please?

    I'll ask our QA to test on some older hardware with Intel graphics next week.
     
  33. Sebioff

    Sebioff

    Joined:
    Dec 22, 2013
    Posts:
    218
    Yep, used the spheres to specifically test against the worst case of no dynamic batching.
    The CPU in this machine is a 2.6GHz i5 (Dual Core).
    Here are my numbers (most of them also included in the report). All using b8.

    Without Instancing:
    Running in the editor
    Numbers read from the stats gizmo:
    Main thread: ~80ms
    Render thread: ~45ms

    Running in a standalone player:
    profilerNonInstanced.png

    Gfx.WaitForPresent: fluctuating a lot between 20-35ms
    Gfx.ProcessCommands: ~70ms

    profilerNonInstancedHierarchy.png

    ======================================================

    With Instancing:

    Running in the editor
    Numbers read from the stats gizmo:
    Main thread: ~130ms
    Render thread: ~10ms

    Running in a standalone player:
    profilerInstanced.png

    Gfx.WaitForPresent: fluctuating between 75-100ms, usually in the 80s
    Gfx.ProcessCommands: relatively stable ~100ms

    profilerInstancedHierarchy.png
     
    Last edited: Mar 5, 2016
  34. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    In your case it looks like instancing incurs a big GPU performance penalty (your CPU is mostly waiting for the GPU to finish rendering 10,000 instanced spheres). This is exactly the case Yuanxing was talking about previously: the overhead of reading instance data from uniform arrays outweighs the gain from reduced CPU time, because your scene is not bound by CPU time anyway. Generally, in such a case, i.e. more than several milliseconds in Gfx.WaitForPresent, be cautious about turning to GPU instancing.

    Usually the overhead is small, but it can be tricky on some particular platforms/drivers (we don't have much test coverage on Mac/GL yet), and I'd guess your hardware is one of them. We'll try to find the same hardware and do more profiling.
     
  35. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    What ways are there to reduce the impact on the GPU side? Passing less data? For example, 10,000 grass quads is nothing for optimised code, but in this case it sounds as though it may be throwing the horse and cart at the GPU with redundant mesh properties?

    Also, will there be an API for "drawing lots of similar things", i.e. passing an array instead of paying the overhead from transforms and even the overhead from constant DrawMesh calls?

    Thanks for reading.
     
  36. Sebioff

    Sebioff

    Joined:
    Dec 22, 2013
    Posts:
    218
    Ah! That makes a lot of sense, thanks. So in my real project, which is bound by CPU time, I might see a speedup since I can move some work over to the GPU, which is currently not too busy.
    Could that be... automatically balanced somehow in the future, so that Unity tries to put an equal amount of load on the CPU and GPU? Since PCs vary so much, I can't possibly find the ideal balance myself, as that'll be different for every hardware configuration, won't it? It also depends on the scene.
    Obviously that isn't the goal for this initial release, just trying to figure out where this might be going :)
     
    Last edited: Mar 5, 2016
  37. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    Reducing instanced properties is a good idea. Currently we have built-in matrix arrays of ObjectToWorld and WorldToObject that are always used (hmm, I'm not sure where WorldToObject is used, I need to figure that out). If you are able to create a look of variety with techniques like some random mathematics in your shader, that is preferable to using instanced properties.

    Unity's current renderloop generally is not so awesome at dealing with huge amounts of GameObjects. Our next goal is to create a component that allows drawing instances fast and cheap, avoiding all the overhead of using GameObjects. As for 5.4 I think making a simple instanced draw API is very doable and can solve the problem in a programmatic way.
     
    landon912, sqallpl, mgooding and 5 others like this.
  38. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Sounds really brilliant, thank you for your hard work (and the guys who work with you) :)
     
    sqallpl likes this.
  39. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Would it be possible to build an RTS with thousands of Units with Instancing?
     
  40. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    100% absolutely YES. DrawArrayOfThisMeshAndThisMaterialWithThisArrayofMPBs would be *perfect*
     
  41. zeroyao

    zeroyao

    Unity Technologies

    Joined:
    Mar 28, 2013
    Posts:
    169
    Are there any real use cases for an array of MPBs, as opposed to a single MPB that contains arrays of your instanced data?
     
  42. djarcas

    djarcas

    Joined:
    Nov 15, 2012
    Posts:
    248
    In FortressCraft, I have plenty of machines that are identical meshes, but indicate their status via alteration of a GlowMultiplier via MPB.

    https://www.humblebundle.com/misc/files/hashed/ea96fb05d131c5b935e50ef43bf4cf0a3c7f8830.jpg
    https://www.humblebundle.com/misc/files/hashed/7c5565a1cf1d6508370fe42199dbdd72e6655a4d.jpg

    How that's authored doesn't bother me; but 'Render X game objects from this array of MPBs, which contains the rotational and transformational data' would be fine.

    (It didn't cross my mind that you could include positional offsets in an MPB until now)
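
    To make the "single MPB that contains arrays" idea concrete, here is a rough sketch. It is not something available in the 5.4 beta discussed here: it leans on Graphics.DrawMeshInstanced, which shipped in later Unity releases (5.5 onward), together with MaterialPropertyBlock.SetFloatArray. The class name, the field names and the "_GlowMultiplier" shader property name are assumptions for illustration, and the shader is assumed to declare that property as a per-instance property.

    ```csharp
    using UnityEngine;

    // Hypothetical component: draws many identical machines in one call, with a
    // per-instance glow value supplied through a single MaterialPropertyBlock.
    public class MachineGlowBatch : MonoBehaviour
    {
        public Mesh machineMesh;
        // Assumed to use an instancing-capable shader that reads _GlowMultiplier per instance.
        // In later Unity versions the material also needs "Enable Instancing" switched on.
        public Material machineMaterial;

        // DrawMeshInstanced accepts at most 1023 instances per call.
        const int MaxPerCall = 1023;

        readonly Matrix4x4[] matrices = new Matrix4x4[MaxPerCall];
        readonly float[] glow = new float[MaxPerCall];
        MaterialPropertyBlock block;
        int count;

        void Start()
        {
            block = new MaterialPropertyBlock();
            count = 256; // arbitrary demo size
            for (int i = 0; i < count; i++)
            {
                // Position and rotation live in the matrix array, not in the MPB.
                Vector3 pos = new Vector3(i % 16, 0f, i / 16) * 2f;
                matrices[i] = Matrix4x4.TRS(pos, Quaternion.identity, Vector3.one);
                // Example status: every other machine glows at full strength.
                glow[i] = (i % 2 == 0) ? 1f : 0.25f;
            }
        }

        void Update()
        {
            // One block, one array entry per instance.
            block.SetFloatArray("_GlowMultiplier", glow);
            Graphics.DrawMeshInstanced(machineMesh, 0, machineMaterial, matrices, count, block);
        }
    }
    ```

    With more than 1023 machines the arrays would have to be split across several calls, which is still far fewer submissions than one DrawMesh per machine.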
     
  43. sqallpl

    sqallpl

    Joined:
    Oct 22, 2013
    Posts:
    384
    Great to hear that.

    How is the terrain detail/grass handled at the moment btw? It's possible to spawn millions of instances in one scene and it runs fine.
     
  44. Roni92pl

    Roni92pl

    Joined:
    Jun 2, 2015
    Posts:
    396
    Grass (terrain details) can already be batched into tiles as big as possible (65k), using the detail resolution and detail resolution per patch terrain settings, and AFAIK it's not done at runtime, so it's very performant CPU-wise IMO.
     
    sqallpl likes this.
  45. Cripplette

    Cripplette

    Joined:
    Sep 18, 2012
    Posts:
    66
    Any news on GPU instancing for mobile?

    I am not really a low-level graphics dev, so I don't know the limitations. But do you think GPU instancing will be available on all mobiles, or are there limitations in OpenGL ES, GPU types, etc.?

    Cheers
     
  46. Hyp-X

    Hyp-X

    Joined:
    Jun 24, 2015
    Posts:
    439
    Yes, please.
    And please make wind work with the new solution (both preferably).

    We want to use it for procedural vegetation, but don't want to use the terrain's system (We want 3d rotation, easier destruction, and - if possible - better performance)
     
  47. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Well your "wind" is handled by the shader, I don't think instancing has much to do with it?
     
  48. tjscott

    tjscott

    Joined:
    Sep 22, 2014
    Posts:
    12
    Is there a way to get instancing to work on terrain grass? Or do I have to place grass manually with an instanced shader for this to work?
     
  49. Hyp-X

    Hyp-X

    Joined:
    Jun 24, 2015
    Posts:
    439
    Yes, except it needs some input constants (see at the top of SpeedTreeWind.cginc).
    It's similar to lighting - it is handled by the shader but light dir, light color constants are set by Unity.
    But Graphics.DrawMesh doesn't set those constants like MeshRenderer does.
    I reported a bug, but QA closed it saying that the documentation doesn't explicitly say that it should work...

    So it's not related to instancing when it's done by regular MeshRenderers, but it's relevant when we are talking about a new API to directly draw meshes. (The post which I replied to was about that.)

    Sorry if it wasn't clear.
     
  50. pointcache

    pointcache

    Joined:
    Sep 22, 2012
    Posts:
    579
    I'm writing a tile-based terrain renderer, and in the week I've been working on it I've tried every way I could find to optimize 8k tiles on screen.
    video:


    and details:
    I tried every approach I could to squeeze out maximum performance. What you see in the video uses DrawMesh with a command buffer, which proved to be the worst, as you have to do a lot of costly matrix transformations that, for some reason, do a string copy internally according to the profiler.
    The initial approach was to have chunks that regenerate meshes, as game objects in the scene. The performance was OK, but the amount of additional over-engineering in code and the load on the scene turned me away in search of a better solution.
    Right now the best performance comes from using GL.Begin(GL.QUADS).
    Stupidly iterating through the tiles and sending GL commands, so that you end up with a batch of commands per material, is fastest, so at 8k tiles I can see as much as ~150 fps. And that is bad, because I have a Core i7-4770 and a GTX 980,
    and I would want people to be able to play without such a rig.
    On top of this, this approach is hooked to OnPostRender (OnPreRender does not show anything), which will force me to manually control render order with additional cameras, but I would actually go for it with no problems.
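
    A stripped-down sketch of that immediate-mode approach, for readers who haven't used the GL class, might look like the following. The names (TileQuadRenderer, tileMaterial, tileOrigins, tileSize) are hypothetical, it handles a single material only, and it assumes the quad corners are given in world space with the camera's matrices still in effect when OnPostRender runs.

    ```csharp
    using System.Collections.Generic;
    using UnityEngine;

    // Attach to a Camera: Unity only calls OnPostRender on scripts that sit on a camera.
    public class TileQuadRenderer : MonoBehaviour
    {
        public Material tileMaterial;                            // one material per tile type (assumption)
        public List<Vector3> tileOrigins = new List<Vector3>();  // lower-left corner of each tile
        public float tileSize = 1f;

        void OnPostRender()
        {
            if (tileMaterial == null || tileOrigins.Count == 0) return;

            // Bind the material's first pass; every quad up to GL.End is drawn with it.
            tileMaterial.SetPass(0);

            GL.Begin(GL.QUADS);
            for (int i = 0; i < tileOrigins.Count; i++)
            {
                Vector3 o = tileOrigins[i];
                // Four corners on the XZ plane, each preceded by its UV.
                GL.TexCoord2(0f, 0f); GL.Vertex3(o.x, o.y, o.z);
                GL.TexCoord2(0f, 1f); GL.Vertex3(o.x, o.y, o.z + tileSize);
                GL.TexCoord2(1f, 1f); GL.Vertex3(o.x + tileSize, o.y, o.z + tileSize);
                GL.TexCoord2(1f, 0f); GL.Vertex3(o.x + tileSize, o.y, o.z);
            }
            GL.End();
        }
    }
    ```

    Repeating the SetPass / GL.Begin / GL.End group once per material gives the "batch of commands per material" described above.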

    Now, WHY I'm overcomplicating this, given the way I want tiles to be rendered:
    1. tiles should have transitions
    2. transitions are generated automatically; only a few guiding masks are provided, the rest is done at runtime
    3. the map is a decoupled data structure that does not depend on scene objects
    4. a splat map is ugly, not acceptable
    5. baking textures (I tried this) proves problematic, as the raw amount of texture data is insanely high and you can't modify compressed textures
    6. tiles are dynamic

    problems:
    1. atlases are useful only to partially cut draw calls, which are not an issue anyway
    2. even when using atlases, the only real solution for UVing tiles is to give each one a separate UV island, which requires a separate submesh; again, no benefit
    3. there is a material per tile TYPE (so any newly rendered transition texture receives its own material), and it is combined into one GL.Begin per material

    I may have forgotten a lot that would explain the issues more clearly, but just trust me on that :)
    Now my point is:

    I'm here looking for a way to optimize that renderer. I want to try instancing as it MAY help.
    However, as far as I can see from this thread, it for some reason requires a game object.
    And as you know, GOs create overhead even with rendering disabled; they are just too heavy to be used per tile.
    If I were to make chunks, it's useless again, as all chunks would be unique.

    So my last hope would be that I can send a batch command in code, avoiding the scene.
    It would be amazing if I could make a command like
    Code (CSharp):
    // Wished-for API: InstanceCommand and Graphics.ProcessInstanceCommand are hypothetical
    InstanceCommand IC = new InstanceCommand();

    IC.mesh = mesh;
    IC.instances = new List<Vector3>() { v1, v2, v3, v4 };
    IC.material = mat;

    Graphics.ProcessInstanceCommand(CameraEvent.AfterSkybox, IC);
    aaaaaaand that's it.
    It would send the mesh to the GPU, and then tell the GPU where I want that mesh. Is that possible?
    I really want to hear that there is a solution that will boost fps up to at least 500, or something where a relatively old PC gets at least 50-60 fps.

    Thanks for reading this wall of text.

    Just tell me in short whether I can expect this or not. I will switch to other tasks, but this is fundamental for this project.

    An alternative idea would be to have a Proxy object: a simple GO with a transform and a Mesh proxy component, where you select the mesh/material, but which is very lightweight, NOT rendered in the scene AT ALL, and not even serialized as a full object, only as a reference to the first clone.
    That is how most 3D packages work.

     
    Last edited: Mar 12, 2016