
Experiments with Instancing and other methods to render massive numbers of skinned Meshes

Discussion in 'General Graphics' started by Noisecrime, Dec 23, 2016.

  1. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    Hi,

    For the past week I've been contemplating and experimenting with methods to render massive numbers of skinned meshes in Unity. Initially I was using DrawMeshInstanced, but quickly switched over to DrawMeshInstancedIndirect due to the potential performance benefits it offers. I figured it would be worth writing down my thoughts, findings, progress and bugs as a starting point for anyone else.

    Current progress is illustrated with the video below, which is best viewed fullscreen and using HD quality.



    It shows 10,000 Lerpz skinned meshes. Each instance has a unique colour tint (mostly for debugging, but in future to show that the appearance of each can easily be changed) and is running its own independent animation, with frame interpolation. The cool thing about DrawMeshInstancedIndirect is that the frustum culling and dynamic LOD selection, as well as the instancing itself, are all done on the GPU. This saves a huge chunk of CPU time, and in the video you can see framerates of between 160 and 250 fps depending on the number of instances visible.

    As for potential, I've had it running 40,000 instances and managed 30 fps on my GTX 970. I'll provide some more details and add my thoughts and findings to the thread later.
     
    Martin_H, Novack, Zaelot and 6 others like this.
  2. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    So as far as I'm aware there are two main methods to feasibly achieve a high number of 'skinned' mesh animation instances in Unity.


    Baked Meshes
    This method involves creating a unique mesh for every frame, of every animation, of every model and every LOD in the game. Then you choose the correct frame and use DrawMesh to render it. Even though the technique sounds awful, from what I've seen it is perfectly capable of achieving a high number of rendered models on a good system.

    Drawbacks
    • Potentially heavy memory requirements depending upon the framerate of your animations, number of models, number of animations and number of LODs.
    • The animation is basically like a 'flip book', as you are simply presenting a static mesh each frame. Granted, we've had that in 2D games with sprites for decades, but I'm skeptical that it will feel good when mixed with other non-baked animations and physics-driven objects, especially if the framerate is not a multiple of the baked animation rate.
    • Culling and LOD (generally) still have to be done on the CPU in order to use DrawMesh.

    Positives
    • Interacts as expected with all the rest of the Unity rendering systems with no additional effort.
    • Relatively straightforward to implement.
    • Doesn't tax the GPU.
    • Works for older GPU systems as long as there is enough memory.

    In fact it is entirely possible that the downsides could be addressed with some clever tricks and by pushing more work to the GPU. For example, providing the next frame's mesh data alongside the current mesh, so you could interpolate between the two in the vertex shader, something akin to the old MD2 Quake format, should be feasible.
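The math behind that interpolation idea is just a per-vertex lerp between two baked frames. A minimal NumPy sketch of the concept (in practice this would live in the vertex shader, with the second frame supplied as an extra vertex stream; the function name and toy data are my own):

```python
import numpy as np

def interpolate_baked_frames(frame_a, frame_b, t):
    """Per-vertex lerp between two consecutive baked mesh frames,
    MD2-style. frame_a, frame_b are (N, 3) vertex position arrays;
    t in [0, 1] is the fractional position between the frames."""
    return (1.0 - t) * frame_a + t * frame_b

# toy example: two vertices, two baked frames of a 10 fps animation
frame_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
frame_b = np.array([[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
mid = interpolate_baked_frames(frame_a, frame_b, 0.5)
print(mid)  # each vertex halfway between the two frames
```

With this in the vertex shader, a 10 fps bake can still produce smooth motion at any render framerate, at the cost of fetching two frames' worth of vertex data.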



    Skinned Instanced Meshes
    Generally a more complex method, as it requires implementing your own skinning on the GPU as well as supporting instancing and dealing with a bunch of other stuff. This requires extracting all the bone animation data and passing it to the GPU, where it can be used with custom shaders, along with the instance ID, to render the mesh and animate it with bone matrix palette skinning. There is a good example/source for this on the nvidia website and in GPU Gems 3.

    Thankfully with ComputeBuffers you no longer have to pass the data via textures like they did in 2007. Though everything I've read implies that using textures and bilinear interpolation can automatically provide inter-frame interpolation. I'm not sure about this, as my understanding is you cannot simply lerp two matrices. The positional part should be fine, but weird things are going to happen to the rotations. I'm going to have to give it a try sometime and see, as one of the biggest drawbacks of this method is that it will make your GPU cry due to the amount of effort placed on the vertex shader.

    Drawbacks

    • It will use every ounce of your GPU power. The bone matrix palette skinning and all the lookups required are a constant overhead, and it's per vertex! This is what makes LOD so important, as every vertex saved means saving considerable processing time in the vertex shader.
    • Doesn't always play nice with Unity rendering systems due to instancing. I think there might be a number of gotchas coming up with this, such as supporting light probes, or forward rendering not working with multiple lights. I know there is currently a bug in the demo where shadows no longer respect the positions of the drawn instances. I think this is shader related, as I'm sure it was working fine before adding the frustum culling method or LOD.
    • Shadows are an issue, as they require rendering each instance again, or with cascades maybe several times. Since, as stated, the bottleneck of this technique is the vertex skinning, that cost becomes amplified. Essentially, every time you render the instances again it will halve your framerate. This could be alleviated with 'stream-out', where you store the shader's resultant geometry on the GPU, which AFAIK is what Unity uses for its own GPU skinning. However, due to the sheer number of instances being rendered this would be prohibitive in this case and worse than baked meshes in terms of memory requirements.
    • Forward rendering has the same issue as shadows: every light requires an additional add pass, which becomes prohibitively expensive as you are running all the vertex calculations again. However, so far my experience has been that forward rendering with multiple lights is just broken, and even if it worked, according to the docs the add-pass instances would be rendered normally instead of instanced. It might just be feasible though if we can build on the custom shader provided by Valve for its VR Lab Renderer, which I believe supports quite a few lights and shadows in forward rendering without using the add-pass technique.

    Positives
    • Greatly reduced amount of data to store on the GPU due to frame interpolation. In the above demo the animation is stored at just 10 fps; however, that can easily be increased to say 30 fps and still take only a fraction of the storage that baked meshes would.
    • Can easily off-load frustum culling and LOD selection to the GPU, which can save a good chunk of CPU time. In addition, I want to add per-instance depth sorting to minimize overdraw (not sure how much of a win that will be in deferred). Taking it further, you could even drive the entire crowd on the GPU using simulation.
    • To a degree it's easily scalable to your hardware: simple to adjust the number of instances, use lower vertex count models, dynamically change LOD settings, etc.
    • It's even possible to drive the instances via a Mecanim animator, though not possible to have an animator per instance, not even close, and performance will suffer.
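To put some rough numbers on that storage claim, here is a back-of-the-envelope comparison (all figures are illustrative assumptions, not taken from the demo): a 2,000-vertex model with 30 bones and 10 animations of 2 seconds each.

```python
# Illustrative storage comparison: baked meshes vs bone matrices.
# All model figures below are assumed for the sake of the example.
VERTS, BONES, ANIMS, SECONDS = 2000, 30, 10, 2.0
BYTES_PER_VERT = 6 * 4        # position + normal, two float3s
BYTES_PER_MATRIX = 16 * 4     # one 4x4 float matrix

def baked_mesh_bytes(fps):
    """Storage for one full mesh snapshot per animation frame."""
    frames = ANIMS * SECONDS * fps
    return frames * VERTS * BYTES_PER_VERT

def bone_matrix_bytes(fps):
    """Storage for one matrix per bone per animation frame."""
    frames = ANIMS * SECONDS * fps
    return frames * BONES * BYTES_PER_MATRIX

print(f"baked meshes @ 30 fps: {baked_mesh_bytes(30) / 1e6:.1f} MB")
print(f"bone matrices @ 10 fps: {bone_matrix_bytes(10) / 1e6:.3f} MB")
```

Under these assumptions the baked approach needs tens of megabytes per model while the bone-matrix approach needs well under one, which is why frame interpolation plus low sample rates pays off so heavily here.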




    Driven by Mecanim
    It's possible to drive the skinned instance method via Mecanim, but it cannot have each instance using an individual animator/animation.

    Mecanim is pretty amazing, but it has a reasonably large overhead, an overhead that is considerably worse when you can't use the 'Optimize Game Objects' option. That option cannot be used, as currently the only way to get the animating bone data is to fetch the transforms of each bone. If only the Animator component could supply an array of Matrix4x4 for each bone instead of driving transforms, you could probably double the number of Mecanim animations driving instances. However, this would still end up as a fraction of the potential instances that could be rendered.


    Its all rather complex
    Once you have your chosen method up and running, things are still more complex to deal with than normal, as the main point of both systems is to completely remove/detach the rendering of instances from Unity's gameObject model.

    It's the gameObject model that can really hammer performance once you scale up to 10,000 or more objects. Modifying the transforms, updating bone transforms, etc., all adds up as an overhead. Both the suggested systems avoid gameObjects per instance and instead work with arrays of position/rotation data (Matrix4x4), but this means it's somewhat harder to create a generic system that could easily be plugged into any project, and it would require the developer to drive their game more via code.


    So Many Possibilities
    Currently I'm undecided as to which method is best, or indeed if there even is a best. I suspect each has its place depending upon project requirements. Though both have some serious drawbacks, I believe they can be addressed with some lateral thinking and effort.

    Beyond that there is the consideration of variation. It's all very well rendering 10,000 instances of the same model, but even if the animation of each is independent, they all look the same. Colour tinting on its own isn't enough, so considerable effort will have to be employed to find the most optimal methods of creating variations using the same input data. There are a number of avenues to pursue for this, from simple instancing of parts (e.g. different heads, helmets, weapons, clothing) to more advanced concepts such as Valve's Left 4 Dead gradient mapping.
     
    XCO, asdzxcv777, Novack and 5 others like this.
  3. pld

    pld

    Joined:
    Dec 12, 2014
    Posts:
    4
    Great post. My $0.02 on this:

    Regarding "you cannot simply lerp two matrices": but you can! Hand-wavy proof: you can take the derivative of a matrix-valued function; as long as your derivative is relatively well behaved, you can pretend that the derivative is constant (for small time values). This is the idea behind nlerp.

    Another hand-wavy proof: think of what's going to happen to your basis vectors. For small rotations, they'll remain mostly orthogonal, and mostly unit-length. For larger rotations you'll start seeing more issues.
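A quick numeric illustration of that point (my own sketch, not from the thread): lerping two rotation matrices stays very close to a true rotation for small angles, but visibly degrades for large ones. The error measure below is how far M·Mᵀ drifts from the identity, which is zero for a perfect rotation.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def orthonormality_error(m):
    """Max absolute deviation of m @ m.T from the identity;
    zero iff m is a perfect rotation (orthonormal) matrix."""
    return np.abs(m @ m.T - np.eye(3)).max()

# lerp halfway between the identity and a rotation by `angle`;
# analytically the error is sin^2(angle / 2)
for deg in (5.0, 90.0):
    m = 0.5 * rot_z(0.0) + 0.5 * rot_z(np.radians(deg))
    print(deg, orthonormality_error(m))  # ~0.0019 for 5, 0.5 for 90
```

So for keyframes that are close together, as in sampled animation data, the naive lerp is effectively an nlerp minus the renormalisation, and the shrinkage is tiny; normalising the basis vectors in the shader would clean up even that.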
     
    Noisecrime likes this.
  4. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    Yeah, for simplicity I just lerped two matrices during testing and didn't notice anything horrendous happening. I guess the fact that it's always lerping between two fixed matrices helps (i.e. it's not accumulative), and the keyframes are generally close enough together that it remains well enough behaved.
     
  5. Afif-Faris

    Afif-Faris

    Joined:
    Oct 11, 2013
    Posts:
    16
    Whoa, that's awesome!
    Any chance you will share the Unity source project from the video? I'd like to learn how you did the GPU instancing and the LOD system.

    I am not sure if you already know this, but it's related.
    This game, made in Unity, has 25k animated characters in a real-time battle, and the performance is amazing. I don't know how they did it.
     
  6. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    5,004
    How are you reading the animation data? Are you parsing files on disk directly, or do you have some method of reading the curves from animation clips directly?
     
  7. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    At some point I may release the code or put it up on the Asset Store. The problem is that so far it is very specific and not general enough to simply be plugged into any project.

    EpicBattleSimulator is, or at least was, based around using baked meshes and DrawMesh in order to achieve its impressive performance. The developer spoke about this in a thread on the Unity forums that was discussing, in general terms, producing an RTS in Unity with many thousands of units; a Google search should find it.

    The key point is that having tens of thousands of gameObjects in Unity will never perform adequately, as it's not designed to. Therefore, instead of having each unit defined as a gameObject, you would create a class that renders the models via Graphics.DrawMesh() instead. The use of baked meshes means you can avoid the expensive CPU bone/vertex animation costs, as you simply switch between meshes like a sort of flipbook.
     
  8. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    I simply created my own animation format (a Unity ScriptableObject as a container for the data) that extracts the matrices of each bone for each frame of animation. This data is then supplied as a matrix array via a compute buffer, so that GPU skinning can be used for each instance and each instance can have its own frame index into the animation data.

    It's pretty simple to extract the animation data from legacy animations via AnimationClip.SampleAnimation(); Mecanim is a bit harder/more awkward.
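One plausible layout for that matrix-array buffer, sketched in Python (this is my own guess at the scheme described above, not Noisecrime's actual format; the constants and names are illustrative): all frames of all bones flattened into one array, with each instance carrying its own fractional frame index into it.

```python
import numpy as np

# Assumed flat layout: frame-major, then bone, one 4x4 matrix each,
# matching a ComputeBuffer of Matrix4x4 (stride 64 bytes) on the GPU.
BONES = 30
FRAMES = 20
buffer = np.zeros((FRAMES * BONES, 4, 4))

def bone_matrix_index(frame, bone):
    """Flat index of the matrix for a given frame and bone."""
    return frame * BONES + bone

def sample(frame_float, bone):
    """Per-instance lookup with inter-frame interpolation: each
    instance's fractional frame index picks two adjacent frames
    and lerps their bone matrices (see the matrix-lerp discussion
    earlier in the thread)."""
    f0 = int(frame_float) % FRAMES
    f1 = (f0 + 1) % FRAMES          # wrap for looping animations
    t = frame_float - int(frame_float)
    m0 = buffer[bone_matrix_index(f0, bone)]
    m1 = buffer[bone_matrix_index(f1, bone)]
    return (1.0 - t) * m0 + t * m1
```

In the shader the same index math turns an instance ID plus a per-instance frame counter into two buffer fetches per bone.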
     
    bgolus likes this.
  9. jimCheng

    jimCheng

    Joined:
    Feb 5, 2014
    Posts:
    47
    I've also made a similar open-source project based on this post.

    project@github

     
  10. AndreaBrag

    AndreaBrag

    Joined:
    Jan 30, 2015
    Posts:
    7
    Hey @Noisecrime,

    Great video. I just started looking into rendering techniques in order to achieve decent frame rates with a couple of thousand units, and GPU skinning surely looks worth a shot!
    I couldn't find many resources online regarding DrawMeshInstanced/DrawMeshInstancedIndirect and the overall topic; would you mind answering some questions?

    - What's the actual difference between DrawMeshInstanced and DrawMeshInstancedIndirect?
    - Is there some documentation you can share on how to achieve this result?
    - What is, in your opinion, the best way to extract bone animation data (API-wise)? What about storing it in a texture using RGB as coordinates, as they showed in a Unite16 video?

    Thank you for your time.
     
    Last edited: Mar 28, 2017
  11. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    874
    Hi, I tested your demo. It has some performance issues and fails to run on an iPad mini 2.
     
  12. alfiare

    alfiare

    Joined:
    Feb 10, 2017
    Posts:
    1

    Hi there, I'm new to all this but I'm loving the instancing.
    I'm attempting to get the animation data out of animation clips and feed it to a compute buffer that can then be read in a vertex shader to transform the object based on the bones. I'm having some trouble getting the transforms coming from the bones right. I'm doing the SampleAnimation call on the clip and then reading the localToWorldMatrix off the transform of the bones in the SkinnedMeshRenderer. But when I apply these in the vertex shader, the pieces of the mesh all come apart. Maybe I'm just grabbing the wrong thing off the bone transform? How did you get the matrices for the bones from the animation clip sampling?
     
  13. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    If I remember correctly it's:

    root * bone.localToWorldMatrix * bindposeMatrix

    where
    root = go.transform.localToWorldMatrix.inverse
    and bindposeMatrix is accessed via skinnedMeshRenderer.sharedMesh.bindposes, which is an array that matches the bones array.
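A quick NumPy sanity check of that composition (my own sketch, using translation-only matrices for simplicity): since the bindpose is the inverse of the bone's bind transform relative to the root, a bone that hasn't moved from its bind pose yields the identity, i.e. vertices stay exactly where they were modelled.

```python
import numpy as np

def trs(translation):
    """4x4 translation-only matrix (enough for this check)."""
    m = np.eye(4)
    m[:3, 3] = translation
    return m

# assumed example transforms: bone's world matrix at bind time,
# and the mesh root's world matrix
bone_bind_world = trs([1.0, 2.0, 0.0])
root_world = trs([0.5, 0.0, 0.0])

# bindpose = inverse of the bone's bind transform relative to the root
bindpose = np.linalg.inv(np.linalg.inv(root_world) @ bone_bind_world)

def skin_matrix(root_world, bone_world, bindpose):
    """root.inverse * bone.localToWorldMatrix * bindposeMatrix,
    the composition given above."""
    return np.linalg.inv(root_world) @ bone_world @ bindpose

# bone still at bind pose -> identity: vertices are untouched
print(np.allclose(skin_matrix(root_world, bone_bind_world, bindpose),
                  np.eye(4)))  # True

# bone translated +1 in y -> skinned vertices translate by +1 in y
moved = skin_matrix(root_world, trs([1.0, 3.0, 0.0]), bindpose)
print(moved[:3, 3])  # [0. 1. 0.]
```

If the mesh "comes apart" as described above, the usual culprit is a missing bindpose term or a missing root inverse in exactly this product.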
     
  14. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    17
    Hello, I just found your thread, as I've been diving into instanced animation as well for roughly a week now. Here's a short video of some horses running around :)
    Have you used the system in one of your projects, or would you share some more information about your learnings so far?
     
  15. marcatore

    marcatore

    Joined:
    May 22, 2015
    Posts:
    125
    @marwi good examples in your tweets.
    Do you plan to release or share your system?
    I'd like to make something similar, to manage a crowd in an efficient way, but I'm a real noob at how to create one.
    Do you know of any tutorial, article or similar resource where I could start studying how to achieve something like what you did? I know you probably assembled it from different pieces of knowledge and experience, but anything you can point me to would be helpful; I'd really appreciate it.

    Thanks in advance.
     
  16. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    769
  17. marcatore

    marcatore

    Joined:
    May 22, 2015
    Posts:
    125
    @richardkettlewell thank you.
    I've tested it and it seems to be working. Now I need to work out how to position the instances where I want.
     
    richardkettlewell likes this.
  18. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    17
    Hello @marcatore, thanks. I haven't really thought about it yet. The system should actually be quite usable for managing a crowd because of the recent refactoring, which decoupled skinning (with or without pushing animation data to the GPU), logic (e.g. with compute shaders) and rendering (i.e. the calls to the Graphics instancing methods). Do you have a concrete project you need the system for, or is it more for research/learning purposes?

    I once collected some links related to GPU work/shading in a GitLab snippet here: https://gitlab.com/snippets/1671386; maybe this will be useful for you too :)
     
    marcatore and thelebaron like this.
  19. marcatore

    marcatore

    Joined:
    May 22, 2015
    Posts:
    125
    @marwi thank you very much. Really.

    About your question: I have a concrete project.
    In a few words, I'm part of a very small team developing a rally sim, and we'll mainly have two kinds of stage: classic rally stages with an open path, and circuit stages with a closed path.
    So the crowd will be less dense on the open-path stages and denser on the closed ones, positioned in the grandstands.
    Given that we're not making a crowd simulator :), the target is to have reasonably well-animated people near the roads that are not too heavy to render.
    So I think the tools I need are to be found in a thread like this... Am I on the right track, or should I look elsewhere? :)

    Thanks in advance for any tips. :)
     
  20. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    1,487
    Sorry I'm late. I've not used the system in a project; I got side-tracked with client work. Alas, I'm now recovering from a heart attack, so it's unlikely I'll be able to answer any questions or do more work on this for several months. Hopefully other posters here have helped/can help.
     
  21. hopeful

    hopeful

    Joined:
    Nov 20, 2013
    Posts:
    4,400
    Best wishes for a speedy and comfortable recovery! I know it can't be a great experience - right? - but I hope for you it can be mostly on the better side.
     
    Noisecrime likes this.
  22. marwi

    marwi

    Joined:
    Aug 13, 2014
    Posts:
    17
    No need to apologize, very sorry to hear that! Wish you a fast and good recovery as well!
     
    Noisecrime likes this.