Showcase Optimization opportunities of skinned mesh rendering system

Discussion in 'Graphics for ECS' started by Rukhanka, Sep 8, 2023.

  1. Rukhanka

    Rukhanka

    Joined:
    Dec 14, 2022
    Posts:
    204
  2. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,252
    It is nice to see someone else also investigating this.

    One thing I'd like to point out is that your 16-bit compression may not be so good once you implement culling. What you've done is cut the compute shader write in half, but have increased the vertex shader read cost by 1.5x. If every mesh went through the compute shader and the vertex shader exactly once, then the total memory bandwidth required doesn't change at all. It is a win here, because there's a lot of wasted compute writes that are never read by the vertex shader. But once culling comes into play, the opposite is true. There will be more vertex shader reads than compute shader writes, because meshes typically get rendered twice (camera and main light shadows). Of course, hardware can be unpredictable sometimes, so it may still be worth profiling.
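The bandwidth argument above can be checked with a small back-of-the-envelope model. This is only a sketch: costs are in relative units where 1.0 is one full-precision skinned copy of a mesh, the 0.5 and 1.5 factors are taken from the post, and the function names and the mesh counts (100 skinned, 40 vertex shader reads, 20 visible) are made up for illustration.

```python
# Relative memory traffic per mesh, in units of "one full-precision skinned copy".
# The 0.5x write and 1.5x read factors come from the post; everything else is assumed.
FULL = {"write_cost": 1.0, "read_cost": 1.0}
COMPRESSED = {"write_cost": 0.5, "read_cost": 1.5}


def bandwidth(writes, reads, write_cost, read_cost):
    """Total traffic for `writes` compute shader writes and `reads` vertex shader reads."""
    return writes * write_cost + reads * read_cost


def compare(writes, reads):
    """Return (full_precision_total, compressed_total) for the same workload."""
    return (
        bandwidth(writes, reads, **FULL),
        bandwidth(writes, reads, **COMPRESSED),
    )


# Every mesh skinned once and rendered once: no difference.
once = compare(writes=1, reads=1)

# No culling on skinning: 100 meshes skinned, but only 40 vertex shader reads.
no_culling = compare(writes=100, reads=40)

# Culling on skinning: 20 visible meshes, each rendered twice (camera + shadows).
culled = compare(writes=20, reads=40)

print(once, no_culling, culled)
```

In this model the break-even point is exactly one read per write. Below it the compressed format moves less data, above it the full-precision format does. Real hardware caches and fetch paths are not modeled, which is why profiling is still worthwhile.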

    Culling and LODs are still going to be way bigger wins than what you are seeing. I have that implemented (including fixing the memory allocation problem) if you want to try it to compare. If you want Unity Transforms, you can use this branch: https://github.com/Dreaming381/Latios-Framework/tree/prerelease/0.8.0 with the scripting define LATIOS_TRANSFORMS_UNITY. You'll need to attach an Animator component to the roots of your skinned meshes with the Animator Controller null if you want to render static skinned poses. Otherwise, the meshes will bake in an attachment mode meant for character customization use cases.

    The fastest approach is using QVVS Transforms (no scripting define) with characters that use the "Optimize Hierarchy" option in the rig import settings and have proper LODs.

    One last thing, with culling and no LODs, the performance gap zoomed in versus zoomed out will increase dramatically. But with proper LODs, that should even out again where performance should be constant across the board, and you'll likely become CPU-bound.

    Feel free to make whatever comparisons, criticisms, or anything else you like! While I have to be careful to not infringe on your IP, you don't have to worry about that at all.

    I'm willing to answer any questions anyone has about this topic. I went through this investigation back before we had any real working animation solutions, so my findings didn't get much attention.
     
  3. optimise

    optimise

    Joined:
    Jan 22, 2014
    Posts:
    2,128
    Does your solution support mobile with high performance, especially on low-end Android devices?
     
  4. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,252
    It depends on the GPU architecture of the Android device. On some Android devices, it is impossible to obtain high performance at scale with Unity's API, because any dynamic indexing in the vertex shader is slow. You have to bake the skinning data into a bunch of unique meshes, while still sharing the parts of the mesh stream that aren't unique, and that partial sharing of mesh streams isn't possible in Unity. So as much as I would like to do something about it, I would only be able to do something that covers maybe 80% of the landscape, and it honestly isn't worth it.
     
  5. Rukhanka

    Rukhanka

    Joined:
    Dec 14, 2022
    Posts:
    204
    Sorry, I totally forgot about your research. That is because I thought you had done a big refactoring of Unity's ECS libraries (and actually you have). When you do things a different way, it can barely be called an "optimization" of the existing system :)

    Allow me to disagree here. With plain UAV reads this can be the case, but the original mesh data comes in as an input-attribute vertex stream. This concept has existed for ages, and hardware vendors have optimized it very well for reads (cached vertex fetch). I don't know the very low-level behavior of this aspect (maybe it is beneficial to read the whole specified input stream into cache up front; unlikely, but who knows).

    I have made two modifications of my test program: one with skinned vertex deltas and one without. Both use HDRP with four shadow cascades that cover the whole scene. With the depth prepass, the GBuffer pass, and the 4 shadow splits, there are 6 render passes in total. I placed the camera at the same viewpoint, and the compressed vertex delta runs faster than the full skinned vertex:
    upload_2023-9-9_14-34-57.png

    upload_2023-9-9_14-35-9.png


    I completely agree here. Reducing the number of processed meshes at a high level will give the biggest benefits. I will look at your framework closely; I wish I had done so sooner. What can I say, I haven't even looked closely at the Entities.Graphics deformation implementation.

    I have also added "optimize hierarchy" (entity bone stripping) to the upcoming Rukhanka v1.4.0 release.

    I am grateful for that attitude. I am trying to stick to the same politics.

    I am sorry to hear that (about the attention). I really hope this will not prevent you from making great contributions in the future.
    Can I ask about DMotion? Is it abandoned?
     
  6. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,252
    It looks like in your test there are meshes outside of view. Until you have frustum culling on skinning operations, I don't really buy your results.
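For readers unfamiliar with the term, frustum culling on skinning operations means skipping the skinning work for meshes whose bounds fall outside the camera frustum. A minimal sketch follows, assuming bounding spheres and inward-facing frustum planes. It is not the implementation used by either framework in this thread, and all names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    normal: tuple  # unit normal pointing into the frustum
    d: float       # plane equation: dot(normal, p) + d = 0


@dataclass
class Sphere:
    center: tuple
    radius: float


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def is_visible(sphere, planes):
    """Conservative test: reject only if the sphere is fully behind some plane."""
    return all(dot(p.normal, sphere.center) + p.d >= -sphere.radius for p in planes)


def meshes_to_skin(bounds, planes):
    """Indices of meshes whose bounds touch the frustum; only these get skinned."""
    return [i for i, s in enumerate(bounds) if is_visible(s, planes)]


# A box-shaped stand-in for a frustum: x and y in [-10, 10], z in [0, 100].
planes = [
    Plane((1, 0, 0), 10), Plane((-1, 0, 0), 10),
    Plane((0, 1, 0), 10), Plane((0, -1, 0), 10),
    Plane((0, 0, 1), 0), Plane((0, 0, -1), 100),
]
bounds = [
    Sphere((0, 0, 50), 1),     # inside
    Sphere((50, 0, 50), 1),    # far outside on +x
    Sphere((10.5, 0, 50), 1),  # straddles the +x plane
]
print(meshes_to_skin(bounds, planes))
```

In a benchmark without this step, every mesh is skinned regardless of visibility, which is the situation the post argues makes the compressed-delta results hard to generalize.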

    I wouldn't be surprised if NVidia still has attribute streaming hardware, but I believe AMD and Apple have both ditched it, similar to how a lot of GPUs have ditched fragment interpolation hardware and replaced it by patching the fragment shader. It is still very prevalent on Android mobile, but that's more of a latency issue than a bandwidth issue.

    My point is that the optimization likely won't be universal once you have culling, and in some cases may be detrimental. But hardware is weird. Vertex skinning is still faster in many cases despite more UAV accesses. Couldn't tell you why.

    It is not politics, it is law. Your IP is protected, mine is intentionally not, because I find protecting my animation tech harmful to my goals. I invite criticism as harsh as the criticism I leave the Unity teams, because different perspectives may lead to better designs in the future.

    Nothing to be sorry about. All I was trying to say is that I've spent a lot of time on this topic long before it became a hot topic, and consequently I have a lot of answers. At this point I've mostly moved on from performance improvements of skinned mesh rendering and have been focusing on other features. That's not to say that there aren't still improvements to be made or that I won't improve it in the future, I just don't have any pressure to improve it right now. The only performance complaints I ever get these days are about Android devices, but there's nothing I can do about that without help from either the GPU hardware manufacturers or Unity themselves.

    The author and I talked a few times, and he was onboard with the Mecanim layer in Kinemation when that opportunity came about. I believe now he is specializing DMotion for his studio's needs, which makes a ton of sense to me knowing what his goals are. I could be totally wrong though.

    I'm very curious how you go about solving culling, especially the buffer allocation part. I spent months on that. If you decide you don't want to duplicate effort, I'm open to collab too. I have no real personal incentive to do anything with muscle space, so that will likely remain a differentiating factor for you. NetCode is something I still haven't made my mind up on, but it's next on my list to figure out (assuming my list doesn't change with new opportunities).
     
  7. Rukhanka

    Rukhanka

    Joined:
    Dec 14, 2022
    Posts:
    204
    I have cropped the screenshots. The view of all instances is identical in both.

    Ah, yes. NVidia is famous for such things.

    I am open to criticism too. Full source code is available to buyers, and I receive suggestions and improvements from time to time. The distribution model was chosen mostly to discipline me: I have obligations to customers, and that gives me additional drive to continuously improve Rukhanka.

    Thanks for the explanations.

    I haven't really dived into the culling topic yet. Once I have formed a concept, I will gladly share it.
     
  8. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,252
    That's fascinating, but I think you missed my point. My point is that just because I have to avoid your IP and be cautious of what I say, doesn't mean you need to do the same for me. In terms of overall evolution of ECS animation technology, you acting as if the tech I've built is behind a paywall does more harm than good. I'm not saying you are doing that intentionally. I know you aren't. But the reality of the situation is weird and unintuitive.