SkinnedMeshRenderer.BakeMesh() slow performance

Discussion in 'General Discussion' started by BiomotionLab, Feb 10, 2020.

  1. BiomotionLab

    BiomotionLab

    Joined:
    Oct 9, 2018
    Posts:
    11
    I have a mesh with ~7000 vertices. When I call the following code, it seems to take around 15 ms according to the profiler, and I get the same figure when I time it with System.DateTime over 100 runs.

    Code (CSharp):
    Mesh bakedMesh = new Mesh();
    skinnedMeshRenderer.BakeMesh(bakedMesh);
    This seems really slow. I'm on a super powerful desktop.

    The reason I'm baking is to get the current position of the mesh's vertices. I don't care about other aspects of the mesh at all.

    Is it normal for this to be that slow? Is there a better way to get the current vertex positions?
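    For readers trying the same thing, here is a minimal sketch of one way to avoid per-call allocations: bake into a single reused Mesh and read the positions into a reused list. It does not reduce the skinning cost itself. The component name BakedVertexReader and its fields are hypothetical; BakeMesh and Mesh.GetVertices are existing Unity calls.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical helper: reuses one Mesh and one List so that no new Mesh
// or vertex array is allocated on every bake.
public class BakedVertexReader : MonoBehaviour
{
    public SkinnedMeshRenderer skinnedMeshRenderer;

    Mesh bakedMesh;
    readonly List<Vector3> vertices = new List<Vector3>();

    void Awake()
    {
        bakedMesh = new Mesh();
    }

    void LateUpdate()
    {
        // With default settings, animation has been applied to the bones
        // by the time LateUpdate runs.
        skinnedMeshRenderer.BakeMesh(bakedMesh);

        // Fills the existing list instead of returning a fresh array.
        bakedMesh.GetVertices(vertices);

        // 'vertices' now holds the skinned positions, expressed relative
        // to the SkinnedMeshRenderer's transform.
    }

    void OnDestroy()
    {
        // Meshes created from script are not cleaned up automatically.
        Destroy(bakedMesh);
    }
}
```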
     
  2. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,566
    The power of the desktop doesn't matter much, because this is done on the CPU, on a single core, and it likely prepares vertex/index buffers as well, just in case. So it won't be very fast.

    7000 vertices results in a relatively large number of operations, in the ballpark of:
    ((12 * 4 + 4*3)* 2)* 7000 --> 840000 multiplications alone (and roughly the same number of additions) each time you call it.

    That is 12 multiplications for each vertex, done 4 times (4 bone weights), then 12 more multiplications to blend the results together, and the whole thing is done twice, because normals and positions are calculated the same way.
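    Written out step by step, the estimate above is:

```latex
(12 \times 4 + 4 \times 3) \times 2 \times 7000
  = (48 + 12) \times 2 \times 7000
  = 120 \times 7000
  = 840{,}000
```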

    I'd suggest working out what you're actually trying to achieve and looking for alternative ways of doing it.
     
  3. BiomotionLab

    Ahhh.. As I feared. Too bad it seems to be the only way to get mesh vertices besides basically recalculating the mesh on my own every frame.
     
  4. neginfinity

    You could try implementing the blending yourself, or maybe even feeding the data into a compute shader, and then see if it is faster.

    The last time I implemented reference blending on the CPU with C# in Unity, it wasn't very fast, though, and with a compute shader it is unclear whether the overhead will kill any speedup you get. But it might be worth a try if you really need those vertex positions.

    https://docs.unity3d.com/Manual/class-ComputeShader.html
     
  5. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    3,356
    Ya, skinned meshes are a special case. And just getting the mesh itself every frame is generally considered untenable in games, even if it's not a skinned mesh. Whatever your higher-level problem is, it's highly likely there is a better solution that doesn't require mesh data.
     
  6. BiomotionLab

    Thanks, I'll look into manually computing it.
     
  7. cphinton

    cphinton

    Joined:
    Jul 5, 2018
    Posts:
    18
    You have probably moved on by this time... I guess this is for others who find this thread. This is not slow because of 840000 multiplications. On a superscalar OOO 3.5 GHz core, this is less than half a millisecond of ALU. There must be another reason.
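    As a rough check of that figure, assuming one multiplication per clock cycle and ignoring memory access entirely:

```latex
\frac{840{,}000\ \text{multiplications}}{3.5 \times 10^{9}\ \text{cycles/s}}
  = 2.4 \times 10^{-4}\ \text{s}
  = 0.24\ \text{ms}
```

    Counting a similar number of additions at the same rate would roughly double this to 0.48 ms, which is still under half a millisecond.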

    Actually, I just tested BakeMesh on a mesh with 4101 vertices in Unity version 2019.4.3f1 and it took ~50 microseconds. It's hard to say what is slow about your case; perhaps you have a crazy number of blend shapes on your mesh? I would try testing a newer version of Unity. If that does not help, I would test with different meshes. For example, there may be fast and slow code paths for meshes of different formats.
     
  8. neginfinity

    That is based on the assumption that the blending process is optimized to the maximum, which is not the case.
     
  9. ianlostcontrl

    ianlostcontrl

    Joined:
    Sep 18, 2021
    Posts:
    3
    Can you post a handwritten implementation?
     
  11. neginfinity

    A handwritten implementation of skinning is a bit of a mouthful and can take a page of code. I don't have one ready to hand (it's been a long time since I wrote one, and it ended up in an NDA project).

    It goes like this.

    For each bone, prepare a matrix which is:
    Inverse(bonePoseTransform) * boneWorldTransform.
    Here bonePoseTransform is the bone matrix used when the skin was bound (the bind pose).

    For each vertex, calculate the position as the sum, over the bones that influence it, of:
    objectSpaceVertex * combinedBoneTransform[boneIndex] * boneWeight
    A vertex normally has up to 4 bone indices and 3 or 4 weights, which together add up to one. If there are 4 influences but only 3 weights, the last weight is calculated by subtracting the sum of the first three from 1.0f.

    The fun part is that the mesh may be split into segments, each affected by a specific set of bones. The bone indices then refer to that set, not to the original bone list.

    If you make a mistake somewhere, the model will explode into a mess.

    All this information is accessible in Unity through scripting, although I recall it may be split in an interesting way between data in the Mesh and data in the SkinnedMeshRenderer.
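    The same steps can be written in column-vector form, which is the convention Unity's Matrix4x4 uses, so the factors appear in the opposite order from the row-vector expressions above. The symbols are introduced here for illustration only: B_j is bone j's current world matrix, P_j is its bind-pose matrix (in Unity, Mesh.bindposes already stores the inverse of the bone's bind-time transform), and w_k and j_k are the weights and bone indices of a single vertex v.

```latex
M_j = B_j \, P_j, \qquad
v' = \sum_{k=1}^{4} w_k \, M_{j_k} \, v, \qquad
\sum_{k=1}^{4} w_k = 1
```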
     
  12. SunnySunshine

    SunnySunshine

    Joined:
    May 18, 2009
    Posts:
    976
    I'm not sure if BakeMesh uses GPU if you have configured Unity to use GPU skinning, but either way, whether using custom method or Unity, getting the results back from GPU to CPU will remove any advantage you had with skinning on GPU.

    If you want to have a look at how they're skinning you can take a look at their shaders from the archive:
    https://unity3d.com/get-unity/download/archive

    Internal-Skinning.compute
    Internal-Skinning-Util.cginc

    The best way to skin fast on the CPU today would probably be to use the Unity Mathematics package and write a custom job for the Unity job system.

    Why would you want to skin on the CPU though? What is it for?
     
  13. neginfinity

    The original question is a year and a half old.

    A person may want to skin on the CPU if they're implementing some sort of modeling software, testing skinning algorithms not available in the core engine, need a quick and dirty way to perform some sort of collision detection against the skin itself, or are trying to capture a model pose.

    Unity does not exactly allow tweaking the way the skinning algorithm works: the shader part that controls the skinning process simply isn't exposed.
     
  14. hickv

    hickv

    Joined:
    Oct 31, 2018
    Posts:
    40
    Hello, sorry to necro this, but I am in a similar situation.

    I am trying to implement a custom BakeMesh() myself for performance reasons, but it doesn't seem to be working at all.

    What's really weird is that I am using this exact same code in a shader and it works just fine...

    Code (CSharp):
    1. [BurstCompile()]
    2. public struct LinearBlendSkinningPositionJob : IJob
    3. {
    4.     [ReadOnly] public NativeArray<Vector3> inVertices;
    5.     [ReadOnly] public NativeArray<BoneWeight> inBoneWeights;
    6.     [ReadOnly] public NativeArray<Matrix4x4> inSkinMatrices;
    7.  
    8.     [WriteOnly] public NativeArray<Vector3> outVertices;
    9.  
    10.     public void Execute()
    11.     {
    12.         // I am calculating inSkinMatrices as follows:
    13.         // for (int i = 0; i < skinMatrices.Length; i++)
    14.         //     skinMatrices[i] = _boneTransforms[i].localToWorldMatrix * bindposes[i];
    15.         for (int i = 0; i < inVertices.Length; i++)
    16.         {
    17.             Vector3 inPos = inVertices[i];
    18.             Vector3 outPos = inPos;
    19.             BoneWeight bw = inBoneWeights[i];
    20.  
    21.             int4 indices = new int4(bw.boneIndex0, bw.boneIndex1, bw.boneIndex2, bw.boneIndex3);
    22.             float4 weights = new float4(bw.weight0, bw.weight1, bw.weight2, bw.weight3);
    23.  
    24.             for (int j = 0; j < 4; j++)
    25.             {
    26.                 Matrix4x4 skinMatrix = inSkinMatrices[indices[j]];
    27.  
    28.                 Vector3 skinnedPos = skinMatrix.MultiplyPoint(inPos);
    29.                 outPos += skinnedPos * weights[j];
    30.             }
    31.  
    32.             outVertices[i] = outPos;
    33.         }
    34.     }
    35. }
     
  15. neginfinity

    HOW exactly does it "not seem to be working"? That is the important part.

    A skeletal mesh requires information that is split across two parts: the SkinnedMeshRenderer and the Mesh itself. If you mix up the order somewhere, it will break.

    Skin rendering code would look something like this.
    Code (csharp):
        // Blend a point by its four bone matrices, each scaled by its weight.
        Vector3 skinTransform(Vector3 p, BoneWeight weight, Matrix4x4[] matrices){
            var result = Vector3.zero;

            result += matrices[weight.boneIndex0].MultiplyPoint(p) * weight.weight0;
            result += matrices[weight.boneIndex1].MultiplyPoint(p) * weight.weight1;
            result += matrices[weight.boneIndex2].MultiplyPoint(p) * weight.weight2;
            result += matrices[weight.boneIndex3].MultiplyPoint(p) * weight.weight3;

            return result;
        }

        void drawSkeletonGizmo(SkinnedMeshRenderer[] skelRend){
            foreach(var rend in skelRend){
                var mesh = rend.sharedMesh;

                var trigs = mesh.triangles;
                var verts = mesh.vertices;
                var weights = mesh.boneWeights;
                // Cache these once: each property access returns a new copy of the array.
                var bones = rend.bones;
                var bindposes = mesh.bindposes;
                var numBones = bindposes.Length;
                var matrices = new Matrix4x4[numBones];
                for(int i = 0; i < numBones; i++){
                    // Bind pose first, then the bone's current world transform.
                    matrices[i] = bones[i].localToWorldMatrix * bindposes[i];
                }
                for(int i = 0; i < trigs.Length; i += 3){
                    var aIdx = trigs[i+0];
                    var bIdx = trigs[i+1];
                    var cIdx = trigs[i+2];

                    var a = verts[aIdx];
                    var b = verts[bIdx];
                    var c = verts[cIdx];

                    var aWeights = weights[aIdx];
                    var bWeights = weights[bIdx];
                    var cWeights = weights[cIdx];

                    // Each corner must be skinned with its own weights.
                    var a1 = skinTransform(a, aWeights, matrices);
                    var b1 = skinTransform(b, bWeights, matrices);
                    var c1 = skinTransform(c, cWeights, matrices);

                    Gizmos.DrawLine(a1, b1);
                    Gizmos.DrawLine(b1, c1);
                    Gizmos.DrawLine(a1, c1);
                }
            }
        }
    This is ripped out of experimental code and is not efficient.

    I'd suggest drawing a skin gizmo first, until it looks right. Then you can write the transformed data into an array.
     
  16. hickv

    Thanks for replying. Looking at your code, I realised my mistake. At line 18 I initialise outPos to inPos, when in fact it should start at zero.

    Here's the working code:
    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using UnityEngine;

    [BurstCompile()]
    public struct LinearBlendSkinningPositionJob : IJob
    {
        [ReadOnly] public NativeArray<Vector3> inVertices;
        [ReadOnly] public NativeArray<BoneWeight> inBoneWeights;
        // inSkinMatrices[i] = boneTransforms[i].localToWorldMatrix * bindposes[i]
        [ReadOnly] public NativeArray<Matrix4x4> inSkinMatrices;

        [WriteOnly] public NativeArray<Vector3> outVertices;

        public void Execute()
        {
            for (int i = 0; i < inVertices.Length; i++)
            {
                Vector3 inPos = inVertices[i];
                BoneWeight bw = inBoneWeights[i];

                // Weighted sum of the vertex transformed by each of its four bones.
                Vector3 skinnedPos =
                    inSkinMatrices[bw.boneIndex0].MultiplyPoint(inPos) * bw.weight0 +
                    inSkinMatrices[bw.boneIndex1].MultiplyPoint(inPos) * bw.weight1 +
                    inSkinMatrices[bw.boneIndex2].MultiplyPoint(inPos) * bw.weight2 +
                    inSkinMatrices[bw.boneIndex3].MultiplyPoint(inPos) * bw.weight3;

                outVertices[i] = skinnedPos;
            }
        }
    }
     
  17. hickv

    Hello again. Just wanted to share some findings here.

    I ran some tests, and SkinnedMeshRenderer.BakeMesh() was actually A LOT faster at skinning than my Burst-compiled LinearBlendSkinning job (Unity 2021.2).

    Maybe Unity is caching the GPU-skinned output, so no skinning actually happens when you call BakeMesh()? That is the only explanation I can think of for such a massive speed difference.

    If anyone gets different results, please share them here.
     
  18. neginfinity

    The original thread is two years old, and a lot has happened since then.

    I cannot comment on the Burst code, because I'm not using that part of Unity (DOTS etc.).

    However, it wouldn't surprise me if BakeMesh ended up faster than C# code, as BakeMesh can be implemented on the C++ side of the engine, with fewer middlemen involved.

    Regarding GPU caching, the docs specify that skeletal mesh baking is always done on the CPU.
     
  19. Baste

    Baste

    Joined:
    Jan 24, 2013
    Posts:
    6,334
    A Raspberry Pi 2 should be able to execute 4,744,000,000 instructions per second, or 4,744,000 instructions per millisecond. So with a good data layout, if what's actually happening is just that number of multiplications and additions, it'd be reasonable to expect it to take in the ballpark of 1 ms on that platform, since it'll have to do a bit of extra memory fetching here and there, and adds and multiplies are not single instructions.

    A modern desktop pc is at least 50 times faster than the Pi 2. If it's taking 15ms, then we're talking the operation taking about 750 times longer than my back-of-napkin ballpark of reasonable. Assuming I'm off by a factor of 10, they're taking 75 times as long as reasonable.
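    Spelling out the arithmetic behind that estimate, taking the roughly 1 ms Pi 2 ballpark and the 50x desktop factor from the post as given:

```latex
t_{\text{desktop}} \approx \frac{1\ \text{ms}}{50} = 0.02\ \text{ms}, \qquad
\frac{15\ \text{ms}}{0.02\ \text{ms}} = 750, \qquad
\frac{750}{10} = 75
```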

    So I'm guessing that the bottleneck is somewhere else?
     
  20. neginfinity

    And yet on a modern desktop, which is much faster than a Raspberry Pi, if you start calculating skeletal mesh transformations on the CPU in the Unity editor to draw gizmos, it can very quickly start to lag.

    I believe guessing this way is the wrong approach in the first place. You need to profile instead. And profiling data is not available for that function. There's more than one place where the slowdown could happen.
     
  21. hickv

    Here's the performance comparison:

    Burst compiled LinearBlendSkinning:
    [attached image: Avocet6.png]

    Unity's BakeMesh:
    [attached image: Lark8.png]

    Edit: tested on an Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz with an HD Graphics 5500 GPU.
     
  22. CodeKiwi

    CodeKiwi

    Joined:
    Oct 27, 2016
    Posts:
    119
    I originally didn't want to use BakeMesh because it creates data I don't need (normals, UVs, etc.). But even after applying a bunch of optimizations, BakeMesh was still at least two times faster than my own implementation. Probably the main performance issue I've had with Mesh in the past came from assuming that mesh.vertices, mesh.normals and so on were just fields (they're actually properties that create a clone of the array on every access). So I'd write loops that indexed mesh.vertices[j] directly and ended up allocating gigabytes of memory per bake.
    get skinned vertices in real time
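    To illustrate the mesh.vertices pitfall described above, here is a small sketch contrasting the two access patterns. The class and method names are hypothetical; the behaviour of the vertices property (returning a copy on each access) is as described in the post.

```csharp
using UnityEngine;

public static class MeshVertexAccess
{
    // Slow: mesh.vertices is a property that copies the whole array,
    // so this loop allocates one full copy per iteration.
    public static Vector3 SumSlow(Mesh mesh)
    {
        Vector3 sum = Vector3.zero;
        int count = mesh.vertexCount;
        for (int j = 0; j < count; j++)
            sum += mesh.vertices[j];
        return sum;
    }

    // Better: take one copy up front and index the local array.
    public static Vector3 SumFast(Mesh mesh)
    {
        Vector3 sum = Vector3.zero;
        Vector3[] vertices = mesh.vertices; // single copy
        for (int j = 0; j < vertices.Length; j++)
            sum += vertices[j];
        return sum;
    }
}
```

    Both methods return the same value; only the number of array copies differs.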
     
    Last edited: Feb 9, 2022
  23. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Seems a bit weird, I wouldn't expect the 3x difference you mentioned. Few considerations I'd try out (and I might try it out myself some time):

    Use the mathematics types (convert vector3 to float3, boneweights to int4, float4, matrix4x4 to float4x4).
    To check whether gpu skinning is cached, maybe try disabling the renderer, and run things in update.
    I'd also consider testing this on one very detailed skinned mesh.

    It's also not hard to imagine that the internal CPU baking procedure runs multithreaded, something like an IJobParallelFor; at least, I don't see why that would be a problem.

    It looks like you are scheduling multiple jobs for multiple meshes though, so unless you're completing the jobs sequentially or something, I wouldn't think the difference will be that big.

    Lastly, all the jobs get the same inVertices and inBoneWeights arrays, right?

    Another difference in computation I can think of - if unity knows that a vertex has <4 bones, computation could be more efficient (Mesh.GetBonesPerVertex).
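    A sketch of what skinning with a variable number of influences could look like, using Mesh.GetBonesPerVertex and Mesh.GetAllBoneWeights. Whether Unity's internal baking does anything like this is a guess; the class name and the skinMatrices parameter are hypothetical, and skinMatrices[i] is assumed to be bones[i].localToWorldMatrix * bindposes[i].

```csharp
using Unity.Collections;
using UnityEngine;

public static class VariableInfluenceSkinning
{
    // outPositions must have at least mesh.vertexCount elements.
    public static void Skin(Mesh mesh, Matrix4x4[] skinMatrices, Vector3[] outPositions)
    {
        Vector3[] vertices = mesh.vertices; // one copy, reused below

        // bonesPerVertex[v] is the number of influences on vertex v;
        // allWeights stores those influences back to back, vertex by vertex.
        NativeArray<byte> bonesPerVertex = mesh.GetBonesPerVertex();
        NativeArray<BoneWeight1> allWeights = mesh.GetAllBoneWeights();

        int weightIndex = 0;
        for (int v = 0; v < vertices.Length; v++)
        {
            Vector3 skinned = Vector3.zero;
            int influenceCount = bonesPerVertex[v];

            // Only loop over the influences this vertex actually has.
            for (int k = 0; k < influenceCount; k++)
            {
                BoneWeight1 bw = allWeights[weightIndex++];
                skinned += skinMatrices[bw.boneIndex].MultiplyPoint3x4(vertices[v]) * bw.weight;
            }

            outPositions[v] = skinned;
        }
    }
}
```

    Note that GetBonesPerVertex returns an empty array for a mesh without bone weights, so this sketch only applies to skinned meshes.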
     
    Last edited: Feb 9, 2022
  24. SunnySunshine

    Maybe try IJobParallelFor instead of looping inside the job. Also, do you really need MultiplyPoint? Shouldn't MultiplyPoint3x4 be enough?
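    A minimal sketch of both suggestions applied to the job posted earlier in the thread: one Execute call per vertex via IJobParallelFor, and MultiplyPoint3x4, which skips the perspective divide and is sufficient when the skin matrices are plain affine transforms. The struct and helper names are hypothetical, the batch size of 64 is an arbitrary starting point, and no performance claim is made for it.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

[BurstCompile]
public struct LinearBlendSkinningParallelJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<Vector3> inVertices;
    [ReadOnly] public NativeArray<BoneWeight> inBoneWeights;
    // Assumed: inSkinMatrices[i] = bones[i].localToWorldMatrix * bindposes[i]
    [ReadOnly] public NativeArray<Matrix4x4> inSkinMatrices;

    [WriteOnly] public NativeArray<Vector3> outVertices;

    // Called once per vertex index; batches of indices run on worker threads.
    public void Execute(int i)
    {
        Vector3 p = inVertices[i];
        BoneWeight bw = inBoneWeights[i];

        outVertices[i] =
            inSkinMatrices[bw.boneIndex0].MultiplyPoint3x4(p) * bw.weight0 +
            inSkinMatrices[bw.boneIndex1].MultiplyPoint3x4(p) * bw.weight1 +
            inSkinMatrices[bw.boneIndex2].MultiplyPoint3x4(p) * bw.weight2 +
            inSkinMatrices[bw.boneIndex3].MultiplyPoint3x4(p) * bw.weight3;
    }
}

public static class SkinningScheduler
{
    // Schedules one Execute call per vertex, 64 vertices per batch (untuned).
    public static JobHandle Schedule(LinearBlendSkinningParallelJob job)
    {
        return job.Schedule(job.inVertices.Length, 64);
    }
}
```

    The caller would call Complete() on the returned JobHandle before reading outVertices.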