huge CPU cost when doing instancing - expected?

stevesan · Dec 18, 2018

Hi all,
I have a minecraft-ish scene with hundreds of thousands of cubes. I turned on instancing, and it dramatically reduced my draw call count (< 500!). However, the Camera.Render profile tree blew up to 184ms. Is this because instancing does incur some CPU cost, and also things like culling for hundreds of thousands of objects?

I realize the right way to do minecraft-like worlds is to build my own large meshes - I'll probably do that soon, but I just wanted to make sure these results were expected, and that I'm not using instancing incorrectly somehow.

Cheers!

richardkettlewell · Dec 19, 2018

Yes, Unity scans through the visible game objects during rendering to figure out what it can instance, so there is some CPU overhead in doing this.

As you say, probably you want to combine into larger meshes yourself.

This is also a very fast alternative that bypasses all Unity's default instancing code, but requires managing the instances yourself, rather than via game objects:
https://docs.unity3d.com/ScriptReference/Graphics.DrawMeshInstancedIndirect.html

stevesan · Dec 19, 2018

Cool, I'll try using DMII. If that is an easy win, seems worth doing/learning.

stevesan · Dec 19, 2018

OK, I did a quick test using DMI (DrawMeshInstanced), and it worked great! However, I'm a little perplexed that it still takes 3ms to call DrawMeshInstanced about 100 times (100k voxels -> ~98 batches, given the max of 1023 instances per call). Is that expected? Camera.Render also incurs 7-8ms, and I'm drawing ~3mil tris.
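For reference, the batch arithmetic here can be checked with a small standalone sketch. It is plain C# with no Unity types; the class and method names are hypothetical, and the only figure taken from the thread is the 1023-instance limit per DrawMeshInstanced call.

```csharp
using System;
using System.Collections.Generic;

public static class InstanceBatching
{
    // Per-call instance limit mentioned in the thread for DrawMeshInstanced.
    public const int MaxInstancesPerBatch = 1023;

    // Number of draw calls needed for instanceCount instances (ceiling division).
    public static int BatchCount(int instanceCount, int batchSize = MaxInstancesPerBatch)
    {
        if (instanceCount < 0) throw new ArgumentOutOfRangeException(nameof(instanceCount));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return (instanceCount + batchSize - 1) / batchSize;
    }

    // Yields one (Start, Count) range per draw call; the last range may be shorter.
    public static IEnumerable<(int Start, int Count)> Batches(
        int instanceCount, int batchSize = MaxInstancesPerBatch)
    {
        for (int start = 0; start < instanceCount; start += batchSize)
            yield return (start, Math.Min(batchSize, instanceCount - start));
    }
}
```

With 100,000 instances, `InstanceBatching.BatchCount(100000)` gives 98 calls: 97 full batches of 1023 plus a final batch of 769.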

Also, I noticed DrawMeshInstanced doesn't take a bounding box. Is it always computing one on its own? That seems wasteful, since I can compute it and store it.

Again, there are lots of things I could do to reduce batches and tris (like not rendering hidden voxels, which are likely the majority), but I just wanna make sure I'm using it right.

richardkettlewell · Dec 19, 2018

I don't know precisely what CPU work DMI does, but there will still be some CPU work, e.g. uploading all those matrices and other per-instance params to the GPU, probably computing an AABB from the matrices and the mesh bounds, and a few other bits. I think it calculates an inverse of every matrix on the CPU too. If your boxes use simple unscaled axis-aligned matrices, that is a waste of memory and performance.

DMII is even faster as it relies on you managing your own per instance data in a ComputeBuffer, and computing your own bounds, so you can choose to be smarter about how you do that stuff based on your use case.

Your numbers sound reasonable. Perhaps try using DMII and compare again; you should find it's almost no CPU work. The docs page I linked gives a full example of how to use it.

stevesan · Dec 19, 2018

got it - was trying to avoid compute buffers, but now's a good time to get into it

hippocoder · Dec 19, 2018

I'm in the same boat, it seems all scary but I'll have to have a peek sooner or later at DMII.

Arathorn_J · Jan 11, 2019

I've been running Graphics.DrawMeshInstancedIndirect and it's worked beautifully so far, but I'm running into a performance bottleneck when I call ComputeBuffer.SetData on around 200,000 values of Matrix4x4 positions...

Essentially I call

PositionBuffer.SetData(PositionsUnitGroupA); //runs at 0.23 ms

PositionBuffer.SetData(PositionsUnitGroupB); //runs at 13.39 ms

Is there some memory allocation bottleneck or am I hitting a performance limitation of using the compute buffer to set the position of my meshes on each frame?

The crazy thing is that all the CPU and GPU calculations per frame take a total of 17.35 ms, so without this bottleneck the frame for handling this should only be around 4ms.

Any tips or suggestions would be welcome.

richardkettlewell · Jan 11, 2019

Uploading 200,000 matrices (each being 64 bytes) is almost 13MB of data. That's quite a lot to be uploading each frame.

My advice:
- Are the matrices changing? If not, just call SetData once at initialization, and not during Update.
- If you must re-upload, due to changing data, then choose a smaller data format for your positions. A matrix is pretty much the largest format you could choose. You can make it smaller in a few ways:
* Almost certainly, the 4th row of each matrix will always be 0,0,0,1. Don't store/upload it; simply set it in the shader. Now you have a Matrix4x3 (12 floats) instead of a 4x4 (Unity has no built-in data type for this).
* You are still storing 9 floats for the rotation (3x3 matrix). A quaternion only requires 4 floats, and could be converted to a matrix on the GPU. Or even better, if you don't have any rotation, the 3x3 will always be 1,0,0 / 0,1,0 / 0,0,1, so simply hardcode this on the GPU too. Or, another approach, if you only have simple rotations (sin/cos) you could just upload "angle around X" etc, and build the matrix on the GPU from those.

Best case scenario is that you don't need rotation, and you only upload once at initialization time (static positions). Then you only need 3 floats per voxel instead of 16, and you don't need to re-upload each frame.
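To put rough numbers on the formats listed above, here is a minimal standalone sketch (plain C#, hypothetical names) that computes the per-upload size for 200,000 instances. It assumes 4-byte floats and tightly packed data with no padding, which may not match a real buffer layout.

```csharp
public static class UploadSize
{
    private const int BytesPerFloat = 4;

    // Bytes uploaded per SetData call for tightly packed float data.
    public static long Bytes(int instanceCount, int floatsPerInstance)
    {
        return (long)instanceCount * floatsPerInstance * BytesPerFloat;
    }
}

// For 200,000 instances:
//   UploadSize.Bytes(200000, 16)  full 4x4 matrix           -> 12,800,000 bytes
//   UploadSize.Bytes(200000, 12)  matrix without 0,0,0,1    ->  9,600,000 bytes
//   UploadSize.Bytes(200000, 7)   quaternion + position     ->  5,600,000 bytes
//   UploadSize.Bytes(200000, 3)   position only             ->  2,400,000 bytes
```

Under those assumptions, a quaternion plus a position is a bit under half the size of the full matrix, and position-only data is under a fifth.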

Arathorn_J · Jan 11, 2019

Thank you so much for the reply,

I was thinking the 13MB wasn't very much considering the throughput on modern systems, but your reply really gives me some good tips for reducing this overhead.

Yes, the matrices change each frame. Each one is a unit in the game (a combat unit) that moves and pathfinds around the map, both in formations and as an individual. Using the new job/burst system this is actually very performant, and I was re-implementing my system from using baked meshes (like a flip book) to using DMII with baked bone positions and skin weights to do the skinning on the GPU, which cut my memory consumption by about 99%.

I considered just using Instanced instead of InstancedIndirect because of this issue, but "SetData" on the ComputeBuffer can take a NativeArray, which saves the time of converting a NativeArray to an array, something I have to do for the less complex "Instanced" call. Having to break up the call into chunks of 1023 slows down the system as well.

Is there a function or reference for converting a Quaternion to a 3x3 for Unity shaders? That, plus only passing a float3 for the position, will save over 50% on this upload.

Again thanks for the reply, this has helped immensely with how I'm approaching this issue.

richardkettlewell · Jan 11, 2019

Arathorn_J said: ↑

Using the new job/burst system this is actually very performant

Awesome!

Arathorn_J said: ↑

Is there a function or reference for converting a Quaternion to a 3x3 for unity shaders?

Sure, this is how we do it in scripts (this function is available on GitHub somewhere; we open-sourced our C# at some point recently):

Code (CSharp):

public static Matrix4x4 Rotate(Quaternion q)
{
    // Precalculate coordinate products
    float x = q.x * 2.0F;
    float y = q.y * 2.0F;
    float z = q.z * 2.0F;
    float xx = q.x * x;
    float yy = q.y * y;
    float zz = q.z * z;
    float xy = q.x * y;
    float xz = q.x * z;
    float yz = q.y * z;
    float wx = q.w * x;
    float wy = q.w * y;
    float wz = q.w * z;

    // Calculate 3x3 matrix from orthonormal basis
    Matrix4x4 m;
    m.m00 = 1.0f - (yy + zz); m.m10 = xy + wz; m.m20 = xz - wy; m.m30 = 0.0F;
    m.m01 = xy - wz; m.m11 = 1.0f - (xx + zz); m.m21 = yz + wx; m.m31 = 0.0F;
    m.m02 = xz + wy; m.m12 = yz - wx; m.m22 = 1.0f - (xx + yy); m.m32 = 0.0F;
    m.m03 = 0.0F; m.m13 = 0.0F; m.m23 = 0.0F; m.m33 = 1.0F;
    return m;
}

Notice it's actually building a 4x4 matrix, but see how only the 3x3 contains "real" data; the rest is hard-coded to 0 and 1.
You could also make this code much more concise by using float3, e.g. the first 3 lines can be float3 xyz = q.xyz * 2.0f;, and the next 3 could be q.xyz * xyz, followed by q.xxy * yzz, etc.
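As a sanity check on the formula, here is a small standalone sketch that applies the same coordinate products to a 90 degree rotation about Z. It is plain C# with no Unity types: the names are hypothetical, the quaternion is passed as four floats, and the result is a float[row, col] array rather than a Matrix4x4.

```csharp
using System;

public static class QuaternionCheck
{
    // Builds the 3x3 rotation block as m[row, col] from a unit quaternion,
    // using the same products as the Rotate() function above.
    public static float[,] ToMatrix3x3(float qx, float qy, float qz, float qw)
    {
        float x = qx * 2f, y = qy * 2f, z = qz * 2f;
        float xx = qx * x, yy = qy * y, zz = qz * z;
        float xy = qx * y, xz = qx * z, yz = qy * z;
        float wx = qw * x, wy = qw * y, wz = qw * z;

        return new float[3, 3]
        {
            { 1f - (yy + zz), xy - wz,        xz + wy        },
            { xy + wz,        1f - (xx + zz), yz - wx        },
            { xz - wy,        yz + wx,        1f - (xx + yy) },
        };
    }

    // Multiplies the matrix by the column vector (vx, vy, vz).
    public static float[] Apply(float[,] m, float vx, float vy, float vz)
    {
        var result = new float[3];
        for (int row = 0; row < 3; row++)
            result[row] = m[row, 0] * vx + m[row, 1] * vy + m[row, 2] * vz;
        return result;
    }
}
```

For the quaternion (0, 0, sin 45°, cos 45°), applying the matrix to (1, 0, 0) gives approximately (0, 1, 0), i.e. the X axis rotated onto the Y axis, which is what a 90 degree turn about Z should do.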

Good luck!

PS. if not all 200,000 are moving, consider splitting into 2 draw calls (static + dynamic), so you can upload the smallest possible compute buffer each frame.
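The static/dynamic split could look something like the sketch below. It is plain C# with hypothetical names, and the `isMoving` predicate is an assumption standing in for whatever per-unit state the game tracks. The static list would be uploaded once, and only the dynamic list re-uploaded each frame.

```csharp
using System;
using System.Collections.Generic;

public static class InstanceSplit
{
    // Splits instances into a static list (upload once) and a dynamic list
    // (re-upload each frame), preserving the original order within each list.
    public static (List<T> Static, List<T> Dynamic) Partition<T>(
        IEnumerable<T> instances, Func<T, bool> isMoving)
    {
        var staticList = new List<T>();
        var dynamicList = new List<T>();
        foreach (T item in instances)
        {
            if (isMoving(item)) dynamicList.Add(item);
            else staticList.Add(item);
        }
        return (staticList, dynamicList);
    }
}
```

If units switch between moving and idle often, the static buffer has to be rebuilt whenever membership changes, so this pays off mainly when most units stay put for many frames.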

Arathorn_J · Jan 11, 2019

Thanks a bunch, I'll be working on getting this to work in my skinning shader. It doesn't look like any of the calculations are going to impact my performance.

That's a good call on the dynamic vs static split. Right now I'm load testing for the possibility of all the units being in motion, so I assume dynamic, but I'm thinking the farther LOD units might not need nearly as much updating, and I can spread that out a bit if I'm still running into performance issues after the optimizations you have suggested are implemented.

hippocoder · Jan 11, 2019

Is ECS-ifying this bit an option for you?

Arathorn_J · Jan 11, 2019

I've struggled with whether to go full ECS on this. The main reason not to is that it's a lot easier to debug formations and movements with GameObjects than with entities, as I can see all the transformation data visually right on the terrain and in the editor view to help make sure it's matching up with the animation rotation and movement. At this point just using jobs and burst has taken care of almost all my performance issues; the only bottleneck right now is the SetData call on the compute buffer.

Arathorn_J · Jan 17, 2019

I wanted to update with the function I used in the shader in case anyone else is looking for the solution to this.
You just need to pass the quaternion as a float4 to the compute buffer as your rotation parameter, and in the setup function you call this function, which was adapted from what richardkettlewell shared above.

Code (HLSL):

float3x3 ConvertQuaternion(float4 q)
{
    // Precalculate coordinate products
    float x = q.x * 2.0F;
    float y = q.y * 2.0F;
    float z = q.z * 2.0F;
    float xx = q.x * x;
    float yy = q.y * y;
    float zz = q.z * z;
    float xy = q.x * y;
    float xz = q.x * z;
    float yz = q.y * z;
    float wx = q.w * x;
    float wy = q.w * y;
    float wz = q.w * z;

    // Calculate 3x3 matrix from orthonormal basis.
    // Note: m[i] holds column i of the rotation matrix; setup() below
    // writes each m[i] into column i of unity_ObjectToWorld.
    float3x3 m;
    m[0][0] = 1.0f - (yy + zz);
    m[0][1] = xy + wz;
    m[0][2] = xz - wy;

    m[1][0] = xy - wz;
    m[1][1] = 1.0f - (xx + zz);
    m[1][2] = yz + wx;

    m[2][0] = xz + wy;
    m[2][1] = yz - wx;
    m[2][2] = 1.0f - (xx + yy);
    return m;
}

void setup()
{
    float3 positionData = positionBuffer[unity_InstanceID];
    float4 rotationQuaternion = rotationBuffer[unity_InstanceID];
    float3x3 data = ConvertQuaternion(rotationQuaternion);

    // Translation goes in the 4th column; the rotated basis vectors fill columns 1-3.
    unity_ObjectToWorld._14_24_34_44 = float4(positionData[0], positionData[1], positionData[2], 1);
    unity_ObjectToWorld._11_21_31_41 = float4(data[0][0], data[0][1], data[0][2], 0);
    unity_ObjectToWorld._12_22_32_42 = float4(data[1][0], data[1][1], data[1][2], 0);
    unity_ObjectToWorld._13_23_33_43 = float4(data[2][0], data[2][1], data[2][2], 0);
}
