Search Unity

  1. Welcome to the Unity Forums! Please take the time to read our Code of Conduct to familiarize yourself with the forum rules and how to post constructively.
  2. Dismiss Notice

Feedback Static Batching object sorting wrong, needs scriptable API

Discussion in 'Graphics Dev Blitz Day 2023 - Q&A' started by Error-md1, May 25, 2023.

  1. Error-md1


    Nov 9, 2022
    Unity's static batching system decreases draw call overhead by combining static meshes into one large mesh buffer, which allows the renderer to draw each object by setting a start and end index in the buffer to draw from rather than binding a new mesh for every call. However, it also has the capability to massively reduce the total number of drawcalls, but to what extent it does this (if at all) is completely dependent on the order the meshes are appended in the combined buffer.

    This optimization happens when it finds multiple draws using the same shader variant and shader inputs. If their vertices form a continuous range in the combined vertex buffer, all meshes are drawn in a single call. This decreases CPU overhead but has the potential to increase GPU overhead by causing overdraw due to a lack of front-to-back sorting. If reducing CPU overhead is our goal, the ideal case is each visible material will only take one call to draw all meshes using it. What stops this from happening is if the objects of the same shader/inputs aren't continuous in the buffer. Then the renderer has to issue a new drawcall at each gap to bind a new start and end index.

    As the static batching system sorts renderers by material before appending them, the most common cause of a gap is simply culled renderers. Thus, in order to optimize the drawcall reduction we want the renderers to also be sorted spatially such that the visible renderers within the current view-frustrum and occlusion zone are close neighbors in the combined index buffer.

    Prior to 2023, the batching system made no attempt to spatially sort renderers. As a practical matter, most renderers in any scene of a reasonable size will be culled. Thus there are almost certainly gaps between every draw's indices and the drawcall reduction never happens. This issue has been pointed out on this forum (See: The forum-member Apkdev created a patch that used some hacky methods to redirect Unity's static batching mesh-sorting function to a custom one that did a Hilbert-Curve index sort, one of the best methods for spatially sorting 2d points. Unity responded, and "patched" it in 2023. The problem is instead of copying Apkdev's correct implementation, they literally copy-pasted another form member's suggested sort which is so hilariously wrong as to be ludicrous.

    Code (CSharp):
    1.                    // simple "spatial" sort on axes (prefer x/z plane)
    2.                    auto lPos = lhs.localToWorld.GetPosition();
    3.                    auto rPos = rhs.localToWorld.GetPosition();
    4.                    if (lPos.x != rPos.x)
    5.                        return lPos.x < rPos.x;
    6.                    if (lPos.z != rPos.z)
    7.                        return lPos.z < rPos.z;
    8.                    if (lPos.y != rPos.y)
    9.                        return lPos.y < rPos.y;
    If it isn't blatantly obvious, this only sorts on the X-axis unless objects have exactly the same X value, then the same on Z then Y. This is not a spatial sort in any form, and does not help in the vast, vast majority of cases. To make matters worse, the code responsible for the sort was moved to C++, making it impossible to patch the function like we used to.

    Replace the wrong "treat each axis as 1d" sort with a Hilbert Index sort. Such a sort should operate on the render's bounds center rather than the transform. Additionally, during the material sort a renderer's material submeshes should be split and sorted separately to prevent multi-material meshes from automatically breaking continuity in the index buffer.

    Here's a useful reference on implementing a hilbert index sort, and Apkdev's implementation (which is MIT licensed):

    Additionally, it would be very useful to have an API to override the sorting method with a user-defined one or an API to resort the results of the default method. The Hilbert curve sort is good, but can fall apart in certain situations. For example having a thin wall or floor between rooms can result in objects on either side being grouped together. In this case it might be useful to make your own sort that operates on user define zones where objects within each zone are sorted together, then the zones are sorted and their meshes appended. This would allow the user to guarantee each room's meshes are always continuous, and that meshes of the connected rooms are also neighbors.

    The ultimate solution however is to use indirect draws for graphics API's that support it (DX12, Vulkan). This allows drawing each bin of shader variant+shader input data combination with a single an indirect draw call regardless of gaps or sorting between submeshes. This guarantees the absolute ideal state for reducing CPU overhead no matter what, and also allows for the sorting the draws within the indirect buffer based on Z distance to simultaneously achieve optimal overdraw reduction.
  2. richardkettlewell


    Unity Technologies

    Sep 9, 2015
    I'll try to reply, as this feedback was also raised on the original thread. I must point out that it was not my code, nor my investigation/solution. However, I will try to provide the best information i can.

    Some of the language you're using here isn't great. I'm hoping that me replying can provide some context and constructive discussion.

    It is sorting the objects based on their coordinates. I think that makes it a spatial sort. But admittedly I haven't checked the definition. But anyway, the point of the change is this:
    * It is stable and predictable (please remember the previous code sorted by Instance ID, which was as good as rand())
    * It is grouping objects based on some locality rules. Sure, it's not ideal, but there are a great many cases where this code gives improvements, and frankly it is hard to argue that the old code, which I will repeat, was random, could be seen as better in any use case
    * It is very low risk. I strongly believe that this code improves some use cases a lot, others a bit, probably some use cases are not really improved at all, and, crucially, that no use cases are worse than before
    * It is very fast to compute.

    We tested it with quite a lot of content, and based on our findings, I must disagree with this. It was a considered change, tested on a variety of content.

    Curves like these require integer based coordinates. This works great for example in texture data, where the concept of the overall bounds is baked into the problem domain (texture dimensions and integer texel coordinates)

    For positions, some quantization is required. Which in turn requires some assumptions to be made about the world scale. Or a scale project setting that is an advanced topic that is hard to expose in a nice way. You can see the quantization in the linked sample. I'm not saying we couldn't find something that works here, but already it is a more complex problem, and the risk is higher. We wanted to make a positive change in a reasonable time frame.

    e.g. check this out in the sample, explaining a hardcoded value:
    This is fine for random internet code, and I'm sure it works great for loads of people, but we can't just throw this kind of stuff into the engine for everybody.

    I agree that losing the possibility for customisation is not goood. Some unrelated optimization work moved the code into C++, because running it in C# was slower by comparison.

    On the topic of performance, the linked sample is also doing a bunch of for-loops and looks less than trivial, at a glance. So work would have to be done to ensure it was fast enough. This code runs during Enter Playmode, which is a huge pain point performance-wise. So we must be very careful not to make it slower. It might be fine, but again, it's more risk.

    I hope this explains things a little better. We will consider whether we can add an API for custom sorting.
    Last edited: May 26, 2023
  3. Error-md1


    Nov 9, 2022
    Sorry for being rude. I'm just very frustrated as this issue has been a thorn in my side for the longest time. I had an (unsupported) solution working in 2021, and got some pretty measurable performance gains out of it. The issue was recognized, but instead of exposing an API for the sorting it got buried deeper and put out of reach. To make matters worse, the solution that ended up getting implemented looked from my perspective like a direct translation to c++ of some hastily written code from the forums. Also doesn't help that the sorting was only brought to 2023, 2022 has the unfortunate problem of having no sort and no way to patch it.

    The problem with the current sort is its very strongly anisotropic. You will get performance benefits simply by rotating levels. If you have two disconnected rooms, and they are aligned on the x-axis, the current algorithm will very likely interleave the meshes of the two rooms in memory resulting in close to 0 draw grouping. Rotate the setup 90 degrees and suddenly the batching will be more efficient. If you have wide-open spaces, you'll find you get better batching performance looking along the x-axis than along the z, creating a pretty bizarre and difficult to debug performance issue. This sorting is pretty much all around bad.

    As far as the actual Hilbert curve index algorithm goes, its just one form of space filling curve used for such a sort. Space filling curve indicies are the best (that I know of anyways) way to translate 2d or 3d positions into 1d coordinates in a manner that preserves their spatial information. They're used absolutely everywhere in graphics applications where you need to put multi-dimensional information into 1d memory arrays, from octtrees to the physical layout of textures in memory. They are very well proven for use-cases such as this. The most commonly used curve for real-time is the Z-Order Curve which is blazing fast, being just a handful of bit-shift operations. If you're worried about compute time, that would be the one to go for. The reason I suggested the Hilbert curve (and why Apkdev used it) is it has far better spatial properties than the Z-Order curve at the cost of being pretty complex to implement it in an efficient manner. As the sorting for batching is done at build time, it seems like its worth the trade-off.

    As you pointed out, the issue with a curve index sort is you need to define a fixed domain. They work on integer coordinates, so you have to define some bounding volume around your data to be able to express the positions as integers. With the Z-order curve, you express each position as three 21 bit unsigned integer fractions of the bounding box's size. That gives you a granularity of ~1/2 millionth of the bounding box's scale. Static batching doesn't need extremely fine scale spatial sorting, just enough to ensure what is onscreen is all neighboring in memory. Assuming that our game is at human scale and 1 unity unit = 1m, we could easily get away with a 320 km cubed bounding volume which will give us sorting on the 0.1m scale. That of course makes sweeping assumptions about what kind of game we're making, and what scale we have chosen unity units to represent. A much better solution would be to fit the bounding volume to the content. This should ensure the best granularity no matter what scale we're working at

    The best solution is of course to add an API so we can figure out the best solution for our own use-cases, but that's significantly more complicated than modifying a single function to sort things better.
  4. richardkettlewell


    Unity Technologies

    Sep 9, 2015
    You're right about an API.

    If you have some time, give this custom Editor build a go and let me know what you think:

    It adds a new script API: StaticBatchingUtility.SetCustomSortingDelegate.
    You give it a method, and that method should be called with System.Span<StaticBatchingUtility.BatchingData> instances as the parameter. You can then sort these however you like.

    I've not had any chance to do any testing on it yet, or get any internal feedback, but it's hopefully a starting point.

    One question in my mind is, can you get the bounds from the provided renderer instance ID? Last time I checked, I think it wasn't trivial to get a renderer from just an instance ID. But, assuming you can, you should be able to get the bounds from the renderer, if you want to sort using that instead of just the center.

    EDIT: Looks like we have, but it's editor-only.
    Last edited: Jun 6, 2023
    Peter77 and goncalo-vasconcelos like this.