SRP Batcher and GPU Instancing?

Discussion in 'General Graphics' started by techmage, Feb 22, 2020.

  1. techmage

    techmage

    Joined:
    Oct 31, 2009
    Posts:
    2,133
    So I am playing with the SRP Batcher a bit and don't understand how it relates to instancing.

    For one, the 'Batches' and 'Saved by batching' numbers seem messed up when the SRP Batcher is enabled. Mine say 8 batches and -8 saved by batching. Is that a bug? Or am I supposed to read this differently now?

    Does the 'Enable GPU Instancing' checkbox on materials still do anything? I am looking at drawcalls by looking in the FrameDebugger. I have a scene with 8 cubes in it for testing. I notice they all get put in one 'SRP Batch' entry in the Frame Debugger. But the entry says '8 drawcalls'. Checking 'Enable GPU Instancing' seems to have zero effect on this. Does this mean that single 'SRP Batch' has batched 8 drawcalls to a single batch?

    Or does that one 'SRP batch' represent 8 separate drawcalls? If that is the case why does 'Enable GPU Instancing' checkbox not affect this? Can you not have GPU instancing with the SRP Batcher?

    Does 'Enable GPU Instancing' checkbox on materials still have any relevancy with SRP Batcher enabled?

    Also, does relying on Instance Properties set via the MaterialPropertyBlock on the renderer still have any relevancy? Or does this new SRP Batcher make that obsolete and it's better to just use separate materials for everything?
     
  2. Dragnipurake97

    Dragnipurake97

    Joined:
    Sep 28, 2019
    Posts:
    40
    You can't instance and use the SRP batcher on the same shader since they require different types of buffers.

    As for draw calls still being high: as far as I understand it, the SRP Batcher doesn't merge meshes like static batching does, so it has higher draw call counts. Instead it takes advantage of persistent data on the GPU to reduce the expense of setting up lots of different draw calls. So an "SRP batch" just refers to all the draw calls that use a given shader variant whose data lives persistently on the GPU, so it isn't set up again for each call, which reduces the cost greatly.
     
  3. techmage

    techmage

    Joined:
    Oct 31, 2009
    Posts:
    2,133
    Static batching I don't think is mutually exclusive to SRP batching? I think static batching does a merge of meshes in the build step, to send pre-merged mesh data to the GPU. Or is it somehow mutually exclusive?

    So is the SRP Batcher really a replacement for both instancing and static batching? Does it actually produce better perf in general than both instancing and static batching?
     
  4. Dragnipurake97

    Dragnipurake97

    Joined:
    Sep 28, 2019
    Posts:
    40
    Static batching merges at runtime, not build time, since it batches what is currently in view, which is why it has a not insignificant CPU/memory cost. SRP batching is better than static batching in general since it's quicker and uses less memory. https://blogs.unity3d.com/2019/02/28/srp-batcher-speed-up-your-rendering/ goes into more detail. I think SRP batching overrides static batching provided it's enabled and the shader supports it.

    SRP batching doesn't replace instancing: if there are lots of the same mesh you will probably get slightly better performance using instancing, since it uses less memory. I've only just converted my uber-shader over to it. I'm using it where I would normally use static batching and it seems to be running better, or generating fewer batches anyway.
     
  5. techmage

    techmage

    Joined:
    Oct 31, 2009
    Posts:
    2,133
    So even though the stats panel and the frame debugger give no signifier that instancing is occurring under the SRP Batcher, it still is occurring? So this is just a currently missing feature under the SRP batcher? Or is there a way to discern that GPU instance batching is occurring with the SRP Batcher?
     
  6. Dragnipurake97

    Dragnipurake97

    Joined:
    Sep 28, 2019
    Posts:
    40
    Instancing is not the same as SRP batching. Instancing is where one mesh is sent to the GPU and rendered multiple times using a set of matrices and/or MaterialPropertyBlocks (to change material parameters per mesh instance). This saves memory since there is just the one mesh. A set of the same mesh drawn in an instance call would be one "batch" but instancing has the limitation of one batch per unique mesh.

    SRP batching sends all geometry to the GPU and renders each mesh in its own draw, which costs more memory and maybe CPU cycles, but it can batch different meshes together as long as they use the same shader variant. The SRP batching stats can be a bit deceptive, though, since you can have more draw calls than with static batching and still be faster: the expensive part is the state change between draw calls, and the SRP Batcher keeps data on the GPU, avoiding a costly switch between draws (provided they use the same shader variant).

    You can use both in conjunction with each other (which could be the best approach depending on your situation), so if you have lots of unique meshes like decorations or something they can be SRP batched, whereas highly repeating meshes with the same mesh/material such as walls could be instanced instead. They would need different shaders due to their buffers, though, so you may end up with two versions of the same shader: one with an instancing buffer and one with an SRP-compatible buffer.
     
  7. techmage

    techmage

    Joined:
    Oct 31, 2009
    Posts:
    2,133
    Thank you, that brings a lot of clarity.

    So to set up these different buffers,

    For SRP batching I do this?
    • You must declare all built-in engine properties in a single CBUFFER named “UnityPerDraw”. For example, unity_ObjectToWorld, or unity_SHAr.
    • You must declare all Material properties in a single CBUFFER named UnityPerMaterial.
    Then I assume for instancing I do this:
    UNITY_INSTANCING_BUFFER_START(Props)
    UNITY_DEFINE_INSTANCED_PROP(fixed4, _Color)
    UNITY_INSTANCING_BUFFER_END(Props)
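
    As a minimal sketch of the two SRP Batcher requirements listed above, assuming URP: the shader name and the `_BaseColor` property below are hypothetical, and `UnityPerDraw` is not declared by hand here because URP's `Core.hlsl` include is assumed to provide it along with `TransformObjectToHClip`. Only the material properties need their own `UnityPerMaterial` block.

```hlsl
Shader "Custom/SRPBatcherCompatibleUnlit" // hypothetical name
{
    Properties
    {
        _BaseColor ("Base Color", Color) = (1, 1, 1, 1)
    }
    SubShader
    {
        Tags { "RenderType" = "Opaque" "RenderPipeline" = "UniversalPipeline" }
        Pass
        {
            HLSLPROGRAM
            #pragma vertex vert
            #pragma fragment frag
            // Core.hlsl pulls in the built-in engine properties
            // (unity_ObjectToWorld etc.) inside the UnityPerDraw CBUFFER.
            #include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Core.hlsl"

            // Every material property goes in this single CBUFFER.
            CBUFFER_START(UnityPerMaterial)
                half4 _BaseColor;
            CBUFFER_END

            struct Attributes
            {
                float4 positionOS : POSITION;
            };

            struct Varyings
            {
                float4 positionHCS : SV_POSITION;
            };

            Varyings vert(Attributes IN)
            {
                Varyings OUT;
                OUT.positionHCS = TransformObjectToHClip(IN.positionOS.xyz);
                return OUT;
            }

            half4 frag() : SV_Target
            {
                return _BaseColor;
            }
            ENDHLSL
        }
    }
}
```

    With a shader like this selected, the Inspector should report the shader as SRP Batcher compatible; a property declared outside `UnityPerMaterial` is the usual reason it does not.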
     
  8. techmage

    techmage

    Joined:
    Oct 31, 2009
    Posts:
    2,133
    Also another question if you know. Do texture arrays help SRP batching at all?
     
  9. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    On static batching merging at runtime rather than at build time: I don't believe that's true. In the editor Unity will generate the static batched meshes when you press play, but for standalone builds the static batched meshes are generated during the build. The original meshes may not exist in the build anymore if they're not used elsewhere.

    On SRP batching being quicker and using less memory than static batching: less memory, yes. Quicker, no. Best case, static batching and SRP batching are about the same. Worst case, static batching is potentially much faster.

    On SRP batching overriding static batching: it does not, because static batching is still faster.

    With an extremely high number of objects it's possible for GPU instancing to be faster too, which is why I believe the DOTS hybrid renderer is using instancing instead of the SRP Batcher alone.


    The difference between all of these methods comes down to understanding what's the expensive part of rendering multiple "things" for the CPU. It's mainly how much communication needs to happen between the CPU and GPU. The more things that need to be changed / sent from the CPU to the GPU between each object that gets rendered, the more expensive it is, with certain types of things changing being more expensive than others.


    Basic rendering:
    Take a single mesh. Upload mesh data to the GPU once at startup. Every frame set the shader to use, upload all of the material data, and call Draw(). Rinse and repeat for every object.
    Benefits: Works on literally every GPU ever made, and is extremely flexible. Unity does some work to try to not change the shader or material data if it doesn't have to, so if two meshes use the exact same material & shader variant it'll sometimes only update the position data and render those meshes one after another.

    Dynamic batching:
    Take several meshes with very low poly counts and exactly the same material & shader variant and combine into a single mesh, every single update. Send that mesh data to the GPU, set the shader & material data for that single material, call Draw() once. Done.
    Benefits: Very, very fast on the GPU, even with the cost of uploading a new mesh every frame as it's only rendering a single "thing". Dynamic batching is limited to very simple meshes so this usually isn't a ton of data. Particle systems work like this too. Main cost comes from the CPU combining meshes. Limited per object data can be stored in the mesh vertex data.

    Static batching:
    Take several different meshes with exactly the same material and combine them into a single mesh once during build time / play mode startup. Upload that data to the GPU once on startup. Set the shader variant & material data for that single material, call DrawIndexed() for each renderer component as an offset & range of triangle indices. Done.
    Benefits: Very fast on the CPU & GPU, but uses a lot of memory and is, well, static. Is quite flexible in terms of allowing individual renderer components to be hidden for occlusion, LOD, or via script. Unity will also swap the shader variant and data for things like dynamic lights & shadows so only the renderer components that are affected need to be rendered again rather than the whole batched mesh, but that incurs some cost in doing that swap. Per object data can be stored in the mesh vertex data.

    GPU Instancing:
    Gather up data from several identical meshes with exactly the same material & shader variant. Upload mesh data to the GPU once at startup. Put all of the unique-per-instance data into arrays that can be indexed by instance ID. For basic instancing that's the object to world matrices only. For more complex setups that can be additional arbitrary instanced data, like a color, or UV offsets, etc. (no textures). Those arrays can be constructed manually via script when calling DrawMeshInstanced, or set on renderer components using a MaterialPropertyBlock, the latter of which Unity will put into arrays for you internally.
    Benefits: Very fast on the CPU to render a lot of the same object. Actually a bit slower on the GPU due to the indexed data look up happening in the shader, but overall faster because usually when drawing a few thousand of something often the GPU is sitting idle waiting for the CPU to tell it what to render next so the additional GPU cost is hidden.
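
    To make "arrays that can be indexed by instance ID" concrete on the shader side, here is a minimal sketch for the built-in render pipeline using the instancing macros from `UnityCG.cginc`. The shader name is hypothetical, and `_Color` is the per-instance property one would set through a MaterialPropertyBlock; this is an illustration, not a description of Unity's internals.

```hlsl
Shader "Custom/InstancedColor" // hypothetical name
{
    Properties
    {
        _Color ("Color", Color) = (1, 1, 1, 1)
    }
    SubShader
    {
        Tags { "RenderType" = "Opaque" }
        Pass
        {
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag
            // Generates the instanced shader variant used when
            // 'Enable GPU Instancing' is checked on the material.
            #pragma multi_compile_instancing
            #include "UnityCG.cginc"

            struct appdata
            {
                float4 vertex : POSITION;
                UNITY_VERTEX_INPUT_INSTANCE_ID
            };

            struct v2f
            {
                float4 vertex : SV_POSITION;
                UNITY_VERTEX_INPUT_INSTANCE_ID
            };

            // Per-instance data lives in an array indexed by instance ID.
            UNITY_INSTANCING_BUFFER_START(Props)
                UNITY_DEFINE_INSTANCED_PROP(fixed4, _Color)
            UNITY_INSTANCING_BUFFER_END(Props)

            v2f vert(appdata v)
            {
                v2f o;
                UNITY_SETUP_INSTANCE_ID(v);
                UNITY_TRANSFER_INSTANCE_ID(v, o);
                // The object-to-world matrix is also fetched per instance here.
                o.vertex = UnityObjectToClipPos(v.vertex);
                return o;
            }

            fixed4 frag(v2f i) : SV_Target
            {
                UNITY_SETUP_INSTANCE_ID(i);
                // This indexed lookup is the extra GPU cost mentioned above.
                return UNITY_ACCESS_INSTANCED_PROP(Props, _Color);
            }
            ENDCG
        }
    }
}
```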

    SRP Batcher:
    Gather up data from several different meshes with the same shader. Upload mesh data once at startup. Put per-material data into long lists and upload once at startup*. Put per-object data into long lists and upload every frame. Render each mesh by setting the offsets to the list data and using a Draw().
    Benefits: Much less data being uploaded between each draw and much less communication between the CPU and GPU. Unlike traditional rendering switching between different sets of material data is essentially free as the data is already uploaded.

    The one thing I don't yet understand about the SRP Batcher is what happens behind the scenes when you modify a material's properties or create a new material at runtime. I haven't tested this yet, but I suspect any modifications to a preexisting material will get uploaded, but new materials won't be able to be batched via the SRP Batcher? If they are, I don't really understand why MaterialPropertyBlocks aren't supported.


    On whether texture arrays help SRP batching: not directly, no. They're beneficial for instancing as they're a "single texture object" that just happens to have multiple layers. GPUs don't support arrays of textures, so it's a workaround for that. You can pass the index as an instanced property so you can visually have "multiple textures" on your instanced objects. It's also handy for things like terrain or other many-textured things to get around the sampler limits.
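
    A minimal sketch of passing the array index as an instanced property, again for the built-in render pipeline. The shader name and the `_MainTexArray` and `_Slice` properties are hypothetical; the macros are the ones `UnityCG.cginc` provides for texture arrays and instancing.

```hlsl
Shader "Custom/InstancedTextureArray" // hypothetical name
{
    Properties
    {
        _MainTexArray ("Texture Array", 2DArray) = "" {}
        _Slice ("Slice Index", Float) = 0
    }
    SubShader
    {
        Tags { "RenderType" = "Opaque" }
        Pass
        {
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag
            #pragma multi_compile_instancing
            // Texture arrays are not available on every platform.
            #pragma require 2darray
            #include "UnityCG.cginc"

            struct appdata
            {
                float4 vertex : POSITION;
                float2 uv : TEXCOORD0;
                UNITY_VERTEX_INPUT_INSTANCE_ID
            };

            struct v2f
            {
                float4 vertex : SV_POSITION;
                float2 uv : TEXCOORD0;
                UNITY_VERTEX_INPUT_INSTANCE_ID
            };

            // One texture object with many layers.
            UNITY_DECLARE_TEX2DARRAY(_MainTexArray);

            // Each instance picks its own layer.
            UNITY_INSTANCING_BUFFER_START(Props)
                UNITY_DEFINE_INSTANCED_PROP(float, _Slice)
            UNITY_INSTANCING_BUFFER_END(Props)

            v2f vert(appdata v)
            {
                v2f o;
                UNITY_SETUP_INSTANCE_ID(v);
                UNITY_TRANSFER_INSTANCE_ID(v, o);
                o.vertex = UnityObjectToClipPos(v.vertex);
                o.uv = v.uv;
                return o;
            }

            fixed4 frag(v2f i) : SV_Target
            {
                UNITY_SETUP_INSTANCE_ID(i);
                float slice = UNITY_ACCESS_INSTANCED_PROP(Props, _Slice);
                // The third coordinate selects the layer of the array.
                return UNITY_SAMPLE_TEX2DARRAY(_MainTexArray, float3(i.uv, slice));
            }
            ENDCG
        }
    }
}
```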
     
  10. Dragnipurake97

    Dragnipurake97

    Joined:
    Sep 28, 2019
    Posts:
    40
    I would have thought it was runtime (to prevent occlusion culling issues and what-have-you) but looks like you're right it's build time. Per object data (light indices and reflection probes) breaks my static batches (in editor anyway) though so not sure if it is stored per-vertex?
     
  11. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    Which is why each renderer component is still drawn individually as a range of the static batch's triangles, and their original bounds are retained for culling, but that renderer's original mesh no longer necessarily exists.

    The key phrase from my explanation is "can be". Generally Unity does not. Usually this would be done for user defined data, like vertex colors or per object modified UVs (either as UVs or other arbitrary data). Unity does modify the UVs of static batches for lightmaps, but nothing else. But stuff like light indices & reflection probes are either way too much data, or just completely not possible (you can't assign textures to vertex data for example, so I'm not sure how you'd store "reflection probes" per vertex in a platform agnostic way).
     
  12. techmage

    techmage

    Joined:
    Oct 31, 2009
    Posts:
    2,133
    Thanks for that explanation.
     
  13. beatdesign

    beatdesign

    Joined:
    Apr 3, 2015
    Posts:
    137
    Thank you for the explanation!
    This clarifies a lot, but it also creates a bit of confusion from my point of view.

    What does "at startup" mean? Here is my guess; please give me a hint on this:
    * If in my scene there are 6 different cars and in my camera frustum there are only 3 of them, when I hit play are the 3 visible car meshes sent to GPU memory? When the cars exit the camera frustum, are their meshes removed from GPU memory?
    * Is this process of uploading the mesh to GPU memory unlinked from the CBuffer construction? I mean: FIRST I upload the CarA, CarB and CarC meshes into GPU memory with indices 0, 1, and 2, and then when the CPU needs to render CarA it prepares a CBuffer with "shader, material properties, Draw(0)" -> performs a DrawCall. Is this right?

    Please tell me if this is correct:
    Take the example of 2 spheres of 200 vertices each. They don't move. They are the same prefab.
    * If I use dynamic batching, I upload 1 mesh of 400 vertices in world space to VRAM every frame, and perform one single DrawCall(). Mesh memory in the profiler is unchanged.
    * If I use static batching, I upload 1 mesh of 400 vertices in world space to VRAM "at startup" (when?), and perform one single DrawCall(). Mesh memory in the profiler is doubled (we created a mesh with double the vertex count).

    Another question: both dynamic and static batching upload 400 vertices to VRAM... why does static batching "use a lot of memory" while dynamic batching doesn't?


    Any hint about this?
     
    Last edited: Dec 16, 2020
  14. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    On what "at startup" means: I’m not actually sure if meshes are uploaded to the GPU immediately when loading a scene / asset, or on demand (like textures and shaders are) on first use. Most likely it’s on demand like everything else. Just usually mesh data isn’t all that much compared to the textures so it’s hard to notice. I’ve never tried to figure out when it happens since it’s literally never been an issue. Usually shader initialization and texture loading are the bottlenecks in this regard.

    No, meshes aren’t removed when objects leave the frustum; they’ll stay in GPU memory until the asset is unloaded by the CPU. Like by loading a new level non-additively or unloading a scene and the asset references are eventually cleaned up, or directly calling unload on an asset. Until then, it’s probably sitting in GPU memory. Technically after even that there’s a possibility it’s still sitting in GPU memory as the drivers might not flush it until they need to.

    Honestly, no idea what order Unity’s internals go about uploading data in. I mean, technically the mesh data and cbuffers are always “unlinked”, but that doesn’t mean Unity won’t upload them at the same time. Again, it’s minutiae that has never really been a concern for me since other things are usually more impactful.

    Two dynamically batched spheres will use more memory than two statically batched spheres.

    The difference between a dynamically batched mesh and a statically batched mesh is a dynamically batched mesh has to keep around both the original unbatched mesh(es) and the batched mesh and upload it to the GPU every time it changes (which is every frame). A statically batched mesh is batched once when pressing play in the editor, or at build time, uploaded once, and never changed. On the GPU for both cases it’s now a single 400 vertex mesh of two spheres that’s being rendered. Though it’s highly likely the 200 vertex mesh is also uploaded to the GPU as there are a ton of cases where dynamic batching can fall back to rendering objects individually again.

    However static batching is certainly known for using “more memory” than dynamic batching. There are a few reasons for this. One is because static batching is done at build time, it can massively increase the size of builds since it’s creating unique vertices for each static mesh, and there’s no limit. If you have a voxel style game with 1000 of a single 24 vertex cube mesh for the entire game, your build is holding a mesh with 24000 vertices. Build sizes are usually much more obvious to people even if it may not actually have anything to do with actual CPU / GPU memory usage.

    Conversely if you’re dynamically batching that single cube, the build will only have that single cube, but at runtime it’ll be generating and uploading potentially multiple unique meshes totaling as much as 24000 vertices every frame. Or potentially much less if most of them are being culled. Or none at all if Unity decides to not batch them at all which it can do for multiple reasons.
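
    As a rough worked version of the cube numbers above, assuming 32 bytes per vertex (float3 position, float3 normal, float2 UV; the stride is an assumption, real vertex layouts vary) and ignoring index data:

```latex
\begin{aligned}
\text{static batch vertices} &= 1000 \times 24 = 24000 \\
\text{static batch vertex data} &\approx 24000 \times 32\ \text{B} = 768000\ \text{B} = 750\ \text{KiB} \\
\text{single cube vertex data} &= 24 \times 32\ \text{B} = 768\ \text{B}
\end{aligned}
```

    Under that assumption the statically batched build stores 1000 times the vertex data of the single cube, while the dynamically batched case ships only the 768 B cube and regenerates up to the larger amount each frame.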

    Would have to be answered by someone else. I still don’t use the SRPs for anything real, so I’ve never had the need to answer this question.
     
  15. beatdesign

    beatdesign

    Joined:
    Apr 3, 2015
    Posts:
    137
    Thank you very much for your explanation.

    I have a scene with 1K copies of one mesh that has 100 vertices.
    • Using SRP OFF / Dynamic batching ON => I got Profiler/Memory/Mesh Memory = 500KB
    • Using SRP OFF / Dynamic batching OFF / Static Batching ON => I got Profiler/Memory/Mesh Memory = 16MB
    * From your explanation, I perfectly understand the static batching Mesh memory value. But if dynamic batching is doing essentially the same thing as static batching (the difference being that dynamic batching merges all the meshes into one every frame, while static batching merges them once), why doesn't the Profiler show this large Mesh memory allocation in the dynamic batching case?

    Also, another question:
    * I see that with SRP batching ON, neither dynamic batching nor static batching is performed; SRP batching is performed instead. I know that I can't use GPU instancing together with SRP batching (if I want to use GPU instancing I have to turn off SRP batching), but why does SRP batching also take precedence over dynamic and static batching?

    Thank you
     
  16. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    What about total memory? I suspect the dynamic batching mesh data just isn't being counted in that section.
     
  17. cassius

    cassius

    Joined:
    Aug 5, 2012
    Posts:
    125
    Wow, great thread! Very informative.

    @beatdesign were all 1000 objects in camera view?

    My experience so far is that SRP is running much faster than Static batching in my case. Although it looks like reflection probes may be breaking it a bit (which is how I ended up here in this thread).
     
  18. beatdesign

    beatdesign

    Joined:
    Apr 3, 2015
    Posts:
    137
    Please help me to interpret the Profiler logs in the attached images. It is the same scene: 1000 objects of 100 vertices each, all in camera view. As you can see:
    • Static batching: Meshes 16.3MB
    • Dynamic batching: Meshes 452.KB
    If I look at the other memory values (Used Total/Unity/Mono), none of them is bigger with dynamic batching than with static batching... all of them are lower with dynamic batching.
     

  19. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    The rest of the stats don't quite match up between those two moments in the profiler (textures, materials, game objects, etc.), and I have no direct insight into how dynamic batching is done internally. I honestly suspect this is a bug / quirk of the profiler, as dynamic batching is likely done in Unity's native C++ code just before rendering and isn't being accurately represented in the profiler. Unless I'm missing something major in my understanding of the two systems, there's no reason why dynamic batching should use less total memory in a perfect like for like comparison between static and dynamic batching.

    My only guess is that because it is done in C++, the dynamically generated mesh data is created, pushed to the GPU, and immediately flushed, so it never shows up in the memory stats since it's not around by the next frame.
     
  20. MartinTilo

    MartinTilo

    Unity Technologies

    Joined:
    Aug 16, 2017
    Posts:
    2,456
    Batching stats in the stats view and the rendering profiler are off when using SRPs on Unity versions pre 2020.2. In 2020.2 we added the ProfilerCounter API for the SRPs to report these stats, which were previously generated in the native rendering code for the built-in render pipeline and the SRPs had no public C# API to report these to.

    (Note, the documentation for ProfilerCounter correctly states its compatibility with 2020.1 since the API package builds on low level API that was shipped early to support the package. The Profiler Window in 2020.1 however doesn't yet know how to display these counters, either as a new Custom Module or as override for internal Rendering Counters. These changes only landed in 2020.2. Similarly the ProfilerRecorder API which can also consume these counters only landed in 2020.2. Lastly, all of these changes are sadly too complex and risky to backport, so the Rendering Profiler Module and Stats view will have to remain broken with SRPs on earlier versions.)
     
  21. MartinTilo

    MartinTilo

    Unity Technologies

    Joined:
    Aug 16, 2017
    Posts:
    2,456
    As for the memory discrepancy, I suspect either @bgolus might be onto something or it might be a bug in how shared mesh memory was counted. I'll double check if it could be the latter with the colleague who fixed that.

    (EOY Vacations means the answer for this might have to wait until next year ;) )
     
    Last edited: Dec 29, 2020
  22. Louis-N-D

    Louis-N-D

    Joined:
    Apr 17, 2013
    Posts:
    224
    My understanding of why reflection probes break Static Batching is that, say you have a bunch of level geo that's all set to batch together, but then you have, say, a couple Reflection Probes... Well, TECHNICALLY, you've just gone and created two material instances, one version of the material with one cubemap (from the first probe) and another with a different cubemap (from the second probe)... Now, it's been a WHILE since I tested this myself, so I don't remember if the result is 2 batches or if Unity just throws its arms up and says, "nope!! I can't even!!" and doesn't batch the objects at all.
     
  23. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    Depends on the reflection probe setup. You can have multiple objects that are affected by the same reflection probe that can still get batched. Just like objects affected by point lights will batch together if the exact same point light(s) are affecting them.
     
  24. Noisecrime

    Noisecrime

    Joined:
    Apr 7, 2010
    Posts:
    2,054
    Are you sure that GPU instancing isn't enabled on the material for the objects? Instancing will override/replace Dynamic Batching so you'd need to check in the FrameDebugger to see which system is actually being used.
    • Instanced Event - 'Draw Mesh (instanced) <gameObject name>'
    • Dynamic Batching - 'Draw Dynamic'
    • Static Batch - 'Static Batch'

    It's also my experience that as of 2019.4.21f1 the stats popup and FrameDebugger can be unreliable in some cases. With regard to the Frame Debugger, though, this may just be due to Unity being non-obvious in how its subsystems work, or because it doesn't have that information at that time, or because further optimizations happen in the 3D API / GPU so it can't explicitly state specific information.

    If drawcalls and state changes are important to you then you should always double check Unity via RenderDoc to get a true understanding of what commands are being sent to the gpu.

    For example Static Batch for multiple gameObjects is sometimes a single draw call and other times is an event that represents multiple drawcalls, though with minimal or no state changes occurring on gpu.

    Static batching has a couple of gotchas too, such as stats values not reporting correctly if the editor is not in play mode. Also for static batching in the editor the gameObjects MUST be enabled at 'play' time for them to be batched. If static gameObjects are disabled at start up then enabled later they will not be batched. I've not tested this from a build so no idea what happens there, but would not be surprised if disabled gameobjects are not added to the static batch.

    Digging deeper into static batching, it's not as simple as a single DrawIndexed per static batch, nor a DrawIndexed per gameObject. Instead it's something in between, based on criteria that Unity have not to my knowledge revealed, and thus the number of DrawIndexed calls can vary depending on what is visible at the time (as well as initial batching by material etc.). I know that sounds like common sense, but what I mean is: deactivate one static gameObject and you might now get two DrawIndexed calls for all the rest or you might still just get one call; deactivate another gameObject and it might go back down to one call. My assumption is that Unity is dynamically generating and binding index buffers at runtime to reduce the number of DrawIndexed calls where possible, with some limits to prevent it taking more processing time than you'd save.
     
  25. ngfilms

    ngfilms

    Joined:
    Nov 18, 2015
    Posts:
    30
    Can I confirm that with SRP (URP), GPU instancing is disabled irrespective of whether you check GPU instancing on the material, i.e. by checking SRP Batcher in the URP asset file?
     
    Last edited: Feb 7, 2022