Official BatchRendererGroup sample: High frame rate even on a budget GLES device

Discussion in 'General Graphics' started by arnaud-carre, Sep 12, 2023.

  1. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    Last edited: Oct 2, 2023
    Rewaken, echu33, DylanF and 7 others like this.
  2. JesOb

    JesOb

    Joined:
    Sep 3, 2012
    Posts:
    1,109
    Hi

    You wrote about Vulkan, Metal, DX and GLES 3.0, but what about GLES 3.1-3.2? Does it support SSBO instead of UBO?
     
  3. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    BatchRendererGroup on GLES always uses UBO, whether on 3.0 or any later version.
     
    tachen and JesOb like this.
  4. murilomsq

    murilomsq

    Joined:
    Aug 9, 2023
    Posts:
    10
  5. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    SRP Batcher and BatchRendererGroup are two different things. The former is a generic and automatic option trying to minimize the amount of GPU setup between the draw calls. You don't have to worry about anything, but it's not magic: if you have 1000 cubes to render, you will end up with 1000 drawcalls.

    BatchRendererGroup is a low-level API for issuing explicit GPU instanced draws. It requires more effort (you have to manage GPU memory yourself and also generate the draw commands). In dedicated situations such as the shooter sample, where you have plenty of similar objects to render with some custom properties per object, BRG is far cheaper on the CPU than SRP Batcher.

    BatchRendererGroup is a new tool for dedicated situations. It's not a replacement for SRP Batcher. In the shooter sample, SRP Batcher is used to render the main ship, enemy spheres and missiles. BRG is used to render the huge amount of CPU animated cubes.


    Like SRP Batcher, BatchRendererGroup requires SRP, so it's not compatible with BiRP.
     
    murilomsq likes this.
  6. optimise

    optimise

    Joined:
    Jan 22, 2014
    Posts:
    2,129
    Hi. Any plan to port the project to Entities Graphics to battle-test Entities Graphics performance? It seems like Entities Graphics can't achieve the same high performance as the fully hand-written BRG code that you show.
     
    Last edited: Oct 4, 2023
  7. mgear

    mgear

    Joined:
    Aug 3, 2010
    Posts:
    9,486
    nice to see ready to use examples!

    any way to optimize this part? (just increased background object count to test)
    upload_2023-10-4_16-56-25.png

    *ah ok, can get rid of that, if i dont need to move the objects.
    with 16 million static quads, its then here
    upload_2023-10-4_17-21-46.png
     
    Last edited: Oct 4, 2023
    DevDunk likes this.
  8. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    entities.graphics is also driving BatchRendererGroup.

    As entities.graphics can render any generic ECS scene, it has to do a lot more preparation work each frame than this BRG shooter demo. For example:

    - react to any ECS chunk change
    - manage (alloc/free) a GPU memory pool if anything changes (entity spawn, component change)
    - do frustum culling for all entities
    - gather all visible objects and bin them into several batches (depending on shader, material, or some feature flag like transparency)
    - and finally, generate BatchRendererGroup draw commands

    entities.graphics is expected to take more CPU time than driving BRG directly. In exchange, it's more versatile and generic.
     
    optimise and JesOb like this.
  9. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    This part is the GfxBuffer.SetData. Basically it's the data upload from system memory to GPU memory.

    As you noticed, if your data doesn't change (for example, when everything is static), you can skip the upload and run even faster.
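
    A minimal sketch of that idea, assuming a CPU-side copy of the instance data and a "dirty" flag so that SetData only runs on frames where something changed. The class name `InstanceDataUploader` and its members are hypothetical and not taken from the sample; the raw buffer target and 4-byte stride are also assumptions (on GLES the buffer would need the constant-buffer target instead).

    ```csharp
    using System;
    using Unity.Collections;
    using UnityEngine;

    // Hypothetical helper (not part of the sample): keeps a CPU copy of the
    // instance data and uploads it only on frames where it actually changed.
    public sealed class InstanceDataUploader : IDisposable
    {
        private readonly GraphicsBuffer _gpuBuffer;
        private NativeArray<Vector4> _cpuData;
        private bool _dirty = true; // force one upload after creation

        public GraphicsBuffer GpuBuffer => _gpuBuffer;

        public InstanceDataUploader(int vector4Count)
        {
            // Raw buffer sized in 4-byte words: 4 words per Vector4 (assumed layout).
            _gpuBuffer = new GraphicsBuffer(GraphicsBuffer.Target.Raw, vector4Count * 4, sizeof(float));
            _cpuData = new NativeArray<Vector4>(vector4Count, Allocator.Persistent);
        }

        // Record a change on the CPU side and remember that an upload is needed.
        public void Set(int index, Vector4 value)
        {
            _cpuData[index] = value;
            _dirty = true;
        }

        // Call once per frame before rendering; static data is never re-uploaded.
        public void UploadIfNeeded()
        {
            if (!_dirty) return;
            _gpuBuffer.SetData(_cpuData);
            _dirty = false;
        }

        public void Dispose()
        {
            _gpuBuffer.Dispose();
            if (_cpuData.IsCreated) _cpuData.Dispose();
        }
    }
    ```

    With fully static content, `UploadIfNeeded` does the upload once and becomes a no-op on every later frame.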
     
    anshuman24816 likes this.
  10. joshcamas

    joshcamas

    Joined:
    Jun 16, 2017
    Posts:
    1,279
    Does BRG support lighting per-instance? I know in the past it did not.

    Great article, by the way! Love deep dives like this.
     
  11. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    You mean having several point lights in a scene (dynamic or static) affecting each instance?

    Regarding dynamic lighting, we only support HDRP in deferred mode, or URP in forward+ mode.

    Regarding static lighting (GI), both lightmaps and SH are supported. Everything is handled for you if you're using the entities.graphics package. But if you want to drive BRG directly, you need to manage some data yourself. For lightmaps, you have to handle a buffer of lightmap index and scale per instance. For SH, you can handle a buffer of SHCoefficients per instance.

    You can also have a look at entities.graphics package source code. It contains a lot of interesting code to drive BRG (including lightmaps and SH)
     
    joshcamas and DevDunk like this.
  12. joshcamas

    joshcamas

    Joined:
    Jun 16, 2017
    Posts:
    1,279
    Fantastic! I know BRG is meant to be super low level so it's not meant to have all features, but glad to hear dynamic lighting is included. Very excited to try this out!
     
  13. optimise

    optimise

    Joined:
    Jan 22, 2014
    Posts:
    2,129
    I understand entities.graphics is a much more versatile and generic solution, but it still shouldn't take as long as it does: over 2 ms on mobile platforms is not acceptable. As far as I can tell, entities.graphics currently takes so much CPU time because it's not using Burst-compiled ISystem. I understand a couple of graphics APIs are not Burst compatible yet, but I think Unity could split the systems up: everything that is Burst compatible goes into a Burst-compiled ISystem, and the managed parts that can't be Bursted stay in SystemBase. I believe doing so would reduce the time cost significantly today. Unity should still work on making those graphics APIs Burst compatible so that entities.graphics can fully adopt Burst ISystem. Currently entities.graphics seems to have zero Burst ISystem adoption.

    Another huge performance issue is main-thread stalling: the low-level graphics API implementations (i.e. Vulkan, OpenGL ES 3) still run on the main thread, which leaves entities.graphics systems stuck on the main thread and significantly slows down other modules such as DOTS physics. Unity will need to move all the low-level graphics APIs supported by entities.graphics off the main thread. I hope Unity can start working on these two huge tasks ASAP and ship them soon.
     
    Last edited: Oct 6, 2023
  14. IIporpammep

    IIporpammep

    Joined:
    Aug 16, 2015
    Posts:
    39
    Thank you for the article! You've compared BRG to Graphics.DrawMeshInstanced, but what about Graphics.DrawMeshInstancedIndirect?

    Currently I'm using Graphics.DrawMeshInstancedIndirect like this:
    Code (CSharp):
    // Per-instance properties.
    public struct InstancedMeshProperties
    {
        public float4x4 ObjectToWorld;
        public float4x4 WorldToObject;
        public Vector4 Color;

        public InstancedMeshProperties(float4x4 trs, Color32 color)
        {
            ObjectToWorld = trs;
            WorldToObject = math.inverse(trs);
            Color = new Vector4(color.r, color.g, color.b, color.a) / 255f;
        }

        // Size in bytes: two 4x4 float matrices plus one float4 color.
        public static int Size()
        {
            return sizeof(float) * 4 * 4 + sizeof(float) * 4 * 4 + sizeof(float) * 4;
        }
    }

    // Job that writes the visible instances' data to the GPU.
    [BurstCompile]
    public struct WriteToGPU : IJob
    {
        [ReadOnly] public NativeList<Instance> Instances;
        [ReadOnly] public NativeArray<InstancedMeshProperties> MeshProperties;

        [WriteOnly] public NativeArray<InstancedMeshProperties> GPUBuffer;

        public void Execute()
        {
            for (int i = 0; i < Instances.Length; i++)
            {
                GPUBuffer[i] = MeshProperties[Instances[i].ID];
            }
        }
    }

    // Create a buffer for the maximum number of visible instances.
    _buffer = new GraphicsBuffer(GraphicsBuffer.Target.Structured,
        GraphicsBuffer.UsageFlags.LockBufferForWrite, _maxInstancesVisibleInRuntime,
        InstancedMeshProperties.Size());

    // Write to the GPU with the LockBufferForWrite mechanism to avoid SetData.
    new WriteToGPU()
    {
        Instances = instances,
        MeshProperties = _instancedMeshProperties,
        GPUBuffer = _buffer.LockBufferForWrite<InstancedMeshProperties>(0, _maxInstancesVisibleInRuntime)
    }.Schedule(culling);

    // Unlock the buffer (once the write job has completed) and draw.
    _buffer.UnlockBufferAfterWrite<InstancedMeshProperties>(instancesCount);
    Graphics.DrawMeshInstancedIndirect(_mesh, 0, _material, Bounds, _argsBuffer, camera: cameraValue,
        castShadows: ShadowCastingMode.Off, lightProbeUsage: LightProbeUsage.Off);

    // In the shader, update matrices and color.
    void vertInstancingSetup()
    {
        #ifndef SHADERGRAPH_PREVIEW
        #if UNITY_ANY_INSTANCING_ENABLED
        unity_ObjectToWorld = mul(unity_ObjectToWorld, _Properties[unity_InstanceID].ObjectToWorld);
        unity_WorldToObject = mul(unity_WorldToObject, _Properties[unity_InstanceID].WorldToObject);
        #endif
        #endif
    }

    void GetInstancedColor_float(out half4 result)
    {
        result = half4(0,0,0,0);
        #ifndef SHADERGRAPH_PREVIEW
        #if UNITY_ANY_INSTANCING_ENABLED
        result = _Properties[unity_InstanceID].Color;
        #endif
        #else
        result = half4(1,1,1,1);
        #endif
    }

    For each visible instance I'm writing a lot of bytes to the GPU (the InstancedMeshProperties struct). If I change the previous code to use two buffers as described in the article (a persistent buffer<InstancedMeshProperties> with data for all instances, and another buffer<int> with the IDs of visible instances that I'll update with the LockBufferForWrite mechanism), will this be slower than BRG? Am I right that if I set the data once for this persistent buffer<InstancedMeshProperties>, it will stay in GPU memory and I'll also have GPU persistency like in BRG?

    When a scene is rendered using SRP batched rendering for some objects and BRG rendering for others, am I right that transparent objects in BRG batches can only be rendered before or after the SRP batched transparent objects, and that we can't sort them together in correct back-to-front order?

    Can we use the LockBufferForWrite mechanism with BRG?

    Also, your example uses a single GraphicsBuffer, but don't we need a ring buffer (an array of GraphicsBuffers where every frame we write data to the next buffer) to prevent the CPU from writing to a buffer that is currently used by the GPU for rendering? I'm doing this for Graphics.DrawMeshInstancedIndirect; in the code shown I just omitted the ring of buffers for clarity.

    Does BRG support dynamic (additional) lights in URP's Forward path, or only in URP's Forward+?
     
    Last edited: Oct 9, 2023
    Shikoq and wellmor like this.
  15. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    Using Graphics.DrawMeshInstancedIndirect should be faster than DrawMeshInstanced because it's up to you to provide the per-instance data (so Unity doesn't have to allocate and upload a buffer to copy matrices and any custom MPB data per draw call).
    Btw, both DrawMeshInstancedIndirect and DrawMeshInstanced require you to write your own shader code to fetch your custom data (so if you want to use URP/Lit or HDRP/Lit, you have to fork URP or HDRP and modify the shader).

    I would say it should run at a speed quite similar to BRG (because it will basically do the same amount of work). But as you have to write your own shader code, you also have to implement any additional feature you need (like any of the flags in https://docs.unity3d.com/ScriptReference/Rendering.BatchFilterSettings.html).

    yes

    When using BRG with transparent, sorted instances, it's up to you to generate one DrawCommand per instance. All these single draw commands will be injected into the frame's renderer list, like standard SRP Batcher objects. So they should all be properly distance sorted (both BRG instances and standard SRP Batcher objects).

    Yes, BRG uses a GraphicsBuffer, however you update the data (using SetData or LockBufferForWrite).

    We use a single buffer because we're using SetData to update its content. SetData properly handles the GPU buffer lifetime and guarantees you won't have issues with the GPU currently processing the buffer (if the buffer is already in flight, we just make a copy of the data and push a GPU "data copy" command into the command buffer).

    Using LockBufferForWrite is very low level, and you should be aware it can be tricky. For example, you need to handle your own ring buffer, as you said. You also have to be sure to write aligned data to avoid any CPU slowdown when writing to GPU memory.
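
    As a rough sketch of what "handle your own ring buffer" can look like, assuming one GraphicsBuffer per frame in flight: the class name `InstanceRingBuffer` is hypothetical, and the ring size of 3 is an assumption, since the safe number of buffers depends on how many frames the platform can keep in flight.

    ```csharp
    using System;
    using Unity.Collections;
    using UnityEngine;

    // Hypothetical ring of GraphicsBuffers, so the CPU never writes into a
    // buffer the GPU may still be reading from a previous frame.
    public sealed class InstanceRingBuffer : IDisposable
    {
        private const int RingSize = 3; // assumption: at most 3 frames in flight

        private readonly GraphicsBuffer[] _buffers = new GraphicsBuffer[RingSize];
        private readonly int _capacity;
        private int _current;

        public InstanceRingBuffer(int capacity)
        {
            _capacity = capacity;
            for (int i = 0; i < RingSize; i++)
            {
                _buffers[i] = new GraphicsBuffer(
                    GraphicsBuffer.Target.Structured,
                    GraphicsBuffer.UsageFlags.LockBufferForWrite,
                    capacity,
                    sizeof(float) * 4); // one float4 per element
            }
        }

        // Call once per frame: moves to the next buffer and maps it for writing.
        public NativeArray<Vector4> BeginWrite()
        {
            _current = (_current + 1) % RingSize;
            return _buffers[_current].LockBufferForWrite<Vector4>(0, _capacity);
        }

        // Unmaps the buffer and returns it so it can be bound for this frame's draw.
        public GraphicsBuffer EndWrite(int countWritten)
        {
            _buffers[_current].UnlockBufferAfterWrite<Vector4>(countWritten);
            return _buffers[_current];
        }

        public void Dispose()
        {
            foreach (GraphicsBuffer buffer in _buffers) buffer?.Dispose();
        }
    }
    ```

    Each frame would call `BeginWrite`, fill the returned array (directly or from a job that is completed before unlocking), then call `EndWrite` and bind the returned buffer to the material.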

    BRG only supports Forward+ for dynamic lights.
     
    tachen, ekakiya, DylanF and 2 others like this.
  16. IIporpammep

    IIporpammep

    Joined:
    Aug 16, 2015
    Posts:
    39
    Thank you for your answers!

    Could you please elaborate on aligned data? Does it mean my struct needs to consist of float4 (or float2 + float2, for example) so that everything works correctly on different platforms?
     
  17. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    The memory returned by LockBufferForWrite will often be write-combined memory, so be sure to write to this buffer linearly (and, obviously, never read from it). Do not just write bytes here and there; it will slow things down a lot. Long story short, always write contiguous data into such mapped memory (without holes). If you want more fine-grained details, there is an old but really good read about easy mistakes to avoid when dealing with write-combined memory: https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/
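
    A small sketch of that write pattern, assuming `mapped` is the NativeArray returned by LockBufferForWrite and `source` lives in regular system memory. The helper name `MappedBufferWriter` is hypothetical.

    ```csharp
    using Unity.Collections;
    using UnityEngine;

    // Hypothetical helper illustrating a write pattern that suits
    // write-combined memory: sequential, whole-element, write-only.
    public static class MappedBufferWriter
    {
        // Copies the visible elements into the mapped buffer and returns
        // the number of elements written (to pass to UnlockBufferAfterWrite).
        public static int WriteVisible(
            NativeArray<Vector4> source,     // regular system memory, random reads are fine
            NativeArray<int> visibleIds,     // indices into 'source'
            NativeArray<Vector4> mapped)     // assumed to come from LockBufferForWrite
        {
            int count = Mathf.Min(visibleIds.Length, mapped.Length);
            for (int i = 0; i < count; i++)
            {
                // Destination index only ever increases, each element is written
                // once and in full, and 'mapped' is never read back.
                mapped[i] = source[visibleIds[i]];
            }
            return count;
        }
    }
    ```

    The random access happens on the source side only; the mapped destination is filled front to back without gaps.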
     
    IIporpammep likes this.
  18. IIporpammep

    IIporpammep

    Joined:
    Aug 16, 2015
    Posts:
    39
    Thank you!
     
  19. LYHyper

    LYHyper

    Joined:
    Oct 11, 2023
    Posts:
    11
    Currently, I'm using BatchRendererGroup to render a scene of cubes in HDRP using RenderBRG.cs from the Unity Graphics repo, and I get the warning "Internal: JobTempAlloc has allocations that are more than the maximum lifespan of 4 frames old - this is not allowed and likely a leak". It looks like the native BatchRendererGroup releases unmanaged memory too late. Is that right?

    EDIT: It seems to be a Unity 2022.3.3f1 bug; after upgrading to 2022.3.11f1, the warning disappeared.
     
    arnaud-carre likes this.
  20. Pr0x1d

    Pr0x1d

    Joined:
    Mar 29, 2014
    Posts:
    46
    Hi, can we use BRG to render directly to a RenderTexture?
    Currently there is no instancing function for low-level rendering to RTs. Command buffers do not work for this and need a Camera to work, as well as having quite a chunky initial overhead. Currently I have to rely on Graphics.DrawMeshNow to draw as fast as possible, but it has the downside of passing data to the GPU over and over again, which limits its speed at higher instance counts.
     
  21. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    You can for instance create a rendertexture attached to a camera. Then, in BRG OnPerformCulling, you can filter and only generate BRG DrawCommands for this specific camera ( using BatchCullingContext.viewID )
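
    A minimal sketch of that filtering, assuming a Unity version where BatchCullingContext exposes viewID and a camera whose targetTexture is already set to the render texture. The component name `RenderTextureOnlyBRG` and the `_rtCamera` field are hypothetical, and the actual draw command generation is left out.

    ```csharp
    using System;
    using Unity.Jobs;
    using UnityEngine;
    using UnityEngine.Rendering;

    // Hypothetical component: only emits BRG draw commands for one camera,
    // which is assumed to render into a RenderTexture.
    public class RenderTextureOnlyBRG : MonoBehaviour
    {
        [SerializeField] private Camera _rtCamera; // assumed: targetTexture assigned

        private BatchRendererGroup _brg;

        private void OnEnable()
        {
            _brg = new BatchRendererGroup(OnPerformCulling, IntPtr.Zero);
        }

        private void OnDisable()
        {
            _brg?.Dispose();
            _brg = null;
        }

        private JobHandle OnPerformCulling(
            BatchRendererGroup rendererGroup,
            BatchCullingContext cullingContext,
            BatchCullingOutput cullingOutput,
            IntPtr userContext)
        {
            // Skip every view (other cameras, lights) except the render texture camera.
            if (cullingContext.viewID.GetInstanceID() != _rtCamera.GetInstanceID())
                return new JobHandle();

            // Fill cullingOutput.drawCommands here (omitted in this sketch),
            // ideally from Burst jobs whose handle is returned instead.
            return new JobHandle();
        }
    }
    ```

    Returning an empty JobHandle without writing any draw commands means this BRG draws nothing for that view.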
     
  22. Shikoq

    Shikoq

    Joined:
    Aug 5, 2023
    Posts:
    12
    Is it true that ByteAddressBuffer is used for the BRG buffer? If so - why? It's supposed to be more performant to use Constant Buffer for uniform access as far as I know.
     
  23. Pr0x1d

    Pr0x1d

    Joined:
    Mar 29, 2014
    Posts:
    46
    That is sadly of no real use here, but thanks for the info.
     
  24. G1NurX

    G1NurX

    Joined:
    Dec 25, 2012
    Posts:
    69
    The platform compatibility section in the official documentation says Unity supports BRG on Android using Vulkan and doesn't mention GLES. The documentation needs to be updated, doesn't it?
     
  25. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    You're right, the doc will be updated. ( BRG supports both Android Vulkan and Android GLES )
     
  26. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    BRG uses ByteAddressBuffer (SSBO) on all platforms except GLES. On modern GPUs, SSBO is almost the same speed as UBO and, more importantly, it's way more flexible (no size limit, and the ability to quickly read 32-bit values instead of float4).

    We used UBO on GLES just because some GLES GPUs/drivers do not support SSBO access from the vertex stage.
     
    Shikoq likes this.
  27. yzhuangaa

    yzhuangaa

    Joined:
    Feb 27, 2019
    Posts:
    22
    Will it support non-instanced draw calls? Actually, I just want a persistent draw command. The current Graphics.RenderMesh or Graphics.DrawMesh, called every frame, takes some time on the main thread and becomes a bottleneck when there are a lot of draw calls.
     
  28. arnaud-carre

    arnaud-carre

    Unity Technologies

    Joined:
    Jun 23, 2016
    Posts:
    97
    If by non-instanced draw call you mean drawing a single instance, then yes, you can generate a draw command for just one instance.
    Regarding the main thread, you can start Burst jobs from OnPerformCulling to generate all the DrawCommands you need, so it won't slow down the main thread.
     
  29. yzhuangaa

    yzhuangaa

    Joined:
    Feb 27, 2019
    Posts:
    22
    Sorry, I didn't make it clear. I am talking about shaders which DO NOT use GPU instancing, just like Graphics.DrawMesh, but without needing to call it every frame. Currently Graphics.DrawMesh/RenderMesh needs to be called every frame. Our project's target platform is OpenGL ES 3.0, which means GPU instancing is not well supported (and I don't actually need GPU instancing in my case; I just want to get rid of the MeshRenderer overhead and reduce the time I spend on the main thread calling Graphics.DrawMesh/RenderMesh every frame).
     
  30. Kasperrr

    Kasperrr

    Joined:
    Feb 23, 2019
    Posts:
    13
    Designing a Render System Based on BRG

    I'm in the process of designing a render system based on BRG, and I'm currently considering dropping the idea of a limited window size, assuming that the `window` is always the size of the whole RawBuffer. I'm confident that my code will never be deployed on low-end devices.

    1. Are there any other considerations or potential issues I should be aware of when opting for a single large buffer?


    2. Would it be possible to implement GPU-based culling? I assume that the instances to draw are calculated internally based on the visibleInstances array set up in the `OnPerformCulling` jobHandle, but would it be possible to move this process to a compute shader? For example, to have access to raw GPU pointers for the visible instances and the draw instance count?
     
    Last edited: Oct 25, 2023
  31. Wriggler

    Wriggler

    Joined:
    Jun 7, 2013
    Posts:
    133
    Just wanted to jump in here and say that this project (and associated blog post) are really great stuff. Thank you very much to @arnaud-carre and team!

    Ben
     
    Selim_B and Shikoq like this.
  32. hugokostic

    hugokostic

    Joined:
    Sep 23, 2017
    Posts:
    84
    Did you mean using the BRG API as a replacement for ctx.cmd.DrawRendererList(rendererList); in a custom render pass, like the Draw Renderers Custom Pass does? (Using a CustomPass we can actually draw a given mesh/layer to an RTHandle that is not the camera's, as long as you have allocated it; it can be either a custom buffer or an RT.)

    For now, I doubt we can actually code it like that, but theoretically it would be possible.
     
  33. Pr0x1d

    Pr0x1d

    Joined:
    Mar 29, 2014
    Posts:
    46
    Almost. I'm not sure about this new method; I haven't tested it yet, as I'm currently working on 2020 LTS. I tried the old cmd.DrawRenderer and cmd.DrawMesh, but they invoke a Bounds update on every object in that command list, which on mobile platforms accounts for about 95% of the performance lost when using CommandBuffers.

    I am using Graphics.DrawMesh to render objects, but I have now switched to combining objects into one mesh at runtime using the Job System so that I only need one call to Graphics.DrawMesh. This only works when the meshes share the same material. As far as I know, mesh data lives on the CPU side and has to be pushed to the GPU every time I call Graphics.DrawMesh. The Graphics calls are still the killer: doing SetPass and calling Graphics.DrawMesh from the main thread is the last thing holding the system back.
    Also, I do not use Renderers for these objects, so there is less overhead from extra renderer components in the game. Currently I simply have one data component with a mesh field.

    Use case for BRG with RTs
    - No need to combine objects yourself to limit Graphics calls.
    - Push mesh data to the GPU only when a transform is updated or when an object is destroyed or streamed in/out. With my current setup I would only push my own culling info, i.e. whether an object is visible or not.
     
  34. ArneMarisTriangleFactory

    ArneMarisTriangleFactory

    Joined:
    Aug 13, 2020
    Posts:
    3
    Hi,
    I just tried the sample on a Quest 2.
    - Installed XR plugin management, using Oculus for Android
    - In OpenGLES3 Graphics API it works fine
    - In Vulkan Graphics API the app is fully black and it throws errors in logcat (seemingly before the app properly boots as I can't attach the Unity profiler / console)

    I tried toggling a whole bunch of Vulkan specific settings but none seem to change the outcome. Anyone experienced a similar issue with Vulkan / Quest 2?

    - Quest 2 development build, Il2CPP
    - Unity editor version 2022.3.10f1
    - Universal RP 14.0.08

    Edit: Also tried in Unity editor version 2022.3.14f1 => Same result
     
    Last edited: Nov 23, 2023
  35. sebas77

    sebas77

    Joined:
    Nov 4, 2011
    Posts:
    1,644
    two questions about BRG:

    Will the API also sort the batches to minimise render state changes?
    This may be ignorance, so forgive me if I'm saying something stupid: is OnPerformCulling called for each light too? How does shadow culling work?