
Question Change Buffer Content every Frame?

Discussion in 'Entity Component System' started by SuperFranTV, Dec 5, 2021.

  1. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    Hello,

    I have a question: how do I update the contents of a ComputeBuffer every frame with the best possible performance?

    I have a world with a fixed maximum size containing millions of quads (as data), and this is saved to a NativeArray. My second question: where is a NativeContainer stored? Is it in main memory or on the CPU? I don't know.

    The NativeArray is created only once and its contents never change afterwards. Allocator.Persistent is used.
    The NativeArray's elements are created (index = new object) in a parallel job.

    Is it better to store such a large amount of data in a normal array outside Burst, or in a NativeArray?
    I only read from it after the array is set up.

    Thank you :)
     
  2. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    The CPU doesn't really store anything other than its caches, which it manages automatically. The NativeArray is stored in RAM.
    You don't need Burst to work with NativeArrays. They might be slightly slower to index outside of Burst, but they also don't produce GC allocations. For that reason, they are usually the preferred choice.

    The fastest way is to use SubUpdate buffers, but these are pretty advanced and you need to work with the async readback API to do it correctly. I would first try using ComputeBuffer.SetData and see if you can get it to work and if it performs well for you.
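    For reference, a minimal sketch of that SetData path might look like the following (the field names buffer, material and "_Matrices" are placeholders, not code from this thread; it assumes a MonoBehaviour that owns the buffer and a NativeArray<uint2> produced by the job):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;
    using UnityEngine;

    public class VoxelUploadSketch : MonoBehaviour
    {
        ComputeBuffer buffer;
        public Material material;
        static readonly int matricesId = Shader.PropertyToID("_Matrices");

        // Upload the job output; reallocates only when the required size grows.
        void UploadMatrices(NativeArray<uint2> matrices)
        {
            if (buffer == null || buffer.count < matrices.Length)
            {
                buffer?.Release();
                buffer = new ComputeBuffer(math.max(matrices.Length, 1), 8); // 8 bytes = sizeof(uint2)
            }
            // SetData accepts a NativeArray directly, so no ToArray() managed copy is needed.
            buffer.SetData(matrices);
            material.SetBuffer(matricesId, buffer);
        }

        void OnDestroy() => buffer?.Release();
    }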
     
  3. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    How much RAM is it using? I can't find it in the Profiler.

    I have a job that builds the current list of objects for the ComputeBuffer, but running it every frame is very slow.

    Code (CSharp):
    [BurstCompile(FloatPrecision.Low, FloatMode.Fast)]
    struct LoadVoxels : IJobFor {

        [ReadOnly]
        public NativeArray<Voxel> worldData;

        [WriteOnly]
        public NativeList<uint2> matrices;

        [ReadOnly]
        public int renderDistance;

        [ReadOnly]
        public int worldSize;

        [ReadOnly]
        public float3 cameraPos;

        [ReadOnly]
        public static readonly int3[] DirectionVector = {
            new int3(0, 1, 0), //Up
            new int3(0, -1, 0), //Down
            new int3(0, 0, -1), //Forward
            new int3(0, 0, 1), //Back
            new int3(-1, 0, 0), //Left
            new int3(1, 0, 0) //Right
        };

        public void Execute (int i) {
            int ni = i / 6;
            Voxel v = worldData[ni];

            if (v.id > 0) {
                int y = ni / (worldSize * worldSize);
                int x = (ni - y * worldSize * worldSize) / worldSize;
                int z = ni - y * worldSize * worldSize - x * worldSize;
                int3 pos = new int3(x, y, z);

                int d = (i % 6);
                int3 tarPos = pos + DirectionVector[d];

                int tarIndex = tarPos.y * worldSize * worldSize + tarPos.x * worldSize + tarPos.z;
                if (tarIndex >= 0 && tarIndex < worldData.Length) {
                    Voxel v2 = worldData[tarIndex];
                    if (v2.id == 0) {
                        if (tarPos.x < renderDistance && tarPos.x > 0 && tarPos.y < renderDistance && tarPos.y > 0 && tarPos.z < renderDistance && tarPos.z > 0) {
                            matrices.Add(new uint2((uint) i, (uint) v.id - 1));
                        }
                    }
                }
            }
        }
    }
    I run it only with .Schedule because the NativeList is converted to an array after the job.

    Would it be useful to run this in a ComputeShader instead?
     
  4. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Profiler timeline view? Also are you profiling with safety checks, job debugger, and leak detection disabled (or better yet, profiling in a development build)?
     
  5. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    I have safety checks, the job debugger and leak detection turned off. I'm profiling in the Editor. Do you have a screenshot of the place you mean?
     
  6. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    That depends on the size of the array you allocated. The NativeArray struct itself has a small, fixed footprint - just a couple of fields - which is copied if you pass it by value rather than through ref/in, but that's not much; of course it does not copy the actual data from place to place, as the NativeArray only references that data through a pointer.
    [Screenshot of the NativeArray<T> internals, showing the m_Buffer pointer field]

    The memory that m_Buffer points to is the actual data, which takes
    sizeof(T) * array length


    Why would he need async GPU readback here, if he's only writing data and doesn't need it back on the CPU side? :)

    Do the same thing with ComputeBuffer.BeginWrite/EndWrite and it will be faster. BeginWrite/EndWrite is currently always faster than SetData, because it needs no additional memory copies and writes directly to GPU memory (and even when the GPU doesn't support direct GPU memory writes and a CPU-side staging buffer is used, it still results in fewer memory copies than SetData). In addition, a Burst job or a direct Burst method call will do the copy faster. Faster still is a double-buffering scheme: keep a preallocated NativeArray of the same size that gets filled in a job, and then do
    BeginWrite -> memcpy from the preallocated array into the array returned by BeginWrite (ideally in a Bursted job/method) -> EndWrite
    (of course, depending on usage, double buffering is not always worth it).
    But yeah, it has its own rules (see the sketch after this list):
    • You must always call in strict Begin-End-Begin-End order. Never two begins or two ends in a row.
    • End must be called before using the buffer on the GPU. (very important, we accidentally had flickering issues when used EndWrite in the next frame and usage on GPU was in between)
    • Begin can only be called again when the buffer is no longer used on the GPU.
    • The memory can only be written by the CPU, but it cannot be read (contents are undefined, could be write combined memory which makes reading super slow).
    • Only buffers created with ComputeBufferMode.SubUpdates are supported
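    A rough sketch of that pattern, assuming a buffer created up front with ComputeBufferMode.SubUpdates (the identifiers here are illustrative, not from this thread):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;
    using UnityEngine;

    public static class SubUpdateSketch
    {
        // The buffer must be created with ComputeBufferMode.SubUpdates for BeginWrite/EndWrite to work.
        public static ComputeBuffer CreateMatricesBuffer(int capacity)
        {
            return new ComputeBuffer(capacity, 8, ComputeBufferType.Structured, ComputeBufferMode.SubUpdates); // 8 bytes = sizeof(uint2)
        }

        // One Begin/End pair per update, in strict order; 'count' elements are copied from a
        // preallocated source array into the write window the buffer hands back.
        public static void Upload(ComputeBuffer buffer, NativeArray<uint2> source, int count)
        {
            NativeArray<uint2> dst = buffer.BeginWrite<uint2>(0, count);
            NativeArray<uint2>.Copy(source, 0, dst, 0, count); // could also be a memcpy inside a Burst job
            buffer.EndWrite<uint2>(count);
        }
    }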

    @SuperFranTV do you create the ComputeBuffer every frame? Or do you have a preallocated CB and only recreate (resize) it when the matrices list is bigger than the currently allocated CB? Which resize strategy do you use?
     
    Opeth001 and Enzi like this.
  7. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    Thank you for the answer, this is awesome, but I can't find good documentation on how to use Begin/EndWrite on a ComputeBuffer.

    I have a NativeArray (whose size depends on the number of objects calculated in the job); this array currently goes to the buffer with SetBuffer.

    @eizenhorn
    Currently I create a ComputeBuffer every frame.

    This is currently my worst-performing update method:

    Code (CSharp):
    [BurstCompile(FloatPrecision.Low, FloatMode.Fast)]
    private void Update () {
        if (distance(oldPos, cam.position) > 1) {
            oldPos = cam.position;

            if (buffer != null) {
                buffer.Release();
            }
            NativeList<uint2> matrices = new NativeList<uint2>(Allocator.TempJob);

            int count = renderDistance * renderDistance * renderDistance;
            loadVoxels = new LoadVoxels { worldData = worldData, matrices = matrices, renderDistance = renderDistance, worldSize = worldSize, cameraPos = cam.position }.Schedule(count * 6, loadVoxels);
            loadVoxels.Complete();

            buffer = new ComputeBuffer(matrices.Length, stride);

            buffer.SetData(matrices.ToArray());
            matrices.Dispose();

            material.SetBuffer(matricesId, buffer);
            bufferCount = buffer.count;
        }
        if (bufferCount > 0) {
            bounds = new Bounds(Vector3.zero, renderDistance * Vector3.one + cam.position);
            Graphics.DrawMeshInstancedProcedural(mesh, 0, material, bounds, bufferCount);
        }
    }
    The job is posted above in my earlier comment.

    I noticed that the FPS drops when I create the ComputeBuffer with the maximum possible size.
    I would like to pack only the visible voxels into a NativeArray, so that neither it nor the ComputeBuffer wastes empty entries.
    I'm really new to all of this.

    How would you run the job I posted above and then transfer the data to the GPU?
     
  8. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    @eizenhorn
    Do you speak German, judging by your name? xD That would make this easier.

    I read about BeginWrite; it needs a ComputeBuffer created with ComputeBufferMode.SubUpdates for compatibility, right?
    My ComputeBuffer should contain a NativeArray<uint2> with a maximum length of ~6,000,000.

    That is because the length is:
    renderDistance = 128 blocks in all 3 directions
    amountOfSides = always 6 sides per block (if all neighbors of the block are empty)

    Length = ((renderDistance * renderDistance * renderDistance) / 2) * amountOfSides
    Why "/ 2"?
    The worst case is a chunk filled so that each solid block has an empty block next to it, so only half the blocks can contribute faces. (For renderDistance = 128 that gives 128³ / 2 * 6 = 6,291,456 entries, hence the ~6,000,000.)

    But a renderDistance of 128 is only a middling number; I want to increase it.

    Edit:
    I think sending each element's data directly to the GPU without a buffer would be nice; I don't need a readback afterwards.
    Is this possible?

    Edit 2:

    I got it working with a buffer that is set up in OnUpdate with the maximum number of elements as its length. Now I need to know how to use BeginWrite, and whether it can be used inside a job, so that I can remove the NativeList that stores the output of my job.
     
    Last edited: Dec 6, 2021
  9. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,131
  10. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    @sngdan

    The problem: I read that BeginWrite can only be used when the GPU doesn't need the buffer, but in my case the DrawMesh method is always called in Update?

    What goes between BeginWrite and EndWrite - is there also a buffer.SetData call?

    Note to myself for later, because I'm not at home:
    ComputeBufferType.Structured, ComputeBufferMode.SubUpdates
     
  11. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    You can read it above, or in the links @sngdan posted (both have my explanations, on the forum and in the comments on Sebastian's blog):
    "Used" means, for example, that the GPU is reading from it. The GPU is not doing that the whole time after you call DrawMeshInstancedProcedural; Unity has CPU-GPU frame synchronisation, which means that in your case it's safe to call BeginWrite and EndWrite before DrawMeshInstancedProcedural.
    In your case the buffer is used by
    Graphics.DrawMeshInstancedProcedural
    , so you should call BeginWrite somewhere before that call, get the NativeArray from BeginWrite, apply all your data in Burst jobs or a direct Burst method call, then call EndWrite, and only after that call Graphics.DrawMeshInstancedProcedural; that will be safe. This is a simple example - it's better to improve it and keep a preallocated native array used for data population, and do the things mentioned above before Graphics.DrawMeshInstancedProcedural (a rough sketch of this ordering follows below).
    The main thing here is smart ahead-of-time preallocation for the arrays and the CB, striking a good balance between wasting too much memory and spending too much time on CPU->GPU data transfer.
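    A sketch of that ordering (assumptions: a single preallocated SubUpdates buffer sized for the worst case, placeholder field names, and the buffer rotation discussed later in the thread omitted):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;
    using UnityEngine;

    public class DrawOrderingSketch : MonoBehaviour
    {
        // Assumed fields: a SubUpdates ComputeBuffer sized for the worst case, plus mesh/material to draw.
        ComputeBuffer buffer;
        public Mesh mesh;
        public Material material;
        static readonly int matricesId = Shader.PropertyToID("_Matrices");

        void DrawVisibleVoxels(NativeList<uint2> matrices, Bounds bounds)
        {
            int count = matrices.Length;
            if (count == 0)
                return;

            // 1. Write the job results into the buffer before the GPU touches it this frame.
            NativeArray<uint2> dst = buffer.BeginWrite<uint2>(0, count);
            NativeArray<uint2>.Copy(matrices.AsArray(), 0, dst, 0, count);
            buffer.EndWrite<uint2>(count);

            // 2. Only after EndWrite is the buffer bound and consumed by the draw call.
            material.SetBuffer(matricesId, buffer);
            Graphics.DrawMeshInstancedProcedural(mesh, 0, material, bounds, count);
        }
    }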
     
    Last edited: Dec 6, 2021
  12. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    Okay, I've checked so far.
    I got 3 more FPS from changing my LoadChunks job.

    The main thing for getting the performance back is that the LoadChunks job runs with .Schedule and not with .ScheduleParallel, because the NativeList adds elements.

    I have not yet understood how NativeList.ParallelWriter works, nor how I can put that list into the ComputeBuffer at the end.

    3 things I can try:

    1. Get ParallelWriter to work.
    2. Create a NativeArray on Start with the maximum number of objects as its length, then use my job to fill the NativeArray (that's no problem so far). The problem: I only want to put the NativeArray elements that are not null (i.e. actually set up) into the ComputeBuffer.
    3. Use a ComputeShader for the calculations.
    - The problem is that I would need to transfer the whole voxel database to the ComputeShader to do the calculations, and that ComputeBuffer would use all my RAM :/

    So 1. or 2. are the good options.
    I wish I could run the whole process on the GPU, which has more threads to get it done faster.



    For the BeginWrite:
    is this correct?

    Code (CSharp):
            buffer.BeginWrite<uint2>(0, matrices.Length);
            buffer.SetData(matrices.ToArray());
            buffer.EndWrite<uint2>(matrices.Length);

            Graphics.DrawMeshInstancedProcedural(mesh, 0, material, bounds, bufferCount);
     
  13. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Because that is the only way to figure out this (otherwise you just guess 3-4 frames):
    I'm well aware. I'm just suggesting taking baby steps here because with SetData, you don't need a ComputeBuffer pooling and rotation mechanism like you do with SubUpdates. I suspect there may be other performance issues that need to be hunted down first.
     
  14. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,131
    No, there is no SetData involved. I don't have Unity installed anymore, but if I have some spare time over Christmas I will try to locate a working example and post it here. I don't think the API has changed since the time I did this… Is there no example here on the forum? Did you check?
     
  15. Tony_Max

    Tony_Max

    Joined:
    Feb 7, 2017
    Posts:
    334
    Sorry for going slightly off topic, but I'm dealing with compute buffers right now too, for instancing purposes. My question: is it possible to apply some kind of instanceID offset, or an offset per compute buffer individually?
    I see a way to do it by passing an extra instanceIDOffset to the shader, but that requires every shader to have this property, so I'm trying to find a way to avoid it.
    I need it because sorting in my case can break a sequence of rendering into multiple draw calls, but I want to use compute buffers as efficiently as possible (there is no real need to reallocate the buffer on the GPU if you only want to render part of the sequence of instances).

    And my second question for all of you: what do you think is a good way to deal with the compute buffer resize problem? When some objects get destroyed you need fewer elements in the compute buffer and can simply not use the extra ones, but if you have extra objects to render, then your current compute buffer must be resized (or the draw calls must be split). The most common solution is probably to reallocate on demand plus preallocate on start, like [InitialBufferCapacity] with DynamicBuffer. What do you think? How do you deal with it?
     
    Last edited: Dec 7, 2021
  16. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    You need some sort of shader variable to do this correctly. BatchRendererGroup was designed to solve this problem engine-side. I think the latest alpha has the new friendlier API.

    It isn't pretty because there's no unmanaged ComputeBuffer handle type. This is the solution I came up with. It could probably be improved. https://github.com/Dreaming381/Kine...ssets/Kinemation/ComputeBufferTrackingPool.cs
     
  17. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    Nope, if you use Begin/EndWrite you do not need to use SetData; they're mutually exclusive.
    Look at what BeginWrite returns to you. In that code, replace SetData with a memcpy from the matrices list into the array returned by BeginWrite.
     
  18. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    Mmm, maybe a language barrier, but I didn't get what you mean here - what is there to figure out, and what is there to guess, if he is only writing in one direction? Can you please rephrase that sentence? :)
     
    Last edited: Dec 6, 2021
  19. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    3,356
    Here is a simple generic writer. It just does whole container updates here.


    Code (csharp):
    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using UnityEngine;

    public class ComputeBufferUtil
    {
        public static ComputeBuffer CreateBuffer<T>(NativeArray<T> data) where T : struct
        {
            int length = data.Length;
            ComputeBuffer buffer = new ComputeBuffer(length, UnsafeUtility.SizeOf<T>(), ComputeBufferType.Default, ComputeBufferMode.SubUpdates);
            NativeArray<T> bufferData = buffer.BeginWrite<T>(0, length);
            NativeArray<T>.Copy(data, 0, bufferData, 0, length);
            buffer.EndWrite<T>(length);
            return buffer;
        }

        public static void SetBuffer<T>(ComputeBuffer buffer, NativeArray<T> data) where T : struct
        {
            int length = data.Length;
            NativeArray<T> bufferData = buffer.BeginWrite<T>(0, length);
            NativeArray<T>.Copy(data, 0, bufferData, 0, length);
            buffer.EndWrite<T>(length);
        }
    }
     
  20. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    The "this" was referring specifically to the following quote block which were your words.

    But to say what I meant explicitly, the only way to figure out when a buffer is no longer used on the GPU is to use async readback. If you are calling Begin/EndWrite every frame, then you need to keep a pool of buffers on rotation so that you don't write to a buffer still being processed from a previous frame in flight. Hybrid Renderer V2 does this, and the code I linked a few posts up also does this.
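    A hypothetical sketch of the simplest form of such a rotation: a fixed ring of SubUpdates buffers (the depth of 4 frames is an assumption, and the async-readback bookkeeping is left out):

    Code (CSharp):
    using UnityEngine;

    public class ComputeBufferRingSketch
    {
        const int FramesInFlight = 4; // assumption: enough latency headroom for this use case
        readonly ComputeBuffer[] ring = new ComputeBuffer[FramesInFlight];
        int frame;

        public ComputeBufferRingSketch(int capacity, int stride)
        {
            for (int i = 0; i < FramesInFlight; i++)
                ring[i] = new ComputeBuffer(capacity, stride, ComputeBufferType.Structured, ComputeBufferMode.SubUpdates);
        }

        // The buffer to BeginWrite/EndWrite and draw with this frame; a buffer is reused
        // only after FramesInFlight frames, so a buffer still in flight is never overwritten.
        public ComputeBuffer Current => ring[frame % FramesInFlight];

        public void AdvanceFrame() => frame++;

        public void Dispose()
        {
            foreach (ComputeBuffer b in ring)
                b.Release();
        }
    }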
     
    eizenhorn likes this.
  21. Tony_Max

    Tony_Max

    Joined:
    Feb 7, 2017
    Posts:
    334
    I'm not the kind of person who complains about every bad thing in my beloved engine, but yesterday I read a thread where someone from UT described the problem of the DrawMeshInstancedProcedural API lacking an ordering parameter, and that thread was created in 2019, so I'm not so sure :)
     
  22. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    Ah, yeah, didn’t notice “:” from phone:)
    Yep, thats valid for circular fully manageable CB pools, my point was like your with baby steps, but start from second step than starting from Set Data first step:), as in his case for simplicity he can go through simplified 4 frames CB pool and it will be enough as starter point, when every CB will be used just once per 4 frame. And then he can try to switch to “smarter” solution with asyncgpureadback :) but yeah I think we’re on the same page now (mostly me now see your point)
     
    DreamingImLatios likes this.
  23. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Yes. This is a problem in DMIP. And this is why you want to make lots of smaller batches that share one big buffer. That's what BatchRendererGroup does.
     
  24. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    So I got it working with BeginWrite, but the performance only changed by 3-4 FPS.

    I need to change the job to run in parallel; that will solve the issue.

    Code (CSharp):
    [BurstCompile(FloatPrecision.Low, FloatMode.Fast)]
    struct LoadVoxels : IJobFor {

        [ReadOnly]
        public NativeArray<Voxel> worldData;

        [WriteOnly]
        public NativeList<uint2> matrices;

        public void Execute (int i) {
            Voxel v = worldData[i / 6];

            if (v.id > 0) {
                int d = (i % 6);
                if (v.GetNeighborIndex(d) == 1) {
                    matrices.Add(new uint2((uint) i, (uint) v.id - 1));
                }
            }
        }
    }

    I got some more performance from compacting the job as much as possible.

    The list is a different size every time, so I can't pre-create a NativeArray. I need to add parallel writing, but if I change the NativeList to a NativeList.ParallelWriter, I can't use its data to fill the ComputeBuffer. Is there a way that I'm missing?
     
  25. Fribur

    Fribur

    Joined:
    Jan 5, 2019
    Posts:
    127
    I cannot quite deduce how you are currently converting your NativeList to a NativeArray. Try
    Code (CSharp):
    matrices.AsArray()
    That would avoid an unnecessary copy (it is just a view over the list's underlying array, no copy).
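    A small sketch of using that view as the source of the copy into the BeginWrite destination (the helper name and parameters are made up):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;
    using UnityEngine;

    public static class ListUploadSketch
    {
        public static void Upload(ComputeBuffer buffer, NativeList<uint2> matrices)
        {
            int count = matrices.Length;
            // AsArray() is just a view over the list's backing memory (no copy),
            // unlike ToArray(), which allocates a new managed array on every call.
            NativeArray<uint2> dst = buffer.BeginWrite<uint2>(0, count);
            NativeArray<uint2>.Copy(matrices.AsArray(), 0, dst, 0, count);
            buffer.EndWrite<uint2>(count);
        }
    }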
     
  26. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,131
    ToArray/AsArray would be obsolete with BeginWrite. I did not try to understand the Execute, but if this is in the hot path it does not look efficient.
     
  27. Tony_Max

    Tony_Max

    Joined:
    Feb 7, 2017
    Posts:
    334
    I've noticed that ComputeBuffer.BeginWrite has computeBufferStartIndex as its first parameter. It looks like I can call BeginWrite multiple times with different offsets.
    But I've also read that these methods should be called strictly one after another ("no two Begins or two Ends in a row"). That looks strange.

    And also, for what purpose do we need to pass the int countWritten parameter when calling EndWrite?
     
  28. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    Yes, you should always call Begin-End-Begin-End.
    The purpose is to get a subset of the buffer's memory to write to, not the whole buffer. You need to fill in
    countWritten for exactly the same reason: if you got a range of 100 items through BeginWrite and conditionally wrote only the first 10 (and you can't predict this count at BeginWrite time), then you EndWrite with that value of 10. The reason, I believe, is that the graphics API/hardware/device does not always allow you to write directly to GPU memory; in that case the whole process does not go straight to GPU memory but through a temporary buffer in CPU memory, and with these ranges we reduce the amount of data copied between CPU and GPU (which is one of the main bottlenecks). With these arguments, instead of copying 100 items back to the GPU when you wrote only the first 10, you copy back only the 10 written elements.
    But if I'm wrong, @SebastianAaltonen or @JussiKnuuttila will definitely correct me :D
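    A small sketch of that partial-write case (the range of 100 and the filter condition are illustrative only):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;
    using UnityEngine;

    public static class PartialWriteSketch
    {
        // Request a window of 100 elements, write only some of them, and report the real count.
        public static void WriteSome(ComputeBuffer buffer, NativeArray<uint2> source)
        {
            NativeArray<uint2> dst = buffer.BeginWrite<uint2>(0, 100);
            int written = 0;
            for (int i = 0; i < source.Length && written < 100; i++)
            {
                if (source[i].y != 0) // arbitrary condition, not known at BeginWrite time
                    dst[written++] = source[i];
            }
            // Only the 'written' elements need to be flushed/copied to the GPU.
            buffer.EndWrite<uint2>(written);
        }
    }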
     
    Tony_Max likes this.
  29. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    I checked Google and many posts, but I can't find information on how I should write the memcpy for the range returned by BeginWrite.

    I found a shorthand for memcpy, but dst should be the ComputeBuffer (the NativeArray from BeginWrite):

    Code (CSharp):
    public unsafe static void MemCpy<SRC, DST> (NativeArray<SRC> src, DST[] dst) where SRC : unmanaged where DST : unmanaged {
        //ASSERTION: sizeof(SRC)<=sizeof(DST)
        fixed (void* arrayPointer = dst) {
            UnsafeUtility.MemCpy(
                arrayPointer,
                NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(src),
                src.Length * (long) UnsafeUtility.SizeOf<SRC>()
            );
        }
    }
    Code (CSharp):
    public static void SetBuffer<T> (ComputeBuffer buffer, NativeArray<T> data) where T : struct {
        int length = data.Length;
        NativeArray<T> bufferData = buffer.BeginWrite<T>(0, length);
        NativeArray<T>.Copy(data, 0, bufferData, 0, length);
        buffer.EndWrite<T>(length);
    }
    So I want to use MemCpy instead of .Copy, but how?

    Edit:

    Code (CSharp):
    public static unsafe void SetBuffer<T> (ComputeBuffer buffer, NativeArray<T> data) where T : struct {
        int length = data.Length;
        NativeArray<T> bufferData = buffer.BeginWrite<T>(0, length);
        UnsafeUtility.MemCpy(NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(bufferData), NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(data), length * (long) UnsafeUtility.SizeOf<T>());
        //NativeArray<T>.Copy(data, 0, bufferData, 0, length);
        buffer.EndWrite<T>(length);
    }
    is this correct?
     
    Last edited: Dec 21, 2021
  30. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    Last edited: Dec 21, 2021
  31. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    The static function NativeArray<>.Copy just calls MemCpy internally, so using that is fine.
     
  32. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    But why do I see such a big performance difference between the two?
     
  33. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    I don't, so I have no idea why you do.
     
  34. SuperFranTV

    SuperFranTV

    Joined:
    Oct 18, 2015
    Posts:
    140
    Tomorrow I will show the difference between the two in frames per second.


    I checked again and it is nearly the same - you are right. That makes things much easier for me.
    Thank you.
     
    Last edited: Dec 22, 2021
  35. WildMaN

    WildMaN

    Joined:
    Jan 24, 2013
    Posts:
    127
    Amazing discussion, folks, though there is an additional dimension to it: mobile GPUs and their shared memory model. I can totally envision a flow like this:
    - iterate through all chunks of entities to be rendered, call SetData with a direct pointer to the chunk's LocalToWorld NativeArray<float4x4>
    - dispatch the DMIP call(s)

    In theory, it shouldn't involve any memory copy whatsoever; the GPU's MMU just remaps the CPU-space NativeArray<float4x4> pointer into the ComputeBuffer's GPU memory space.

    The BeginWrite/EndWrite route explicitly uses at least one memcpy.

    Any thoughts?
     
  36. julian-moschuering

    julian-moschuering

    Joined:
    Apr 15, 2014
    Posts:
    529
    You would still need multiple buffers to let the GPU and CPU run in parallel, which makes using the chunk storage directly unusable. I'm pretty sure the memcpy cost here is negligible.
     
  37. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,131
    c-64
     
  38. WildMaN

    WildMaN

    Joined:
    Jan 24, 2013
    Posts:
    127
    Ok, so I did a test comparing the old way I did rendering with my proposal based on this thread's ideas. The result is mind-blowing.

    The old way:
    [Profiler screenshot of the old approach]

    - allocate NA<float4x4> TRS + NA<float4> data
    - iterate through all entities to be rendered
    * fill TRS with LocalToWorld.Value, multiplied by a scale matrix if necessary
    * fill the data array with some additional data
    - SetData from the TRS and data arrays into the compute buffers
    - dispatch


    The new way:
    [Profiler screenshot of the new approach]

    - allocate several NA<long>
    - iterate through all chunks of entities to be rendered
    * collect the native pointers to each chunk's LocalToWorld, scale and data arrays (cast to long), plus the number of entities in the chunk
    - iterate through all collected pointers, cast them back to NativeArrays and SetData from these cast arrays (a rough sketch of this pointer round-trip follows below)
    - dispatch
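    A rough sketch of that pointer round-trip, under the assumption that the chunk's data pointer and entity count were cached earlier in the frame (the helper and its parameters are hypothetical):

    Code (CSharp):
    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using Unity.Mathematics;
    using UnityEngine;

    public static class ChunkUploadSketch
    {
        // 'ptr' is assumed to be the cached address of a chunk's LocalToWorld array (stored as a long),
        // 'count' the number of entities in that chunk, 'dstOffset' where the slice lands in the buffer.
        public static unsafe void UploadChunk(ComputeBuffer buffer, long ptr, int count, int dstOffset)
        {
            NativeArray<float4x4> view =
                NativeArrayUnsafeUtility.ConvertExistingDataToNativeArray<float4x4>((void*)ptr, count, Allocator.None);
    #if ENABLE_UNITY_COLLECTIONS_CHECKS
            // Required when collection safety checks are enabled.
            NativeArrayUnsafeUtility.SetAtomicSafetyHandle(ref view, AtomicSafetyHandle.GetTempUnsafePtrSliceHandle());
    #endif
            // Upload this chunk's slice into the shared buffer without building an intermediate array.
            buffer.SetData(view, 0, dstOffset, count);
        }
    }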


    The test device is a Samsung A30S, with exactly the same scene in both cases.
    47.5 ms data collection + 5 ms SetData versus 0.28 ms data collection + 3.8 ms SetData.
    And a healthy reduction in memory bandwidth, which is a big deal for mobile.