
Question TransformAccessArray Huge performance drop (2.5x-5x) when gameobjects are root.

Discussion in 'Burst' started by mm_hohu, Nov 3, 2022.

  1. mm_hohu

    mm_hohu

    Joined:
    Jun 4, 2021
    Posts:
    41
    Transforms with parents:
    [Image: GameObjectsWithParent.png]

    Transforms without parents:
    [Image: RootGameObjects.png]

    TransformChangedDispatch():
    [Image: TransformChangedDispatch.png]

    Full code (observed in all Unity versions, 2019-2022):
    Code (CSharp):
    using UnityEngine;
    using System.Collections.Generic;
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;
    using UnityEngine.Jobs;

    public class TransformAccessPerformanceTest : MonoBehaviour
    {
        public Vector2Int counts = new(300, 300);
        private PositionUpdateJob _positionUpdate;
        private TransformAccessArray _transforms;
        private TransformUpdate _transformUpdate;
        private JobHandle _handle;

        private void Start()
        {
            _positionUpdate = new PositionUpdateJob(counts.x, counts.y);
            List<Transform> transforms = new();
            for (var i = 0; i < _positionUpdate.Positions.Length; i++)
            {
                var go = new GameObject();
                // Note: removing this line (so every object is a root) causes a roughly 2.5x-5x performance drop.
                go.transform.SetParent(transform);
                transforms.Add(go.transform);
            }
            // Note: multithreading does not change the performance characteristics
            // (see the commented-out desiredJobCount argument).
            _transforms = new TransformAccessArray(transforms.ToArray() /* , 1 */);
            _transformUpdate = new TransformUpdate(_positionUpdate.Positions);
            // Note: if the game objects are root objects, TransformChangedDispatch (an internal task)
            // runs on "all" job threads after the TransformAccessJob, and that processing is very slow.
        }

        private void Update()
        {
            // Finish last frame's jobs before scheduling new ones.
            _handle.Complete();
            _positionUpdate.DeltaTime = Time.time;
            var positionHandle = _positionUpdate.Schedule();
            _handle = _transformUpdate.Schedule(_transforms, positionHandle);
        }

        private void OnDestroy()
        {
            _handle.Complete();
            _transforms.Dispose();
            _positionUpdate.Dispose();
        }

        // Copies the precomputed positions onto the transforms in parallel.
        [BurstCompile]
        private struct TransformUpdate : IJobParallelForTransform
        {
            [ReadOnly] private NativeArray<Vector3> _positions;
            public TransformUpdate(NativeArray<Vector3> positions) => _positions = positions;
            public void Execute(int index, TransformAccess transform) => transform.position = _positions[index];
        }

        // Single-threaded job that only generates test data (a ripple over an x-by-y grid).
        [BurstCompile]
        private struct PositionUpdateJob : IJob
        {
            public NativeArray<Vector3> Positions => _p.Reinterpret<Vector3>();
            private NativeArray<float3> _p;
            private readonly int _x;
            private readonly int _y;
            public float DeltaTime;

            public PositionUpdateJob(int xCount, int yCount)
            {
                _x = xCount;
                _y = yCount;
                _p = new NativeArray<float3>(_x * _y, Allocator.Persistent);
                DeltaTime = 0f;
            }

            public void Execute()
            {
                var t = DeltaTime * 2f;
                var offs = 0;
                for (var i = 0; i < _x; i++)
                {
                    var x = i - _x * 0.5f + 0.5f;
                    for (var j = 0; j < _y; j++)
                    {
                        var z = j - _y * 0.5f + 0.5f;
                        var y = math.sin(math.sqrt(x * x + z * z) * 0.4f - t);
                        var p = math.float3(x, y, z);
                        _p[offs] = p;
                        offs++;
                    }
                }
            }

            public void Dispose()
            {
                if (_p.IsCreated) _p.Dispose();
            }
        }
    }

    I have not done detailed testing, but when the Transforms are roots, the job is slower, apparently because of extra processing (TransformChangedDispatch?). Is this the expected behavior?
    Theoretically, root Transforms should be the fastest, yet they are not performing well. Please let me know if I am wrong about anything.
     
    Last edited: Nov 3, 2022
  2. Zuntatos

    Zuntatos

    Joined:
    Nov 18, 2012
    Posts:
    612
    I recall a Unity update some years ago where they talked about transform hierarchy performance improvements. The change was that the transforms of children are now batched together into an array, so that things like iterating children got quicker due to memory locality and caching/prefetch effects.
    The same thing is probably happening here: parenting them all to the same object puts them all together in memory, improving performance when limited by memory latency/bandwidth.
     
  3. xVergilx

    xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    Transforms attached to the same root cannot be processed in parallel. That's the major cause,
    and it is visible in the profiler screenshot.

    Don't parent transforms that don't need to be parented.
    It will drain performance whether TransformAccessArray is used or not.
     
  4. Zuntatos

    Wouldn't you want to parent transforms in groups in that case? Enough batches that the work runs in parallel, but few enough that the transforms stay tightly packed.
     
  5. xVergilx

    In general, no. The internal transform system is pretty bad at determining what should be grouped with what, especially if it's a complex hierarchy.

    The best-case scenario is to not parent at all. You'll get the best thread load balancing this way.
    Otherwise you'll spend time figuring out why some random transform job runs too slowly and stalls the rest of the processing.
     
  6. mm_hohu

    I understand what you are saying. But I asked the question because Transforms "without" parents are by far the slowest.
     
  7. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    398
    Try this; I added some comments to help understanding. Also, you can fold your two jobs into one job. Otherwise your parallel job was just copying A to B (and it was running on 1 thread).

    Code (CSharp):
    using UnityEngine;
    using Unity.Burst;
    using Unity.Jobs;
    using Unity.Mathematics;
    using UnityEngine.Jobs;

    public class BurstTransformAccessArray : MonoBehaviour
    {
        // Only 1 worker thread per root transform, so split into 12 root transforms
        private const int RootCount = 12;

        public Vector2Int counts = new(300, 300);
        private TransformJob _transformJob;
        private TransformAccessArray _transformsAA;
        private JobHandle _handle = new JobHandle();

        private void Start()
        {
            _transformJob = new TransformJob(counts.x, counts.y);

            // Fixed array size
            Transform[] transforms = new Transform[counts.x * counts.y];

            Transform[] parent = new Transform[RootCount];
            for (int i = 0; i < RootCount; i++)
            {
                parent[i] = new GameObject().transform;
            }
            var tempGO = new GameObject();

            // Assumes the transform count divides evenly by RootCount (90,000 / 12 here)
            var perRoot = transforms.Length / RootCount;
            for (int i = 0; i < RootCount; i++)
            {
                for (int j = 0; j < perRoot; j++)
                {
                    transforms[i * perRoot + j] = Instantiate(tempGO, parent[i]).transform;
                }
            }

            Destroy(tempGO);

            // Can use the desired job count to limit workers, I assume
            _transformsAA = new TransformAccessArray(transforms /* , RootCount */);
        }

        private void Update()
        {
            //_handle.Complete();
            _transformJob.DeltaTime = Time.time;
            _handle = _transformJob.Schedule(_transformsAA);
        }

        private void LateUpdate()
        {
            // Complete in LateUpdate so the job can overlap with the rest of the frame's Update work
            _handle.Complete();
        }

        private void OnDestroy()
        {
            _handle.Complete();
            _transformsAA.Dispose();
        }

        [BurstCompile]
        private struct TransformJob : IJobParallelForTransform
        {
            private readonly int _x;
            private readonly int _y;
            public float DeltaTime;

            public TransformJob(int xCount, int yCount)
            {
                _x = xCount;
                _y = yCount;
                DeltaTime = 0f;
            }

            public void Execute(int index, TransformAccess ta)
            {
                // Rough, untested translation of the original maths, computed per transform
                var t = -DeltaTime * 2f;
                var x = index % _x * 0.5f + 0.5f;
                var z = index / _x * 0.5f + 0.5f;
                var sq = math.sqrt(x * x + z * z);
                var y = math.sin(sq * 0.4f + t);
                ta.position = math.float3(x, y, z);
            }
        }
    }
     
  8. Trindenberg

    With that I got a 4x-6x speed-up (6-core). I'm not sure about the translation of your maths, though (I didn't test that, just made a rough version of it).
     
  9. xVergilx

    That's the problem with the example.

    You're waiting for a single-threaded job to finish before a different job can process its output in parallel.
    PositionUpdate takes too much time, which pushes the rest of the jobs further along the timeline. That is why it takes longer in total.

    As for the root pseudo-optimization of processing one hierarchy per thread:
    it's not that good of an idea, because you most likely won't be able to use it for game logic.
    And even if you somehow manage to, it won't carry over across platforms, or even devices.

    (The thread count, even when detected, may vary, so the resulting performance would be unpredictable.)

    TL;DR: If you want to sync data or read data from a TAA, make sure to do it in parallel.
    Try running the initial example's PositionUpdate job in parallel without parent nodes and see if it makes a difference. (Though you'd probably want to modify it first.)
     
    Last edited: Nov 10, 2022
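
For reference, the suggestion to run the position generation in parallel could look roughly like the sketch below. It is only an illustration of the idea, not tested code from this thread: `ParallelPositionJob` and its field names are hypothetical, and the batch size in the usage note is an arbitrary guess. The index mapping assumes the same layout as the original nested loops (`index = i * yCount + j`).

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical parallel replacement for the single-threaded PositionUpdateJob.
[BurstCompile]
public struct ParallelPositionJob : IJobParallelFor
{
    [WriteOnly] public NativeArray<float3> Positions; // one entry per transform
    public int XCount;
    public int YCount;
    public float Time;

    public void Execute(int index)
    {
        // Recover the grid coordinates the nested loops used: index = i * YCount + j.
        var i = index / YCount;
        var j = index % YCount;

        var x = i - XCount * 0.5f + 0.5f;
        var z = j - YCount * 0.5f + 0.5f;
        var y = math.sin(math.sqrt(x * x + z * z) * 0.4f - Time * 2f);

        Positions[index] = new float3(x, y, z);
    }
}
```

It would be scheduled with something like `job.Schedule(job.Positions.Length, 64)`, and the returned handle passed as the dependency of the transform job, the same way `positionHandle` is used in the first post.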
  10. mm_hohu

    PositionUpdateJob is just generating data for testing; I am not discussing the performance of PositionUpdateJob.
     
  11. Saniell

    Saniell

    Joined:
    Oct 24, 2015
    Posts:
    195
    Correct me if I'm wrong, but in the first screenshot you have a job that takes 1.5 ms, and in the second one it takes 0.4 ms. I don't see how it's slower?
     
  12. mm_hohu

    The sample you gave me shows good performance.
    It seems to be faster to have one parent Transform per thread.

    I am building a SourceGenerator-based Transform auto-Burst system, and the project is almost complete.
    The last open problem was how granularly to batch Transforms to get the best performance.
     
    Last edited: Nov 10, 2022
  13. mm_hohu

    Screenshots 1 and 2 show the same number of Transforms being processed, but the total CPU time has increased to 5.76 ms after being split into 16 threads due to multithreading.

    I wondered about this unpredictable performance result.

    I understand that there is an overhead to multithreading, but I find this result to be rather poor.
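
One way to separate wall-clock time from the summed per-thread CPU time discussed above is to time the schedule-to-complete span directly. The following is a minimal sketch under stated assumptions: `JobTiming` and `MeasureWallClockMs` are hypothetical names, not Unity API. It only covers the job itself, so work that runs afterwards (such as the TransformChangedDispatch mentioned in the first post) would still have to be read from the Profiler.

```csharp
using System.Diagnostics;
using UnityEngine.Jobs;

// Hypothetical helper for comparing parented vs. root layouts by wall-clock time.
public static class JobTiming
{
    // Schedules the job, blocks until it finishes, and returns the elapsed milliseconds.
    public static double MeasureWallClockMs<T>(T job, TransformAccessArray transforms)
        where T : struct, IJobParallelForTransform
    {
        var stopwatch = Stopwatch.StartNew();
        var handle = job.Schedule(transforms);
        handle.Complete(); // block so the measurement spans the whole job
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }
}
```

Blocking on `Complete()` right after `Schedule()` defeats the purpose of running jobs asynchronously, so this is meant for one-off measurements rather than production code.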
     
  14. Saniell

    If I had to guess, it may happen because transforms that don't share a root are not stored next to each other in memory, so you're getting higher parallelism but slower memory access. That is, assuming the first screenshot has all game objects sharing the same parent.
     
  15. mm_hohu

    Indeed, I agree with what you say. Ultimately, it is better to use the mechanism provided by ECS or other systems.
    After all, the performance did not seem to be much different between having a proper parent transform and not having one.
    I accept the performance degradation of multi-threading as unavoidable under the current circumstances.
     
  16. Saniell

    Well, you can try having 16 root objects and distributing the children among them. That way you'll force the JobSystem to "batch" jobs. I wonder if this would change anything.
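
A rough sketch of that idea, with the root count taken from the machine instead of being hard-coded. Everything here is an assumption for illustration: `RootBatching` and `CreateBatched` are hypothetical names, and using `JobsUtility.JobWorkerCount` as the default number of roots is only a starting guess that would need profiling on each target device.

```csharp
using Unity.Jobs.LowLevel.Unsafe;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical helper: spreads new GameObjects over a small number of root objects.
public static class RootBatching
{
    public static TransformAccessArray CreateBatched(int count, int rootCount = 0)
    {
        // Assumption: one root per job worker thread is a reasonable first guess.
        if (rootCount <= 0) rootCount = Mathf.Max(1, JobsUtility.JobWorkerCount);

        var roots = new Transform[rootCount];
        for (var r = 0; r < rootCount; r++)
            roots[r] = new GameObject($"BatchRoot_{r}").transform;

        var transforms = new Transform[count];
        for (var i = 0; i < count; i++)
        {
            // Contiguous blocks of children per root; also works when count
            // is not an exact multiple of rootCount.
            var rootIndex = (int)((long)i * rootCount / count);
            var go = new GameObject();
            go.transform.SetParent(roots[rootIndex], false);
            transforms[i] = go.transform;
        }

        // The caller owns the array and must Dispose() it.
        return new TransformAccessArray(transforms);
    }
}
```

Calling `RootBatching.CreateBatched(300 * 300, 16)` would reproduce the 16-root layout suggested here; omitting the second argument falls back to the worker-count guess.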
     
  17. Trindenberg

    I updated the test so I could split the transforms across a number of roots (shown in blue, defined by the steps), although I should probably take out the calculations to show the real difference.

    As for using game objects without a parent, each one would be its own root, so you would get some parallel performance but with a lot of roots (90k in this test).

    [Image: upload_2022-11-10_16-2-10.png]
     
  18. Trindenberg

    I got the best result at around 11 roots, probably because I have a 6-core, 12-thread CPU. Past that, it starts to degrade slowly.
     
    Last edited: Nov 10, 2022
  19. mm_hohu

    Sorry, I was in a hurry earlier and misread the results. The performance seems to have improved with your sample.
     
  20. mm_hohu

    I need more time to examine the details before I can draw any conclusions.
    I am beginning to feel that it is best to have a minimum number of parent Transforms per thread.
    I will do some more research. (But I have to leave for a while for another job.)
    Thanks for your reply.

    @Saniell @Trindenberg