Question ParallelForTransform Job is performing worse than running on the main thread

Discussion in 'Entity Component System' started by yonicstudios, Jun 25, 2022.

  1. yonicstudios

    Joined:
    Jul 15, 2018
    Posts:
    7
    I have a very simple IJobParallelForTransform that loops over roughly 27k Transforms and sets the rotation of each one to a value produced by a Compute Shader. However, compared to simply running the loop on the main thread, the job version is half as fast. I've already confirmed that the Compute Shader has nothing to do with this: the job still slows things down when it's given a constant array of values.

    Here is the code I use for the job:
    Code (CSharp):
    [BurstCompile]
    public struct DoorUpdateJob : IJobParallelForTransform {

        [ReadOnly] public NativeArray<float> values;

        public void Execute(int index, TransformAccess transform) {
            transform.rotation = quaternion.Euler(math.PI * 0.5f, math.PI * 0.5f * values[index], 0f);
        }
    }
    And its scheduling in the manager object (which also spawns the 27k+ objects):
    Code (CSharp):
    public class BadAppleDoor : MonoBehaviour {
        private float[] tileData;
        private Transform[] allTransforms;
        private NativeArray<float> dataArray;
        private TransformAccessArray transforms;
        private JobHandle jobHandle;

        private void Update() {
            // ...Reading the data from the Compute Shader into tileData.
            // Additional cleanup for the Compute Shader is run afterwards.

            dataArray = new(tileData, Allocator.TempJob);

            var job = new DoorUpdateJob {
                values = dataArray
            };

            transforms = new TransformAccessArray(allTransforms);
            jobHandle = job.Schedule(transforms);
        }

        private void LateUpdate() {
            jobHandle.Complete();
            transforms.Dispose();
            dataArray.Dispose();
        }
    }
    Attached below are the Profiler results for both the Main Thread and ParallelForTransform Job versions, as well as a Deep Profile of the latter (the Main Thread version stalls considerably during a Deep Profile). It seems like the workers aren't doing anything.
     


    Last edited: Jun 25, 2022
  2. xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    Few tips:
    - Don't re-allocate the TransformAccessArray each frame; it produces GC garbage and is extremely slow to set up. Cache it.
    - If transforms are parented to the same parent, they will not be processed in parallel.
    The best case is when the processed transforms have no parent at all.
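
    A minimal sketch of the first tip, reusing the OP's DoorUpdateJob (the class and field names here are illustrative): allocate the TransformAccessArray and NativeArray once, and only schedule work per frame.
    Code (CSharp):
    public class CachedDoorManager : MonoBehaviour {
        public Transform[] allTransforms;

        private NativeArray<float> dataArray;
        private TransformAccessArray transforms;
        private JobHandle jobHandle;

        private void Start() {
            // Allocated once; re-creating these per frame is the slow path.
            dataArray = new NativeArray<float>(allTransforms.Length, Allocator.Persistent);
            transforms = new TransformAccessArray(allTransforms);
        }

        private void Update() {
            // ...fill dataArray...
            jobHandle = new DoorUpdateJob { values = dataArray }.Schedule(transforms);
        }

        private void LateUpdate() {
            jobHandle.Complete();
        }

        private void OnDestroy() {
            jobHandle.Complete();
            transforms.Dispose();
            dataArray.Dispose();
        }
    }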
     
  3. vectorized-runner

    Joined:
    Jan 22, 2018
    Posts:
    398
    Have you tried calling Complete on the previous frame's jobs at the start of Update, then scheduling the new ones?
     
  4. Kmsxkuse

    Joined:
    Feb 15, 2019
    Posts:
    306
    I just copied Unity's CopyTransformFrom/ToGameObject system for my transform access processing.

    Code (CSharp):
    protected override void OnUpdate()
    {
        var access = _query.GetTransformAccessArray();
        var data = new NativeArray<GoTrs>(access.length, Allocator.TempJob);

        Dependency = new PullLocalToWorld
        {
            LocalToWorlds = data
        }.ScheduleReadOnly(access, 32, Dependency);

        Entities
            .WithName("SyncGameObjectTrsWithEntity")
            .WithAll<SyncTrsFromGo>()
            .WithDisposeOnCompletion(data)
            .ForEach((int entityInQueryIndex, ref GoTrs local) => { local = data[entityInQueryIndex]; }).Schedule();
    }

    [BurstCompile]
    private struct PullLocalToWorld : IJobParallelForTransform
    {
        public NativeArray<GoTrs> LocalToWorlds;

        public void Execute(int index, TransformAccess transform)
        {
            LocalToWorlds[index] = new()
            {
                Matrix = transform.localToWorldMatrix
            };
        }
    }
    Where GoTrs is just a wrapper component around a Matrix4x4.

    I remember why my first blurb didn't work. TransformAccessArray is annoying that way.
     
    Last edited: Jun 26, 2022
  5. xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    _query.GetTransformAccessArray does re-allocate the array when the entity count changes (unless that was changed in 0.50/0.51).

    So it's worth maintaining the TAA manually.
     
  6. Kmsxkuse

    Joined:
    Feb 15, 2019
    Posts:
    306
    Code (CSharp):
    if (state.Data.isCreated && orderVersion == state.OrderVersion)
        return state.Data;

    state.OrderVersion = orderVersion;
    UnityEngine.Profiling.Profiler.BeginSample("DirtyTransformAccessArrayUpdate");
    var trans = group.ToComponentArray<Transform>();
    if (!state.Data.isCreated)
        state.Data = new TransformAccessArray(trans);
    else
        state.Data.SetTransforms(trans);
    UnityEngine.Profiling.Profiler.EndSample();
    group._CachedState = state;
    Unity already does that for you; doing it manually means at best reinventing the wheel and at worst using an invalid TAA.

    The state originates from per-query cached data, not from a universal entity component storage cached state. So as long as the query does not expand or change, the cache remains valid and is returned as such.
     
  7. xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    The cache gets trashed each time the query changes, as you've mentioned.
    For example, when a new entity is added or removed, the TAA will be re-created.
    That can be a pretty large GC spike, observable in the Profiler, depending on the entity / transform count.

    In a dynamic game it's a no-go, especially when the transform count gets into the 1k+ range.
    From my testing, it's faster to allocate a larger array and check in the job whether the CDFE has the component than to constantly re-allocate the TAA.
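
    A sketch of that alternative (type and field names here are illustrative; ComponentDataFromEntity is the pre-1.0 Entities API): size the TAA generously once, keep a parallel Entity array, and have the job skip stale slots.
    Code (CSharp):
    [BurstCompile]
    private struct WriteTransformsJob : IJobParallelForTransform {
        [ReadOnly] public NativeArray<Entity> entities; // one slot per transform in the oversized TAA
        [ReadOnly] public ComponentDataFromEntity<LocalToWorld> localToWorldFromEntity;

        public void Execute(int index, TransformAccess transform) {
            var entity = entities[index];
            // Skip slots whose entity lost the component instead of rebuilding the TAA.
            if (!localToWorldFromEntity.HasComponent(entity))
                return;

            transform.position = localToWorldFromEntity[entity].Position;
        }
    }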

    See my post here:
    https://forum.unity.com/threads/imp...form-system-performance.1048907/#post-6797624

    If the query / entities do not change often (or ever), then it can be used as-is. But it's super inefficient, and frankly a better API should be provided for transform access.

    The main issue here is Transform being a managed component.
    But I don't think that will change anytime soon.
     
    Last edited: Jun 26, 2022
  8. Kmsxkuse

    Joined:
    Feb 15, 2019
    Posts:
    306
    Yea, I agree on that. My situation has entities created at startup and then never changed, added, or removed, so I stick with the current API.

    IIRC, Unity was planning on deleting the TransformAccess pathway back in the 0.15 to 0.17 patch upgrade but quickly reversed course when a lot of people complained it would break everything. So they're already disappointed with its performance but haven't offered anything better.

    Judging by the preliminary BatchRendererGroup API in the current 2022 alpha, the plan is to completely decouple the GameObject's transform from the Entity and pull the actual rendering LocalToWorld matrix from the Entity's L2W rather than the GO's Transform Matrix4x4. Makes sense; I do that myself with my own custom sprite renderer, but my physics goes through the Physics2D module in core Unity, so all movement must first be passed through a MonoBehaviour before being mirrored on the ECS side.

    That way, the entire pathway becomes irrelevant in a pure DOTS system. Hybrid though, uhhh, gets shafted?
     
  9. xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    Frankly, it would be nicer to have "Transform" living as a separate "hybrid" / lookup system.

    That way both GOs and Entities could access the TRS without jumping through external hoops to the C++ side, and the data could be accessed / queried as a "subtree" from the Entities side just by adding a dependency on the system. Moreover, with such an approach, modifications to the jobs / transform mechanisms / package would be possible, for example some niche optimizations for 2D.

    But it would probably break lots of (if not all) legacy projects, though.
     
  10. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,258
    I believe using TAA Add and RemoveAtSwapBack avoids the GC and also provides a way to incrementally update a TAA.
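
    A sketch of that incremental pattern (Add and RemoveAtSwapBack are the real TransformAccessArray methods; the surrounding code is illustrative):
    Code (CSharp):
    // Allocate once with spare capacity, then mutate in place.
    var taa = new TransformAccessArray(capacity: 32768);

    // When a transform spawns:
    taa.Add(spawnedTransform);

    // When the transform at slot i is destroyed. Note this swaps the last
    // element into slot i, so mirror the swap in any parallel data arrays.
    taa.RemoveAtSwapBack(i);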
     
    Krajca and xVergilx like this.
  11. yonicstudios

    Joined:
    Jul 15, 2018
    Posts:
    7
    Caching the TAA seems to make it perform on par with not using parallel jobs, so I assume it won't perform any better without delving into ECS? The transforms don't have a parent.
     
  12. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,258
    What does your caching code look like?

    TAA is definitely faster when used correctly, as I demonstrated in this demo: https://github.com/Dreaming381/HeartPromoDOTS

    Specifically, compare version 5 and version 6 of the manager.
     
  13. xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    If it's not executed in parallel and the job is kept on the main thread, then there's probably a reason why.
    See the Profiler's Timeline.

    In theory, offloading to multiple threads should make it faster.

    Try tearing stuff down, leaving the bare minimum, to figure out what's preventing it.

    Edit:
    I think I know why. Try not to call .Complete in LateUpdate; see if it makes a difference.
    Also, you can schedule a native collection for disposal by calling .Dispose(jobHandle) on the collection.
    That way the collection gets disposed automatically after the job completes.
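
    A sketch of both suggestions applied to the original manager (field names follow the OP's post): complete last frame's job at the top of Update instead of in LateUpdate, and chain disposal onto the job handle.
    Code (CSharp):
    private void Update() {
        // Finish last frame's job only once its results are actually needed.
        jobHandle.Complete();

        // ...read the Compute Shader data into tileData...

        var dataArray = new NativeArray<float>(tileData, Allocator.TempJob);
        jobHandle = new DoorUpdateJob { values = dataArray }.Schedule(transforms);

        // The array disposes itself after the job completes; no LateUpdate cleanup needed.
        jobHandle = dataArray.Dispose(jobHandle);
    }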
     
    Last edited: Jun 27, 2022
  14. TTG-Quintessential

    Joined:
    Dec 5, 2017
    Posts:
    7
    Can I ask why not to call .Complete in LateUpdate? Why do you mention this, and what were your findings? I found this post because I DO call it in LateUpdate and I have this problem. I can try to refactor, but if you have any insight as to the whys and wherefores, that would be great.
     
  15. xVergilx

    Joined:
    Dec 22, 2014
    Posts:
    3,296
    With large numbers of transforms, it's beneficial to just let the jobs run until the next frame instead of completing them instantly.
    Less main thread stalling, less job system stalling.
    For this case, call .Complete right before scheduling the new job chain instead.

    While it *may* produce a one-frame delay, in most cases that's unnoticeable to the user, and the speed benefits are usually worth it at scale.

    Check the profiler first to see where the bottleneck is. In most cases the issue is not job scheduling but poor TAA management.
     
  16. n3b

    Joined:
    Nov 16, 2014
    Posts:
    56
    If you take a look at TransformParallelForLoopStruct, you'll notice it pulls two collections:
    Code (CSharp):
    int* sortedToUserIndex = (int*) (void*) TransformAccessArray.GetSortedToUserIndex(output.TransformAccessArray);
    TransformAccess* sortedTransformAccess = (TransformAccess*) (void*) TransformAccessArray.GetSortedTransformAccess(output.TransformAccessArray);
    Based on the method names, one might conclude that TAA is an unordered collection. However, it is unclear whether these collections are sorted upon insertion or during job scheduling. It's possible that any sorting (if it occurs) is performed on the main thread, and that the job itself is so cheap that it's undetectable in the profiler without zooming in.

    Besides, quaternion construction can be vectorized; that should give roughly another 4x boost.
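
    A hedged sketch of the math only (this does not map one-to-one onto IJobParallelForTransform's per-index Execute, so treat it as the vectorization idea, not a drop-in job): for the OP's Euler(π/2, θ, 0) rotation, the constant X part can be hoisted out, and four Y half-angle sin/cos pairs can be computed in one math.sincos call on a float4.
    Code (CSharp):
    // Constant X rotation, computed once outside the loop.
    var qx = quaternion.RotateX(math.PI * 0.5f);

    for (int i = 0; i < values.Length; i += 4) {
        // Half-angles of four Y rotations at once: 0.5f * (PI/2 * value).
        var halfAngles = 0.25f * math.PI * new float4(
            values[i], values[i + 1], values[i + 2], values[i + 3]);
        math.sincos(halfAngles, out float4 s, out float4 c);

        for (int lane = 0; lane < 4; lane++) {
            var qy = new quaternion(0f, s[lane], 0f, c[lane]); // RotateY from half-angle
            // Unity's ZXY Euler order with z = 0 reduces to qy * qx.
            var q = math.mul(qy, qx);
            // ...write q to the matching transform slot...
        }
    }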

    Edit: Just noticed it's a year old post, sry :D
     
    Last edited: May 28, 2023