Jobs Performance questions.

Discussion in 'Entity Component System' started by Gru, Mar 5, 2018.

  1. Gru

    Joined: Dec 23, 2012
    Posts: 142
    I've recently tried the 2018.1b9 version and ran the jobs code that Joachim was showing.

    using UnityEngine;
    using UnityEngine.Jobs;
    using Unity.Collections;
    using Unity.Jobs;

    public class RotateCubes : MonoBehaviour {   // wrapper class added so the snippet compiles on its own
        private TransformAccessArray _transforms;
        private NativeArray<float> _speeds;
        private JobHandle _jobHandle;

        void Start() {
            // Instantiating objects and setting up references.
        }

        // Rotates each transform around the Y axis at its own speed.
        struct RotateJob : IJobParallelForTransform {
            [ReadOnly]
            public NativeArray<float> speeds;
            public float deltaTime;

            public void Execute(int index, TransformAccess transform) {
                transform.rotation *= Quaternion.AngleAxis(speeds[index] * deltaTime, Vector3.up);
            }
        }

        void Update() {
            // Finish last frame's job before touching its data again.
            _jobHandle.Complete();

            var job = new RotateJob {
                speeds = _speeds,
                deltaTime = Time.deltaTime
            };
            _jobHandle = job.Schedule(_transforms);
        }

        private void OnDestroy() {
            // Make sure no job is still using the containers before disposing them.
            _jobHandle.Complete();
            _transforms.Dispose();
            _speeds.Dispose();
        }
    }

    The performance profile looks like this image: https://ibb.co/hwpvCn
    Questions:
    1. Shouldn't IJobParallelForTransform parallelize the work more? It seems to be doing everything on one worker thread while the other threads are idle.
    2. I understand the render bounding volumes update has to run after the transforms job. However, couldn't that one be parallelized too? Also, are there any tricks to speed this process up? This super-parallel code seems most useful with many GameObject instances, but I struggle to see how UpdateRenderBoundingVolumes will not become a bottleneck in those cases. I wanted to try turning off the renderers that are behind the camera, but they aren't exposed in a native collection.
    3. I've seen `_jobHandle.Complete();` right at the start of Update in a couple of places in that presentation. However, doesn't that limit the jobs to running for only one frame? I have also tried it with `if (!_jobHandle.IsCompleted) return;` but haven't seen a performance difference. What happens with the render bounding volumes in this case: does Unity have to refresh all the bounding volumes anyway, whether or not the job updated them this frame? Does that mean we can't have multi-frame transform jobs (which would hopefully still be parallelized)?
     
  2. timjohansson

    Unity Technologies
    Joined: Jul 13, 2016
    Posts: 473
    IJobParallelForTransform has a specific limitation: it cannot split one transform hierarchy into multiple jobs. This means that if all the GameObjects you are processing are under a common parent, there can never be more than one job. This applies even if the parent GameObject is not visible and has the identity transform.
    I suspect this is the limitation you are seeing in this case. To make it run in parallel you must split your GameObjects into multiple hierarchies (see the sketch below).
    The same parallelization limitation applies to UpdateRenderBoundingVolumes, which is likely why it is not running in parallel either.
    Calling Complete at the beginning of Update will indeed limit your jobs to one frame, but that is what you want in most cases. Here you cannot have jobs spanning multiple frames anyway, since rendering uses the transforms your jobs are writing and will wait for them.
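
    A minimal sketch of splitting into multiple hierarchies (the spawner class and its fields are made up for illustration, not code from the presentation):

    Code (csharp):

    using UnityEngine;

    public class UnparentedSpawner : MonoBehaviour {
        public GameObject prefab;
        public int count = 10000;

        void Start() {
            for (int i = 0; i < count; i++) {
                // No parent argument: every instance is its own root, i.e. its
                // own internal hierarchy, so IJobParallelForTransform can split
                // the work across multiple worker threads.
                Instantiate(prefab, Random.insideUnitSphere * 50f, Quaternion.identity);
            }
        }
    }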
     
  3. Gru

    Joined: Dec 23, 2012
    Posts: 142
    Hi Tim, thanks for the reply. That's exactly on point: I unparented the transforms and the jobs are now running on all the cores.
    I understand the limitation, so this brings me to another vote for that feature of "transformless folder-like editor-only game objects" that likely comes up from time to time :)
    I can also see how the reads could be incorrect if another thread is writing to the data at the time without using locks.

    All in all I'm looking forward to the rest of the new system. There's a lot of conceptual cleverness going into it.
     
  4. Peter77

    QA Jesus
    Joined: Jun 12, 2013
    Posts: 6,618
    Many projects use identity transforms to organize the hierarchy. UT did write some blog posts about this, saying it's bad practice, but it's what you'll often find in the wild.

    If I understand you correctly, such a setup will completely negate any performance benefit of IJobParallelForTransform, and those objects should be root objects instead?
     
  5. Joachim_Ante

    Unity Technologies
    Joined: Mar 16, 2005
    Posts: 5,203
    IJobParallelForTransform runs asynchronously to the main thread until someone else needs to access the data from another job. This alone is a huge optimization opportunity. Jobs going wide is far from the only reason to use jobs...

    If you really want to get next-level performance for Transform components, then the upcoming preview of the Entity Component System is the safest bet.
     
    Gru and hippocoder like this.
  6. Gru

    Joined: Dec 23, 2012
    Posts: 142
    The limitation is there but personally this one doesn't bother me much.

    Here is my understanding of the "bad practice" part: if we instantiate the objects at runtime (which will surely be the case with these 10,000-object hierarchies) and parent the transforms under a common root, then all the transforms that come after that root in the hierarchy have to be shuffled in memory, causing possible hiccups at instantiation time. Another issue is sending the transform-changed events to all parents upwards in the hierarchy, as detailed here. One way to resolve this is an #if in the spawner script so the parenting only happens in the editor (not in a build); or, similarly, automatically unparent in a build if the objects are pre-instantiated in the scene. A sketch of the first option is below.
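
    Code (csharp):

    // Sketch of the editor-only parenting idea; the class and its fields
    // are assumed names, not code from the thread.
    using UnityEngine;

    public class OrganizedSpawner : MonoBehaviour {
        public GameObject prefab;
        public Transform editorOnlyRoot;   // assumed: an empty "folder" object
        public int count = 10000;

        void Start() {
            for (int i = 0; i < count; i++) {
                GameObject go = Instantiate(prefab, Random.insideUnitSphere * 50f, Quaternion.identity);
    #if UNITY_EDITOR
                // Keep the hierarchy tidy while working in the editor; builds
                // skip this, so every instance stays a root and jobs can go wide.
                go.transform.SetParent(editorOnlyRoot, true);
    #endif
            }
        }
    }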

    There is also another way to do it without fully unparenting: switch from one common root to, say, 16 common roots, so we get 16-way parallelism while keeping the hierarchy moderately clean (sketch below).
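
    Code (csharp):

    // Sketch of the "16 common roots" idea; bucket count and names are assumed.
    using UnityEngine;

    public class BucketedSpawner : MonoBehaviour {
        public GameObject prefab;
        public int count = 10000;
        public int rootCount = 16;

        void Start() {
            // One empty root per bucket; each root is its own internal
            // hierarchy, so the transform job can run up to rootCount batches.
            var roots = new Transform[rootCount];
            for (int i = 0; i < rootCount; i++)
                roots[i] = new GameObject("Root_" + i).transform;

            for (int i = 0; i < count; i++) {
                GameObject go = Instantiate(prefab, Random.insideUnitSphere * 50f, Quaternion.identity);
                go.transform.SetParent(roots[i % rootCount], true);   // round-robin over the roots
            }
        }
    }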

    I can give my performance results in the following setup: I instantiate a bunch of cubes and rotate them every frame, just like in the presentation. I put the camera in the center of the spawner so only a portion of the cubes is rendered, otherwise I become rendering-bound very quickly.
    (Without the new system:)
    * Default solution, 120 FPS: the code with separate MonoBehaviours, each one updating itself in its own Update.
    * Update triggered by a manager, 170 FPS: all the cubes have their own rotation data and update method, but it is not triggered by Update; instead the manager calls their update (see the sketch after this list). This is a known way to handle things; it avoids the cost of the native-managed bridge that is paid for each separate Update call.
    * Update inside the manager, 170 FPS: the update code has been moved from the cube's MonoBehaviour to the manager. I suspect the performance is the same because in order to get a transform we have to chase its reference in memory, the same amount of work as in the previous case.

    (With the new system:)
    * Access through the TransformAccessArray indexer, 160 FPS: the same setup as above, but instead of indexing a C# List I index into the TransformAccessArray. I suspect the performance drop is due to more indirection.
    * IJobParallelForTransform, only on one thread, 188 FPS: there is a benefit from 170 to 188 FPS in this artificial case, and it would presumably be even higher if the main thread were doing other things. So here is the benefit of offloading work off the main thread.
    * Unparented IJobParallelForTransform, 230 FPS: as above, but unparented to achieve parallelism. The performance increase is significant, but there is still the price to pay for the actual rendering.
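
    The manager-driven update pattern, as a rough sketch (class names are assumed): one MonoBehaviour.Update call crosses the native-managed bridge once and ticks every cube, instead of N engine callbacks.

    Code (csharp):

    using System.Collections.Generic;
    using UnityEngine;

    public class CubeManager : MonoBehaviour {
        private readonly List<Cube> _cubes = new List<Cube>();

        public void Register(Cube cube) { _cubes.Add(cube); }   // called by the spawner

        void Update() {
            float dt = Time.deltaTime;
            for (int i = 0; i < _cubes.Count; i++)
                _cubes[i].Tick(dt);   // plain C# call, no engine callback per object
        }
    }

    public class Cube : MonoBehaviour {   // note: no Update method here
        public float speed;

        public void Tick(float dt) {
            transform.rotation *= Quaternion.AngleAxis(speed * dt, Vector3.up);
        }
    }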
     
    MechEthan likes this.
  7. snacktime

    Joined: Apr 15, 2013
    Posts: 3,356
    I haven't tested it, as I don't see an obvious difference in my scenario, but I wonder if marking them with DontDestroyOnLoad has any negative effects. I'm curious because if so that would suck; I do a lot of game-level object pooling.
     
  8. Joachim_Ante

    Unity Technologies
    Joined: Mar 16, 2005
    Posts: 5,203
    I should have been clearer. You are absolutely right. That is in fact the best layout for performance.

    Internally, when there is a root GameObject we create a dedicated hierarchy. Each hierarchy owns its own JobHandle, and further jobs are scheduled against it. Grabbing the position of a single transform on the main thread waits on all jobs against that specific hierarchy.

    Tons of one-GameObject hierarchies is not the best layout for transform hierarchy speed, because the position data will be scattered in memory; grouping things in batches improves both the scheduling cost and the memory access patterns in the job.
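
    To illustrate the sync behaviour with the code from the first post (someCube is an assumed Transform reference to one of the rotated cubes):

    Code (csharp):

    _jobHandle = job.Schedule(_transforms);

    // This main-thread read blocks until the jobs scheduled against
    // someCube's root hierarchy complete; jobs touching other root
    // hierarchies keep running.
    Vector3 p = someCube.position;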
     
  9. pvloon

    Joined: Oct 5, 2011
    Posts: 591
    What I would love is if we could manually mark transforms as being "roots" (i.e., having their own memory buffer and JobHandle). It can be hard and disorganizing to keep everything at scene root level, and it is often necessary to have some common enable/disable or translation state on otherwise separate groups. For example, many streamed games still keep entire regions under one GameObject because that makes it easy to enable/disable those regions.

    We often know more about the structure of our hierarchy and how memory should be batched, but right now it can be quite cumbersome to refactor these hierarchies out to the Unity scene root.
     
  10. MadeFromPolygons

    Joined: Oct 5, 2013
    Posts: 3,982
    @Joachim_Ante if there is some way to do this without massive perf impacts or stability problems, this sounds like a really nice and intuitive feature
     
  11. snacktime

    Joined: Apr 15, 2013
    Posts: 3,356
    Create an abstraction for parenting and have it partition by scope, and then be able to enable/disable by scope. Something like this for an API:

    Code (csharp):

    SetParent(GameObject gameObject, Transform parent, PartitionName partitionName);
    Activate(PartitionName partitionName);
    Deactivate(PartitionName partitionName);
    In development, just have a flag so it doesn't partition and uses the passed-in parent instead.
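
    A user-land sketch of what that abstraction could look like (everything here is assumed, using plain strings for partition names; this is not an existing Unity API):

    Code (csharp):

    using System.Collections.Generic;
    using UnityEngine;

    public static class Partitions {
        private static readonly Dictionary<string, Transform> _roots =
            new Dictionary<string, Transform>();

        private static Transform Root(string partitionName) {
            if (!_roots.TryGetValue(partitionName, out Transform root)) {
                root = new GameObject("Partition_" + partitionName).transform;
                _roots[partitionName] = root;
            }
            return root;
        }

        public static void SetParent(GameObject go, Transform parent, string partitionName) {
            // In development you could honor 'parent' directly; here we
            // parent to the partition root to keep the hierarchies split.
            go.transform.SetParent(Root(partitionName), true);
        }

        public static void Activate(string partitionName) => Root(partitionName).gameObject.SetActive(true);
        public static void Deactivate(string partitionName) => Root(partitionName).gameObject.SetActive(false);
    }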
     
  12. snacktime

    Joined: Apr 15, 2013
    Posts: 3,356
    FYI, here is a partitioning object pool that handles GameObjects with a single renderer as well as complex particle effects. You can ask it, at the time you request an object, to release it after X seconds, and there is a helpful feature where you can pass it some other object and it will check whether that object has a pooled object as a child and remove it. Useful when you want to make sure an object you are destroying doesn't still have a pooled object as a child.

    https://gist.github.com/gamemachine/66663c5460e6aa5b99be138c4a3de341
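
    The core timed-release idea, as a rough sketch (this is not the gist's actual code; names are made up and edge cases are left out):

    Code (csharp):

    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;

    public class SimplePool : MonoBehaviour {
        public GameObject prefab;
        private readonly Stack<GameObject> _free = new Stack<GameObject>();

        public GameObject Spawn(Vector3 position, float releaseAfterSeconds) {
            GameObject go = _free.Count > 0 ? _free.Pop() : Instantiate(prefab);
            go.transform.position = position;
            go.SetActive(true);
            if (releaseAfterSeconds > 0f)
                StartCoroutine(ReleaseLater(go, releaseAfterSeconds));
            return go;
        }

        public void Release(GameObject go) {
            go.SetActive(false);
            _free.Push(go);
        }

        private IEnumerator ReleaseLater(GameObject go, float seconds) {
            yield return new WaitForSeconds(seconds);
            if (go.activeSelf) Release(go);   // a real pool also guards against re-spawned instances
        }
    }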
     
  13. IsaiahKelly

    Joined: Nov 11, 2012
    Posts: 418
    @Peter77 @Gru @Joachim_Ante Having some kind of cosmetic way to organize the hierarchy would be really nice: some kind of dummy object with no transform. Call it a "hierarchy/scene folder" and give it a special icon. However, can you not just use multiple additive scenes as a replacement for huge root objects? Is there any kind of performance disadvantage to going that way?

    If you're "streaming" entire regions, why wouldn't you use additive scenes instead? I thought that was their whole purpose!
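
    For example, a minimal region streamer using additive scenes (the scene names and class are assumed): each region lives in its own scene, so its objects are roots of that scene rather than children of one giant region GameObject.

    Code (csharp):

    using UnityEngine;
    using UnityEngine.SceneManagement;

    public class RegionStreamer : MonoBehaviour {
        public void LoadRegion(string sceneName) {
            // Loads the region alongside whatever is already loaded.
            SceneManager.LoadSceneAsync(sceneName, LoadSceneMode.Additive);
        }

        public void UnloadRegion(string sceneName) {
            SceneManager.UnloadSceneAsync(sceneName);
        }
    }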
     
    Last edited: Mar 11, 2018
  14. snacktime

    Joined: Apr 15, 2013
    Posts: 3,356
    They already do that with objects marked as DontDestroyOnLoad, so it is possible.