How to improve jobs performance

Discussion in 'Data Oriented Technology Stack' started by dreamerflyer, Jan 21, 2018.

  1. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    I made 10,000 boxes, but the FPS is very low and memory usage is very high. Any hints on how to improve that? boxes.jpg
     


  2. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    Code (CSharp):
    using UnityEngine;
    using Unity.Collections;
    using Unity.Jobs;
    class move : MonoBehaviour
    {
        struct VelocityJob : IJob
        {
            // Jobs declare all data that will be accessed in the job
            // By declaring it as read only, multiple jobs are allowed to access the data in parallel
            [ReadOnly]
            public NativeArray<Vector3> velocity;
            // By default containers are assumed to be read & write
            public NativeArray<Vector3> position;
            // Delta time must be copied to the job since jobs generally don't have concept of a frame.
            // The main thread waits for the job on the same frame or the next frame, but the job should
            // perform work in a deterministic and independent way when running on worker threads.
            public float deltaTime;
            // The code actually running on the job
            public void Execute()
            {
                // Move the positions based on delta time and velocity
                for (var i = 0; i < position.Length; i++)
                    position[i] = position[i] + velocity[i] * deltaTime;
            }
        }
        public GameObject ball;
        Transform[] objs;
        int objNum = 10000;

        private void Start()
        {
            objs = new Transform[objNum];
            for (int i = 0; i < objNum; i++)
            {
                GameObject obj = GameObject.Instantiate(ball, new Vector3(i * 1, 0, 0), Quaternion.identity);
                objs[i] = obj.transform;
            }
        }
        public void Update()
        {
            var position = new NativeArray<Vector3>(objNum, Allocator.Persistent);
            var velocity = new NativeArray<Vector3>(objNum, Allocator.Persistent);
            for (var i = 0; i < velocity.Length; i++)
                velocity[i] = new Vector3(0, 10, 0);
            // Initialize the job data
            var job = new VelocityJob()
            {
                deltaTime = Time.deltaTime,
                position = position,
                velocity = velocity
            };
            // Schedule the job, returns the JobHandle which can be waited upon later on
            JobHandle jobHandle = job.Schedule();
            // Ensure the job has completed
            // It is not recommended to Complete a job immediately,
            // since that gives you no actual parallelism.
            // You optimally want to schedule a job early in a frame and then wait for it later in the frame.
            jobHandle.Complete();

            //    Debug.Log(job.position[0]);
            for (int i = 0; i < objNum; i++)
            {
                // The main thread may only read the job's data after Complete().
                objs[i].transform.position = job.position[i];
            }

            // Native arrays must be disposed manually
            position.Dispose();
            velocity.Dispose();
        }
    }
    Maybe this is a good fit for IJobParallelForTransform? But I cannot find an example in the Manual.
     
  3. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    Profiling in the editor is often misleading, because the performance characteristics of a Player and the Editor differ significantly. In the context of the Job System, you probably want to become comfortable using the CPU Usage "Timeline" view, which lets you see how threads are utilized. Make sure to turn off Vsync when you profile, btw.

    According to your screenshot, most time is being spent in rendering (the green area). Looking at the attached project, you instantiate thousands of GameObjects with a custom material, which doesn't seem to have GPU Instancing enabled. However, even if they used GPU Instancing, using a lot of GameObjects has always been inefficient across various systems in Unity.

    From the few Job System Unite talks I watched, it seems that to get the most out of the new tech, people move away from GameObjects when massive numbers of entities are required, and use a custom solution (or the upcoming Entity Component System) to avoid the overhead that comes with GameObjects. I recommend watching this talk.

    Let's move on to your specific example project.

    In the "move" MonoBehaviour, you end up doing a lot of things in Update, which runs on the main thread. The only thing that runs on worker threads is the VelocityJob, which is just a bit of vector math and shouldn't be expensive to begin with.

    Here are a few tips:

    Code (CSharp):
    for (int i = 0; i < objNum; i++)
    {
        objs[i].transform.position = job.position[i];
    }
    You're changing Transform.position on the main thread. Historically, this is the expensive part. You probably want to move this to worker threads. See IJobParallelForTransform and TransformAccessArray.

    Code (CSharp):
    var velocity = new NativeArray<Vector3>(objNum, Allocator.Persistent);
    for (var i = 0; i < velocity.Length; i++)
        velocity[i] = new Vector3(0, 10, 0);

    If you overwrite the whole array anyway, you can skip the default zero-initialization of each element by passing NativeArrayOptions.None to the NativeArray constructor.

    Rather than creating and setting up entire arrays every frame from scratch, you can also keep them around and only update the elements that changed.
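The two tips above can be combined into a rough sketch: allocate the array once, skip the zero-fill, and only touch the elements that change. The class name `ReuseNativeArrays` and the "only element 0 changes" rule are made up for illustration. The sketch also assumes a Unity version where the option is spelled `NativeArrayOptions.UninitializedMemory`; this thread uses the name `NativeArrayOptions.None`.

```csharp
using Unity.Collections;
using UnityEngine;

// Hypothetical example component, not taken from the attached project.
public class ReuseNativeArrays : MonoBehaviour
{
    const int Count = 10000;
    NativeArray<Vector3> m_Velocity;

    void OnEnable()
    {
        // Allocate once. The zero-fill is skipped because every element
        // is overwritten in the loop right below.
        m_Velocity = new NativeArray<Vector3>(
            Count, Allocator.Persistent, NativeArrayOptions.UninitializedMemory);

        for (var i = 0; i < m_Velocity.Length; i++)
            m_Velocity[i] = new Vector3(0, 10, 0);
    }

    void Update()
    {
        // Per frame, write only the elements that actually changed
        // (here, as a placeholder, just element 0).
        m_Velocity[0] = new Vector3(0, 10 + Mathf.Sin(Time.time), 0);
    }

    void OnDisable()
    {
        // Persistent allocations live until they are disposed explicitly.
        if (m_Velocity.IsCreated)
            m_Velocity.Dispose();
    }
}
```

If a job reads the array, complete that job's handle before writing to the array in Update and before disposing it.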
     
    Last edited: Jan 21, 2018
  4. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    Thanks for your reply. I do not know how to use IJobParallelForTransform. How do I control a Transform's position with it?
     
  5. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    Code (CSharp):
    struct MyJob : UnityEngine.Jobs.IJobParallelForTransform
    {
        public void Execute(int index, UnityEngine.Jobs.TransformAccess transform)
        {
            // Placeholder: assign whatever position you computed for this element.
            transform.position = UnityEngine.Vector3.zero;
        }
    }
     
  6. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    Is TransformAccess the GameObject's transform? I cannot find out how to use it.
     
    Last edited: Jan 21, 2018
  7. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    Create a TransformAccessArray, add your Transforms to it and pass the TransformAccessArray to IJobParallelForTransform.Schedule().
     
  8. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    Any example? Thanks a lot.
     
  9. Enrico-Monese

    Enrico-Monese

    Joined:
    Dec 18, 2015
    Posts:
    56
    Edit: TransformAccessArray can only be used in IJobParallelForTransform, not in standard IJob

    When I try to use TransformAccessArray I get this error. It says you can disable this check using [NativeDisableUnsafePtrRestriction], but I have no clue where to put it :/
    InvalidOperationException: IKJob.transforms.m_TransformArray uses unsafe Pointers which is not allowed. Unsafe Pointers can lead to crashes and no safety against race conditions can be provided.
    If you really need to use unsafe pointers, you can disable this check using [NativeDisableUnsafePtrRestriction].
    Unity.Jobs.LowLevel.Unsafe.JobsUtility.CreateJobReflectionData (System.Type type, System.Object managedJobFunction0, System.Object managedJobFunction1, System.Object managedJobFunction2) (at /Users/builduser/buildslave/unity/build/Runtime/Jobs/ScriptBindings/Jobs.bindings.cs:74)
    Unity.Jobs.IJobExtensions+JobStruct`1[T].Initialize () (at /Users/builduser/buildslave/unity/build/Runtime/Jobs/Managed/IJob.cs:22)
    Unity.Jobs.IJobExtensions.Schedule[T] (T jobData, Unity.Jobs.JobHandle dependsOn) (at /Users/builduser/buildslave/unity/build/Runtime/Jobs/Managed/IJob.cs:35)
    Edraflame.Arm.IKasync.Update () (at Assets/Scripts/Arm/IKasync.cs:72)
     
    Last edited: Jan 23, 2018
    Streamfall likes this.
  10. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104


    I'm not an expert in the new Job System tech, but here is how I would approach this...
    Code (CSharp):
    using UnityEngine;
    using Unity.Collections;
    using Unity.Jobs;
    using UnityEngine.Jobs;

    class MyJobExample : MonoBehaviour
    {
        struct MyJob : UnityEngine.Jobs.IJobParallelForTransform
        {
            [ReadOnly]
            public NativeArray<Vector3> velocity;

            [ReadOnly]
            public float deltaTime;

            public void Execute(int index, TransformAccess transform)
            {
                var step = velocity[index] * deltaTime;

                transform.position = transform.position + step;
                transform.localRotation *= Quaternion.Euler(step * 100);
            }
        }

        int m_NumberOfCubes = 10000;
        TransformAccessArray m_TransformAccessArray;
        NativeArray<Vector3> m_VelocityArray;

        void Awake()
        {
            Physics.autoSyncTransforms = false;
            Physics2D.autoSyncTransforms = false;

            m_TransformAccessArray = new TransformAccessArray(m_NumberOfCubes);

            m_VelocityArray = new NativeArray<Vector3>(m_NumberOfCubes, Allocator.Persistent, NativeArrayOptions.None);
            for (var i = 0; i < m_VelocityArray.Length; i++)
                m_VelocityArray[i] = Random.insideUnitSphere;

            // Using massive amounts of GameObject's is inefficient!
            for (int i = 0; i < m_NumberOfCubes; i++)
            {
                var obj = GameObject.CreatePrimitive(PrimitiveType.Cube);
                obj.transform.position = Random.insideUnitSphere * 100;

                var collider = obj.GetComponent<Collider>();
                if (collider != null)
                    Destroy(collider);

                m_TransformAccessArray.Add(obj.transform);
            }
        }

        void OnDestroy()
        {
            // Make sure to release native resources
        }

        void Update()
        {
            var job = new MyJob()
            {
                deltaTime = Time.deltaTime,
                velocity = m_VelocityArray
            };

            job.Schedule(m_TransformAccessArray);
        }
    }
    This is running on a Desktop PC from 2008:
    jobs_profiler.png

    I don't understand why "ParticleSystem" takes up so much time, because there is no ParticleSystem in the scene.
     
    Last edited: Jan 21, 2018
    laurentlavigne and dreamerflyer like this.
  11. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    With 10,000 balls, performance is still not good. Memory usage is still very high and the FPS is very low.
    m.jpg cpu.jpg
     
  12. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    I can only repeat what I already wrote earlier:
    https://forum.unity.com/threads/how-improve-jobs-performace.513536/#post-3362057

    Profiling the game running inside the editor is often misleading, because the performance characteristics of a Player and the Editor differ significantly.

    According to the screenshot, most time is being spent in rendering (the green area). Using a lot of GameObjects has always been inefficient across various systems in Unity.

    It seems that to get the most out of the new tech, people move away from GameObjects when massive numbers of entities are required, and use a custom solution (or the upcoming Entity Component System) to avoid the overhead that comes with GameObjects. I recommend watching this talk.

    What you want to do is optimize the rendering part, that's where you can get the most performance improvement.
     
  13. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    16M triangles is a lot.
    10000 gameobjects is also a lot, for this I'd wait for the ECS
     
  14. richardkettlewell

    richardkettlewell

    Unity Technologies

    Joined:
    Sep 9, 2015
    Posts:
    1,185
    I’d love to see a bug report for this - would you mind filing one?
     
    Peter77 likes this.
  15. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    As for ECS, I think it is only a logical design pattern like MVC: it does not improve performance, it only makes the code more flexible when requirements change. And what about the memory? GPU instancing shares the same mesh memory, so why is it so big?
     
  16. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    I don't think ECS can improve this much. Hmm... maybe the triangle count is just too high.
     
  17. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    Sure, thanks for asking! Here is the Case number and I also created a separate forum thread.

    (Case 990576) 2018.1: ParticleSystem cost, without ParticleSystem in Scene
    https://forum.unity.com/threads/cas...-cost-without-particlesystem-in-scene.513665/

    You can use a Profiler to find out:
    https://docs.unity3d.com/Manual/ProfilerMemory.html
    https://bitbucket.org/Unity-Technologies/memoryprofiler
     
    richardkettlewell likes this.
  18. Roni92pl

    Roni92pl

    Joined:
    Jun 2, 2015
    Posts:
    268
    Did you read that comment? You don't allow any parallelization by forcing Complete() on the job immediately after starting it. I would yield one frame after Schedule(); that would give the job plenty of time to complete while the main thread would just wait for the result. Just make sure jobs do not overlap when using coroutines.
    Btw, as was mentioned before, your CPU is mostly busy with rendering, not with your rotating jobs.
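A minimal sketch of the "yield one frame after Schedule" idea, assuming the same kind of velocity job as in the first post. The class name `DeferredCompleteExample` and the `m_JobInFlight` guard are illustrative choices, not part of the original project. The guard is one way to avoid the overlap mentioned above.

```csharp
using System.Collections;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical example component.
public class DeferredCompleteExample : MonoBehaviour
{
    struct VelocityJob : IJob
    {
        [ReadOnly] public NativeArray<Vector3> velocity;
        public NativeArray<Vector3> position;
        public float deltaTime;

        public void Execute()
        {
            for (var i = 0; i < position.Length; i++)
                position[i] = position[i] + velocity[i] * deltaTime;
        }
    }

    const int Count = 10000;
    NativeArray<Vector3> m_Position, m_Velocity;
    JobHandle m_Handle;
    bool m_JobInFlight;

    void OnEnable()
    {
        m_Position = new NativeArray<Vector3>(Count, Allocator.Persistent);
        m_Velocity = new NativeArray<Vector3>(Count, Allocator.Persistent);
        for (var i = 0; i < Count; i++)
            m_Velocity[i] = new Vector3(0, 10, 0);
    }

    void Update()
    {
        // Never start a second job while the previous one is still pending.
        if (!m_JobInFlight)
            StartCoroutine(RunJob());
    }

    IEnumerator RunJob()
    {
        m_JobInFlight = true;
        m_Handle = new VelocityJob
        {
            velocity = m_Velocity,
            position = m_Position,
            deltaTime = Time.deltaTime
        }.Schedule();

        // Let one frame pass so the worker thread has time to finish.
        yield return null;

        // Usually returns immediately because the job is already done.
        m_Handle.Complete();
        // From here on the main thread may safely read m_Position.
        m_JobInFlight = false;
    }

    void OnDisable()
    {
        // Finish any pending job before freeing the memory it uses.
        m_Handle.Complete();
        m_JobInFlight = false;
        m_Position.Dispose();
        m_Velocity.Dispose();
    }
}
```

The trade-off is one frame of latency: results computed from frame N's delta time are applied in frame N+1.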
     
  19. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    25,578
    This feature had better come with some very good documentation or it'll sadly be misunderstood quite easily.
     
  20. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    OK. Do you have any ideas about using physics, raycasts and NavMesh pathfinding with these jobs?
     
  21. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    Navmesh and raycast are coming soon to a job near you.
     
  22. dreamerflyer

    dreamerflyer

    Joined:
    Jun 11, 2011
    Posts:
    901
    coming soon...this weekend?
     
  23. dadude123

    dadude123

    Joined:
    Feb 26, 2014
    Posts:
    787
    You have this in your code:
    Code (csharp):
        void Update()
        {
            var job = new MyJob()
            {
                deltaTime = Time.deltaTime,
                velocity = m_VelocityArray
            };

            job.Schedule(m_TransformAccessArray);
        }
    But wouldn't that schedule the job every frame even when the previous job has not yet completed?
     
  24. Krajca

    Krajca

    Joined:
    May 6, 2014
    Posts:
    91
    Can I somehow populate the TransformAccessArray m_TransformAccessArray without creating GameObjects?
     
  25. dadude123

    dadude123

    Joined:
    Feb 26, 2014
    Posts:
    787
    A transform is a special thing that also has a parent and children.

    If you are only interested in Position, Rotation, Scale, then a Matrix4x4 is the thing you want (it's what Transform uses internally most likely)
     
  26. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    I think so, but hard to tell without documentation. Perhaps jobs are forced to complete on frame end?
     
  27. dadude123

    dadude123

    Joined:
    Feb 26, 2014
    Posts:
    787
    Can't be, or can it? I mean then you would not be able to have jobs that take multiple frames to complete.

    Assuming you have a job that is supposed to be done exactly once every frame, then my guess is that you actually need to do your setup (job scheduling) as early as possible (using the Script Execution Order settings). Of course you maybe need to wait for some other stuff so you can't start right away at the start of a frame (maybe you still need to wait for physics or so).

    And then you need some point where you collect the results back (aka integrate).
    Which would most likely be LateUpdate() or (if you have a dedicated script for that which is setup with script execution order to run very late) in Update() you call .Complete() on your job handles and then apply the results (setting all the transforms you modified, or applying that texture you worked on, or...)


    At least that's what I'm guessing.
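The "schedule early, collect late" guess above could look roughly like this for a transform job. It is a sketch under stated assumptions: `ScheduleEarlyCompleteLate` and the `targets` array (filled in the Inspector) are hypothetical names, and the stored `JobHandle` is what lets LateUpdate and OnDisable wait for the job.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

// Hypothetical example component.
public class ScheduleEarlyCompleteLate : MonoBehaviour
{
    struct MoveJob : IJobParallelForTransform
    {
        [ReadOnly] public NativeArray<Vector3> velocity;
        public float deltaTime;

        public void Execute(int index, TransformAccess transform)
        {
            transform.position = transform.position + velocity[index] * deltaTime;
        }
    }

    // Assumed to be assigned in the Inspector; must not be null.
    public Transform[] targets;

    TransformAccessArray m_Transforms;
    NativeArray<Vector3> m_Velocity;
    JobHandle m_Handle;

    void OnEnable()
    {
        m_Transforms = new TransformAccessArray(targets);
        m_Velocity = new NativeArray<Vector3>(targets.Length, Allocator.Persistent);
        for (var i = 0; i < m_Velocity.Length; i++)
            m_Velocity[i] = Vector3.up;
    }

    void Update()
    {
        // Schedule as early in the frame as possible and keep the handle.
        var job = new MoveJob { velocity = m_Velocity, deltaTime = Time.deltaTime };
        m_Handle = job.Schedule(m_Transforms);
    }

    void LateUpdate()
    {
        // Integration point: wait here, late in the frame, so at most
        // one job per frame is ever in flight.
        m_Handle.Complete();
    }

    void OnDisable()
    {
        // Complete before disposing the containers the job uses.
        m_Handle.Complete();
        m_Transforms.Dispose();
        m_Velocity.Dispose();
    }
}
```

Script Execution Order can push Update earlier and LateUpdate later, as suggested above, to widen the window in which the job runs.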
     
  28. Krajca

    Krajca

    Joined:
    May 6, 2014
    Posts:
    91
    Thanks!
    I made a small demo. I've combined the code from this topic with this:
    https://forum.unity.com/threads/how-to-draw-mesh-in-the-job-system.513615/
    and I get an improvement from ~110 to ~160 fps just from changing from Transform to Matrix4x4.
    Code (CSharp):
    using UnityEngine;
    using Unity.Jobs;
    using Unity.Collections;
    using UnityEngine.Jobs;

    public class JobRender : MonoBehaviour
    {
        struct ComputeJob : IJobParallelFor
        {
            [ReadOnly] public float deltaTime;
            [ReadOnly] public float scale;

            [ReadOnly] public NativeArray<Vector3> velocity;

            public NativeArray<Matrix4x4> outputMatrix;

            public void Execute(int i)
            {
                var step = velocity[i] * deltaTime;

                var pos = new Vector3(outputMatrix[i][0, 3], outputMatrix[i][1, 3], outputMatrix[i][2, 3]) + step;
                var rot = outputMatrix[i].rotation * Quaternion.Euler(step * 100);

                outputMatrix[i] = Matrix4x4.TRS(pos, rot, Vector3.one * scale);
            }
        }

        public Mesh mesh;
        public int computeSize = 10000, batchSize = 100;
        public float scale = 3;
        public Material mat;

        NativeArray<Matrix4x4> output;
        JobHandle handleCalculate;

        Matrix4x4[] matrices;
        NativeArray<Vector3> m_VelocityArray;

        void OnEnable()
        {
            output = new NativeArray<Matrix4x4>(computeSize, Allocator.Persistent);
            matrices = new Matrix4x4[computeSize];

            m_VelocityArray = new NativeArray<Vector3>(computeSize, Allocator.Persistent, NativeArrayOptions.None);
            for (var i = 0; i < m_VelocityArray.Length; i++)
            {
                m_VelocityArray[i] = UnityEngine.Random.insideUnitSphere;
                output[i] = Matrix4x4.TRS(UnityEngine.Random.insideUnitSphere * 100, Quaternion.identity, Vector3.one);
            }
            output.CopyTo(matrices);
        }

        void OnDisable()
        {
            handleCalculate.Complete();
            output.Dispose();
            m_VelocityArray.Dispose();
        }

        void Update()
        {
            var jobA = new ComputeJob()
            {
                deltaTime = Time.deltaTime,
                scale = scale,

                velocity = m_VelocityArray,
                outputMatrix = output
            };

            handleCalculate = jobA.Schedule(computeSize, batchSize);
        }

        private void LateUpdate()
        {
            handleCalculate.Complete();

            output.CopyTo(matrices);
            Graphics.DrawMeshInstanced(mesh, 0, mat, matrices, 1023);
        }
    }
     
  29. Krajca

    Krajca

    Joined:
    May 6, 2014
    Posts:
    91
    I think support for native arrays in Graphics.DrawMeshInstanced would be nice - one huge copy operation less.
     
    laurentlavigne likes this.
  30. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    Hi, I've changed this script so that it does not use NativeArray, Job.Complete or deltaTime.
    Here is my implementation. LateUpdate now takes 0 ms and jobs can have an independent deltaTime:
    I was able to compute and draw 1,024,000 cubes without lag (divided into 1024 batches).

    [Update] Fixed code from @laurentlavigne

    As delta time is only the time elapsed between two frames, I'm using a Stopwatch, so each thread/job execution has its own deltaTime, and the cubes in the game rotate smoothly.

    And the Profiler: (only 160 fps due to my graphics card; I'm sure that someone with a normal card can get a better frame rate)
    Desktop Screenshot 2018.01.23 - 17.18.09.42.png
     
    Last edited: Jan 23, 2018
  31. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    Doesn't work. I don't think you can access a static from within a job, especially a reftype like Instance.matrices
     
  32. Krajca

    Krajca

    Joined:
    May 6, 2014
    Posts:
    91
    But in your code you draw only 1023 cubes.
     
  33. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    It works. I'm using jobs with instances and references in many projects.
    The limitation exists because a job is a struct. But structs can access any system; that's just how C# works.

    Yes, simply add a new dimension to matrices and velocity (1000 * 1024), for example as a jagged array:
    matrices = new Matrix4x4[1000][]; // each row is then allocated as new Matrix4x4[1024]

    Also, NativeArray is slower than a regular array on Mono (about 40 ms vs 8 ms for an array in my benchmark; not sure about IL2CPP).
    So it's possible to use any array, list or anything else inside jobs. Allocating inside a job works for the moment, but it is not allowed and causes a lag; I'm waiting for feedback from the Unity team on this.
    As I'm working on a lot of data, I've rewritten a list system with pooled arrays at fixed size intervals (8, 16, 32, 64, etc.). Before calling any jobs, I fill every list pool up to XXX elements, so I can use jobs that work with Lists and resizable arrays without any allocation inside.
     
    Last edited: Jan 23, 2018
  34. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    2,394
    Ya I don't really get NativeArray. While you could be writing to data concurrently with jobs, that seems to be an anti pattern. I would treat it more like map/reduce with jobs working on discrete chunks of data split out to a good level of parallelism. So as long as you just don't try to access that data from the main thread until the job is done, there is no need to use cpu instruction based atomic updates/reads, which I'm just guessing is what they might be doing that causes the performance difference.

    Although if you are using a map/reduce approach with the correct width, the performance issue is going to be mitigated as far as time to completion goes.
     
  35. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    the script doesn't work because you forgot indexes all over the place
    example:
    Code (CSharp):
            m_VelocityArray = new NativeArray<Vector3>(computeSize, Allocator.Persistent, NativeArrayOptions.None);
            for (var i = 0; i < m_VelocityArray.Length; i++)
            {
                m_VelocityArray = UnityEngine.Random.insideUnitSphere;
    I was wondering why you bypass the Unity job design pattern. NativeArray being much slower than Mono arrays is odd given the focus on performance. Looking forward to the team's answer on that.

    EDIT: After fixing your code and trying out your way, I have to say your code is simpler because you don't need to feed the job or bother about when the job is done, you just use the data as it's being changed by the job. I like that so I'll do things this way.
     
    Last edited: Jan 23, 2018
  36. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    A version based on @dyox's code that works and displays computeSize cubes:

    Code (CSharp):
    using UnityEngine;
    using Unity.Jobs;
    using Unity.Collections;
    using UnityEngine.Jobs;

    public class JobRenderDyox : MonoBehaviour
    {
        static public JobRenderDyox Instance;

        struct ComputeJob : IJobParallelFor
        {
            public void Execute(int i)
            {
                var step = Instance.m_VelocityArray[i] * ((float)Instance.Watch.ElapsedMilliseconds / 1000f);

                Instance.matrices[i].SetTRS(new Vector3(Instance.matrices[i][0, 3], Instance.matrices[i][1, 3], Instance.matrices[i][2, 3]) + step,
                    Instance.matrices[i].rotation * Quaternion.Euler(step * 100), Vector3.one * Instance.scale);
            }
        }

        public Mesh mesh;
        public int computeSize = 1023, batchSize = 100;
        public float scale = 3;
        public Material mat;
        public System.Diagnostics.Stopwatch Watch = new System.Diagnostics.Stopwatch();

        JobHandle handleCalculate;

        Matrix4x4[] matrices, matrics1023;
        NativeArray<Vector3> m_VelocityArray;

        void OnEnable()
        {
            Instance = this;
            matrices = new Matrix4x4[computeSize];
            matrics1023 = new Matrix4x4[Mathf.Min(computeSize, 1023)];

            m_VelocityArray = new NativeArray<Vector3>(computeSize, Allocator.Persistent, NativeArrayOptions.None);
            for (var i = 0; i < m_VelocityArray.Length; i++)
            {
                m_VelocityArray[i] = UnityEngine.Random.insideUnitSphere;
                matrices[i].SetTRS(UnityEngine.Random.insideUnitSphere * 100, Quaternion.identity, Vector3.one);
            }
        }

        void OnDisable()
        {
            handleCalculate.Complete();
            m_VelocityArray.Dispose();
        }

        void Update()
        {
            Instance = this;

            if (handleCalculate.IsCompleted)
            {
                Watch.Stop();
                Watch.Reset();
                Watch.Start();

                var jobA = new ComputeJob()
                {
                };

                handleCalculate = jobA.Schedule(computeSize, batchSize);
            }
        }

        private void LateUpdate()
        {
            for (int i = 0; i < matrices.Length; i += 1023)
            {
                // The last batch may hold fewer than 1023 matrices, so draw only what was copied.
                int count = Mathf.Min(1023, matrices.Length - i);
                System.Array.ConstrainedCopy(matrices, i, matrics1023, 0, count);
                Graphics.DrawMeshInstanced(mesh, 0, mat, matrics1023, count);
            }
        }
    }
    Performance is swallowed by the constrained copy, I think. Like someone else said, we really need a matrix start index on DrawMeshInstanced:
    Code (CSharp):
    Graphics.DrawMeshInstanced(mesh, 0, mat, matrices, index, length)


    but compared to the other way of doing things... much better!

     
    Last edited: Jan 23, 2018
    dyox likes this.
  37. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    Ha, yes, nice. Sorry, it's from the copy/paste: something went wrong and removed all the [i] indexers.

    But yes, the data is only read by the main thread, and the other threads fill the matrices row by row, so we don't need any locks etc. (This is not true in general, but in this case it's useless to check the data because it's updated so often.)

    Yes, indeed: a start index or a batch count (from 0 to the end of the array in steps of the batch count, max 1023) is needed.
    For example, when working on a 2D array or a massive amount of data, DrawMeshInstanced could split the draw into batches (a proposed API, not an existing overload):

    Matrix4x4[] array = new Matrix4x4[126000];
    Graphics.DrawMeshInstanced(array, ushort batchCount (1000), mesh); etc.

    (I'm thinking about LOD for SpeedTrees etc.: drawing meshes grouped by 1000 elements, plus a final batch of array.Length % batchCount.)
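Until such an overload exists, the batch arithmetic can be done on the caller's side. The helper below is a hypothetical sketch (the `InstanceBatches` class and `Split` method are made-up names, and it assumes a C# 7 compiler for tuple syntax). It only computes the (start, count) ranges; copying each range into a scratch array and calling DrawMeshInstanced is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Unity API.
public static class InstanceBatches
{
    // Per-call instance limit used throughout this thread.
    public const int MaxInstancesPerCall = 1023;

    // Yields (start, count) ranges covering [0, total) in steps of batchSize.
    // The final range holds the remainder when total is not a multiple of batchSize.
    public static IEnumerable<(int start, int count)> Split(
        int total, int batchSize = MaxInstancesPerCall)
    {
        if (batchSize <= 0 || batchSize > MaxInstancesPerCall)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        for (int start = 0; start < total; start += batchSize)
            yield return (start, Math.Min(batchSize, total - start));
    }
}
```

For the 126,000-element array with a batch count of 1000 from the post above, `InstanceBatches.Split(126000, 1000)` yields 126 ranges of 1000 elements each.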
     
    Last edited: Jan 24, 2018
  38. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    I understand now, and we only read from the same index as well so there is no unpredictability - very cool

    Memory transfer becomes the bottleneck in these large sets.
    In another prototype of mine, a compute shader calculates diffusion on large grids. It is very, very fast at doing so, but it takes too long to bring the data back to the CPU. I guess that's what those 10 ms RenderTexture.SetActive entries are (since there is no RenderTexture code anymore). So I'm about to convert the compute shader into C#...
     
  39. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    I think you should take a look at an AsyncGPUCallback system (in the Unity beta, or the DX plugin on GIT).
    I'm generating a real-sized planet at 160 fps on the GPU and reading the data back to the CPU to create meshes without any lag.

    CPU Generate Octree -> GPU Generate/Read HeightMap -> CPU Asyncwait ReadBack -> Jobs Generate Meshes -> MainThread Apply Mesh. (Total Elapsed < 2s)
    It's possible to request data from a compute shader or to use ReadPixels asynchronously; ReadPixels is the fastest and it works on old platforms.
    (It's also possible to share the same RenderTexture between different things, like using one 1024*1024 RenderTexture to generate 4 heightmaps of 512*512 in one go, i.e. batching octree requests.)
     
    Last edited: Jan 24, 2018
    Enrico-Monese likes this.
  40. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    nice pipeline!

    ReadPixels is faster than async? That's not supposed to happen :D
    I've used async a lot and I'm seeing spikes, but they seem to be normal bandwidth limitations. How much data are you sending to the GPU and reading back?

    I'm about to abandon compute shaders because of transfer speed. Did you try generating the heightmap in a job?
     
    dyox likes this.
  41. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    No, I mean async ReadPixels is faster than ComputeShader async GetData.

    I'm sending and reading back a 256*256*60 heightmap plus a 60*128*128 normal map per frame using RenderTexture/async ReadPixels, at 160 fps.
    Desktop Screenshot 2018.01.24 - 01.59.39.26.png

    Yes: (4096*4096 with biomes)
    -Unity 5.6 : 41s Mono 3.5
    -Unity 2018 : 34s Mono 3.5 Jobs
    -Unity 2018 : 21s IL2CPP+Jobs

    Complete Planet
    -GPU RenderTexture : 2s (be careful with the texture format; I'm using Alpha8 or RGB32 depending on the target, heightmap or normal map)
    Last edited: Jan 24, 2018
  42. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    2,027
    so does this mean I need to format my compute shader to use RWTexture instead of RWStructuredBuffer?

    7864320 bytes/frame? that's 1.2GB/s, seems like a lot, I'm reading 600x600x4x4 and I see it on the profiler. I'll try your rendertexture, I guess you apply a traditional shader on it, not a compute shader?

    what a difference! I remember now that I stopped using rendertextures because I couldn't figure out texture format, that's why I switched to compute buffer get data
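
    As a quick check of the figure quoted above, taking the 7,864,320 bytes per frame and the 160 fps stated earlier in the thread as given (the byte count per frame is the poster's own estimate, not a measured value):

```latex
\[
7\,864\,320\ \tfrac{\text{B}}{\text{frame}} \times 160\ \tfrac{\text{frames}}{\text{s}}
= 1\,258\,291\,200\ \tfrac{\text{B}}{\text{s}}
= 1200\ \text{MiB/s} \approx 1.17\ \text{GiB/s} \approx 1.26\ \text{GB/s}
\]
```

    For comparison, the 600x600x4x4 readback mentioned in the same post is 5,760,000 bytes per readback.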
     
  43. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    Ha, yes, sorry, you're right. It's not 60x but a lot less. I just added a log for the test, but it counts the number of render calls and async waits. So divide it by 4 or more (depending on how many noise functions are used).
     
    laurentlavigne likes this.
  44. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    I'm not sure, but from what I know:

    NativeArray :
    -Explicit control over memory allocation/disposal
    -Does not use the garbage collector (fewer GC spikes)
    -Ready for the future job compiler optimizations
    -Fastest access on IL2CPP (no bounds check ?)
    -Thread-safety system : conflict information in the editor ([ReadOnly]/[WriteOnly])
     
    laurentlavigne likes this.
  45. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    4,712
    In the editor, NativeArray does extra checks to prevent race conditions (for example: is there a job writing to the same array that I am currently reading from?). Also, mono has JIT optimizations for builtin array reads. So in mono, builtin arrays are faster than NativeArray.

    When we deploy a player, however, we strip out all the checks. We are aiming to make IL2CPP performance of NativeArray faster than builtin arrays. In mono JIT, it would be nice if we get there, but that will take a lot longer.

    In the long run, neither is very important, because really all performance-sensitive code will be compiled with burst. When comparing NativeArray in burst vs arrays in mono, we generally see massive speedups.

    So it's unfortunate that NativeArray is not as fast as arrays in mono specifically, but when deployed in the game via IL2CPP it will be better than builtin arrays. And with burst coming up we will naturally see massive speedups. In burst you can only use NativeArray; builtin arrays are not supported.

    Right now, access to static variables from C# jobs is not checked by static analysis. Relying on that is clearly a hack and quite dangerous. In the future it will be checked and will give you a compile error. So I think the only sensible thing to do is to use NativeArray already now.

    Editor-only main thread code only runs in mono, of course, so to get speedups there you need burst. This is a bit unfortunate until burst ships, but I think you just have to trust me that burst will solve this and kick some performance ass.
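
    A minimal sketch of the recommendation above: passing a lookup table into a job as a [ReadOnly] NativeArray instead of reading it from a static field. The names LookupJob and LookupJobExample and the table contents are hypothetical, chosen only for illustration; this is not code from the thread.

```csharp
using Unity.Collections;
using Unity.Jobs;

// Hypothetical job: reads from a shared lookup table, writes one result per index.
struct LookupJob : IJobParallelFor
{
    // Declared read-only so several jobs may read the same table in parallel.
    [ReadOnly] public NativeArray<float> table;
    [ReadOnly] public NativeArray<int> indices;

    // Containers are assumed read/write unless marked otherwise.
    public NativeArray<float> output;

    public void Execute(int i)
    {
        output[i] = table[indices[i]];
    }
}

static class LookupJobExample
{
    public static void Run()
    {
        // Illustrative data only; a real table would be built once and kept alive.
        var table = new NativeArray<float>(new float[] { 0f, 1f, 4f, 9f }, Allocator.TempJob);
        var indices = new NativeArray<int>(new int[] { 3, 0, 2, 1 }, Allocator.TempJob);
        var output = new NativeArray<float>(indices.Length, Allocator.TempJob);

        var job = new LookupJob { table = table, indices = indices, output = output };

        // Schedule over all indices in batches of 64, then wait for completion.
        job.Schedule(indices.Length, 64).Complete();

        // Native containers are not garbage collected and must be disposed manually.
        table.Dispose();
        indices.Dispose();
        output.Dispose();
    }
}
```

    The table is copied into native memory once; after that, jobs only receive a handle to it, so nothing is copied per job.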
     
  46. dadude123

    dadude123

    Joined:
    Feb 26, 2014
    Posts:
    787
    Is the burst compiler different from that? Or does it require IL2CPP?
    In my game I can not use IL2CPP because I need to be able to load DLLs, and compile code at runtime; and I don't think these limitations will disappear anytime soon. (And my requirements won't change either, dynamic code compiling is the basis for many unique features in my game)

    Will the burst compiler work in normal builds?
     
  47. dyox

    dyox

    Joined:
    Aug 19, 2011
    Posts:
    542
    Hi, so you're saying that all algorithms that use lookup tables, and that are commonly used with multithreading, will not be allowed, and that any use of a static field inside normal C# code will produce a compiler error (on Unity)?

    So my questions are:
    -When working on a C# project, will we need to change and adapt our code and threading depending on whether it runs on Unity or on a dedicated server using C#?
    -Will all algorithms using lookup tables need a NativeArray or NativeList, even if access is slower than with a simple array?
    -On mono, the JIT creates an allocation on first access to a field (for example byte[][]). So do we need to create all static fields and lookup tables as native arrays, even if that slows down all projects on mono, or worse, copy data to jobs from the main thread?
    -If we don't use IL2CPP, will we need to stay on mono with native arrays to use jobs?
    -Jobs will use XX threads : (XX * CpuCount - 1 for jobs) + (System.Thread or Task). That total number of threads will decrease the performance of the game and impact FPS. Why not provide access to the job system in a way that lets us make proper use of all of Unity's existing threads?
    -Jobs are made for performance, but native arrays are not performant at all (40ms vs 8ms : NativeArray/Array on mono).
    -No allocation possible inside a job? What is the point of a multithreading system that cannot interact with any other system? And why limit job duration to 4 frames? Is it not possible to spread a job's Execute() over multiple frames?

    If we need to send data to jobs, do we have only one choice: copy the data from existing systems on the main thread at job start, then count frames from job start up to 4 and call Job.Complete()?
     
    Last edited: Jan 25, 2018
  48. dadude123

    dadude123

    Joined:
    Feb 26, 2014
    Posts:
    787
    You can use a using statement: "using Dictionary = NativeHashMap;" if you are concerned about cross-platform code.

    But he just said that in a build those differences disappear.

    From what I understand you always need to use native arrays if you want to use the jobs system.
    Creating a build with IL2CPP will not magically make that requirement go away, right?

    The point is to get multi-core utilization as well as massive speedups. The NativeArray access times are just slower in the editor to provide all sorts of useful checks.
    Also, you are not limited to 4 frames at all, just use persistent allocation mode.

    Not being able to interact with other systems is actually not at all unreasonable. The whole point of this thing is to (eventually, with the ECS) utilize processor cachelines in the most efficient way possible. So using other data as well (so meaning non-linear memory layout) is completely counter-productive.
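
    A minimal sketch of the persistent-allocation approach mentioned above, assuming a job that may take several frames to finish. The component name LongRunningJobExample, the job FillJob, and the array size are hypothetical and only serve the illustration.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical component: schedules one job and polls it across frames.
public class LongRunningJobExample : MonoBehaviour
{
    struct FillJob : IJob
    {
        public NativeArray<float> results;

        public void Execute()
        {
            // Placeholder workload; a real job would do something expensive here.
            for (int i = 0; i < results.Length; i++)
                results[i] = i * 0.5f;
        }
    }

    NativeArray<float> results;
    JobHandle handle;
    bool running;

    void Start()
    {
        // Persistent memory stays valid until Dispose() is called,
        // so the job is not tied to a fixed number of frames.
        results = new NativeArray<float>(100000, Allocator.Persistent);
        handle = new FillJob { results = results }.Schedule();
        running = true;
    }

    void Update()
    {
        // Poll instead of blocking the main thread.
        if (running && handle.IsCompleted)
        {
            // Complete() is still needed before the main thread reads the array.
            handle.Complete();
            running = false;
            Debug.Log(results[results.Length - 1]);
        }
    }

    void OnDestroy()
    {
        // Make sure the job is finished before freeing its memory.
        handle.Complete();
        if (results.IsCreated)
            results.Dispose();
    }
}
```

    The trade-off is that persistent allocations are slower to create than temporary ones, so this pattern suits buffers that are allocated once and reused.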
     
    Krajca likes this.
  49. Peter77

    Peter77

    Joined:
    Jun 12, 2013
    Posts:
    4,104
    Is this true even if the Player is marked with the "Development" flag?

    If threading issues cause the Player to hang or silently crash, it would be more valuable to us to have a "Development" Player with slower performance, but proper error reporting.

    Our QA is testing on the target hardware, they do not play the game in the editor. Programmers might not play enough of the game in the editor to run into the specific cases that would trigger a threading issue.

    It would be counter productive for us if QA suddenly starts reporting many crash/freeze bugs in "Development" builds, perhaps even without error details due to lack of internal engine checks, since it would block their work and debugging Player only crashes is often more difficult.

    In such case, we'd perhaps need to find a new workflow, like QA testing the game in Editor and Player, causing testing costs to double. However, testing "in-editor" traditionally was never valuable to us, because no customer is going to run the game in the editor anyway.

    Having the same error checks in a development Player, that are performed in the editor, does seem beneficial to me at the time of writing.
     
    Last edited: Jan 25, 2018
    Tudor-Nita likes this.
  50. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    2,394
    I think most of the logic makes sense. I'd rather not see the old MS approach go into play here though, where we are arbitrarily restricted just because the authors think it's best we do things as they intend, without any actual technical justifications.

    Like reading data that other threads are writing is actually just fine in some cases. We do it all the time on the server. Most of the .NET concurrent collections do it. There are very valid reasons in different contexts to do that. Just because some people will screw it up doesn't mean you should flat out restrict it. Now if the job system has valid reasons for not doing it, like it would mess up internal structures in unexpected ways (unexpected as in not following CLR rules), then fine, that would make sense.