I tried Unity's Job System the last 2 days. Here is what I found out

FM-Productions · Mar 18, 2018

Hi everyone,

as many of you already know, the Job System has been usable since the Unity 2018.1 beta. It allows developers to run code safely across multiple threads. I thought it might be useful to share the information I gathered while trying the Job System. Some code examples given here are not from my game, but written off the top of my head simply to illustrate a point. I have also posted this on reddit, you can view it here: (https://www.reddit.com/r/Unity3D/comments/859rmv/i_tried_unitys_job_system_the_last_2_days_here_is/)
I also apologize for any grammatical errors or formatting mistakes in my text, but let's start:

I successfully made use of the job system in the past 2 days. Inspired by this great video, I used it to resolve detailed hitbox collisions between 2 characters for my fighting game. When I have about 16 characters on the screen and they stand next to each other, the collision resolution between 2 characters gets called up to 92 times per frame. The worst case would be 16 * 15 = 240 checks (each character against every other one), but there are a few checks I make (like bounding box intersection) before resolving hitbox collisions. With that many calls per frame, this seemed like a pretty good use case for the job system. Here is what I found out:

- Jobs use the worker threads really efficiently. If you schedule all your jobs carefully, the jobs use all the available worker threads.

- Jobs are structs, and there are a few things you have to consider when working with structs. Unlike classes, they are passed by value and not by reference. Consider having a list of structs (Edit: I previously wrote about arrays, but that was wrong. This actually occurs with a List). Accessing a property like this:
list[index].someProperty
actually returns a copy of the struct stored in list[index] and then reads the property of that copy, because using the index accessor on a list is a function call and the accessed object is its return value.
With arrays, it is valid to change a property of a struct by accessing it via the index:
array[index].someProperty = 6; // valid for arrays in C#, but a compile error (CS1612) for lists
You are also able to call functions on the structs in the array. If, for example, you call
array[index].ChangeInternalValue();
and the function alters values inside the struct, they are altered for the struct in array[index] (no copy is made). There is a blog post about this behaviour with lists here: https://generally.wordpress.com/2007/06/21/c-list-of-struct/

- Since jobs are structs, you have to be careful when passing them around. When a struct has a huge number of member variables, copying it is expensive. So watch out for places where you perform a copy without intending or needing to. If you have to access a struct stored in a list more than once, it is better to store a copy of the struct in a local variable:
list[index].someProperty; // makes a copy of the struct on every access and then reads someProperty of the copy
var item = list[index]; // better: you only perform one copy and can then access the copied struct directly without further copies
item.someProperty;
item.someOtherProperty;
- You can pass a struct as a parameter to a function without having .NET copy it by using the ref keyword in your method. You then pass a reference to the struct, and since the struct is not boxed (converted into an object) either, it should be pretty performant. Here are some links:
https://stackoverflow.com/questions...when-passed-to-a-method-using-the-ref-keyword
https://stackoverflow.com/questions/7566939/how-to-make-a-reference-to-a-struct-in-c-sharp
void Method(ref MyStruct param)
{
    // param refers to the caller's struct, no copy is made
}
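To make the struct points above concrete, here is a small sketch in plain C# (no Unity needed). Hitbox and StructCopyDemo are made-up names for illustration only, not code from my game. It shows that a List hands out copies, an array element is modified in place, and ref passes the struct without copying it:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical example struct, not taken from the game.
public struct Hitbox
{
    public float Radius;
    public void Grow(float amount) { Radius += amount; }
}

public static class StructCopyDemo
{
    // ref passes a reference to the caller's struct, so no copy is made.
    public static void Scale(ref Hitbox box, float factor)
    {
        box.Radius *= factor;
    }

    // Returns { radius stored in the list, radius stored in the array }.
    public static float[] Run()
    {
        var list = new List<Hitbox> { new Hitbox { Radius = 1f } };
        var array = new[] { new Hitbox { Radius = 1f } };

        // list[0].Radius = 6f;   // does not compile (CS1612): the indexer returns a copy
        Hitbox copy = list[0];    // one explicit copy
        copy.Radius = 6f;         // only changes the copy
        list[0] = copy;           // write the modified copy back into the list
        list[0].Grow(1f);         // compiles, but only mutates a temporary copy

        array[0].Radius = 6f;     // array elements are variables: modified in place
        array[0].Grow(1f);        // also in place
        Scale(ref array[0], 2f);  // the method works on the element itself

        return new[] { list[0].Radius, array[0].Radius };
    }
}
```

Calling StructCopyDemo.Run() should give 6 for the list entry (the Grow call on the list is lost) and 14 for the array entry (6, then 7, then doubled).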
- Ideally, you use the job system with data oriented design in mind. That means you have a data layout that is efficient for the CPU to access, for example one array for each variable in an operation. So if you want to calculate the global position from the local position, you have the operation:
Vector3 globalPos = parentTransform.rotation * localPosition + parentTransform.position;
Ideally you have a separate NativeArray for the parentTransform.rotation values, one for the localPositions, one for the parentTransform.positions and another one for the results. Such a data structure only makes sense if you have many operations like this. This approach reduces cache misses. Accessing a variable in RAM is much more expensive (in terms of CPU cycles) than accessing a variable in a CPU cache. There is a very good talk that also covers the topic of accessing variables from RAM vs from the cache. I recommend watching it if you plan to use the job system.
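A minimal sketch of that layout, assuming plain C# arrays and the System.Numerics types in place of NativeArray and Unity's Vector3/Quaternion so that it runs outside Unity. GlobalPositionDemo and its parameter names are hypothetical:

```csharp
using System.Numerics;

public static class GlobalPositionDemo
{
    // One array per variable, so the loop walks through each array sequentially.
    public static Vector3[] ComputeGlobalPositions(
        Quaternion[] parentRotations,
        Vector3[] parentPositions,
        Vector3[] localPositions)
    {
        var results = new Vector3[localPositions.Length];
        for (int i = 0; i < results.Length; i++)
        {
            // Vector3.Transform(v, q) rotates v by q; it plays the role of
            // Unity's "rotation * vector" in this stand-in.
            results[i] = Vector3.Transform(localPositions[i], parentRotations[i])
                         + parentPositions[i];
        }
        return results;
    }
}
```

In a job, the four arrays would become NativeArray fields of the job struct and the loop body would move into Execute(int index).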

- There is a little overhead when scheduling the jobs on different threads. It is small but worth mentioning.

- Keep the variables your job uses separated from the variables you use on the main thread. This seems like common sense, but I'll still mention it. If you try to read and write a variable from different threads simultaneously, you'll run into errors or at least unexpected behaviour, unless you have implemented a decent locking system that prevents the other threads from accessing the variable while it is being used by a thread.

- NativeArrays only support structs that use blittable types. Essentially, these are types that have the same representation in managed and unmanaged memory and do not require marshalling. Here is a link: https://docs.microsoft.com/en-us/dotnet/framework/interop/blittable-and-non-blittable-types
I was surprised that boolean variables are not supported, so I stored an integer instead. I treat 0 as false and every other value as true.

- NativeArrays have to be manually deallocated. Depending on their Allocator setting, you either have to call Dispose() on them at the end of the frame or, if you allocated a persistent NativeArray (Allocator.Persistent), at the end of its intended lifetime (when you are switching to another scene or closing the application, for example). Not doing so will cause a memory leak. You also want to make sure that the application can always dispose the NativeArray. So let's say you allocated such an array, but while using it you run into an exception. Unity would log the error and exit the function. You should wrap the code where an exception is possible in a try/catch block (or try/finally, so the cleanup runs in both cases), and in that block you can check whether the NativeArray has been created and, if so, call Dispose() on it.

- NativeArrays support nested structs, as long as all structs involved only use blittable data types. You can, for example, use your own structs. One might look like this:
public struct DataContainer
{
    public Vector3 position;
    public float radius;
    public int instanceId;
}
- NativeArrays have the functions CopyTo and CopyFrom. You can copy array contents from your game variables to the NativeArray or store them back from the NativeArray into your actual game array. But as with structs, copying arrays with a huge amount of data is really expensive. For each of my collision jobs, I had to copy the hitbox/hurtbox data of characterA (the attacking character) and characterB (the character expected to receive damage). Since I only use spheres for hitboxes, this means I copied the data defining the hitbox spheres (around 30 per character) plus the active hitboxes with some additional properties for each character. Copying these values negated the performance increase from using the jobs. So sadly, I cannot use the job system efficiently like this, because running those functions single threaded takes less CPU time.

- An ideal use case is to start an operation with jobs at the start of the frame, do something else in the main thread in between and apply the results of the jobs at the end of the frame. So the jobs have plenty of time to execute.

- There are the interfaces IJob and IJobParallelFor. You can look up the reference on these pages (including examples):
https://docs.unity3d.com/2018.1/Documentation/ScriptReference/Unity.Jobs.IJob.html
https://docs.unity3d.com/2018.1/Documentation/ScriptReference/Unity.Jobs.IJobParallelFor.html
Your job struct has to implement one of those interfaces. IJob is used for a single Execute function that runs on one thread. Several IJobs run in parallel with each other if they are scheduled at the same time. IJobParallelFor is for functions that iterate over large arrays. Each function call represents one action on a certain array index. For example:
public void Execute(int index)
{
    position[index] = rotations[index] * localPosition[index];
}
With jobs implementing IJobParallelFor, each Execute call can happen on a different thread. So make sure you do not use any shared variables from outside the job inside Execute:
Quaternion[] quaternions = ...

public void Execute(int index)
{
    position[index] = rotations[index] * localPosition[index];
    foreach (Quaternion q in quaternions)
    {
        // Bad idea, do not do this. Execute(index) runs on different threads simultaneously, meaning that
        // you are potentially accessing the same elements of the quaternions array from different threads at the same time.
    }
}
- You can make jobs depend on each other. Job.Schedule() returns a JobHandle (which you can use to complete the job manually at a later time), and you can pass such a handle as a dependency to the Schedule function of another job. This way, the job only executes once the job it depends on has finished. When you have chained job dependencies, you only have to call JobHandle.Complete() on the last job handle in the chain. The chain is then resolved from the first job to the last, and you don't have to call Complete() on every job in the chain.
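
As a rough sketch of how several of the points above fit together (IJobParallelFor, a dependency between two jobs, doing other work before Complete(), and disposing the NativeArrays even if something throws): RotateJob, OffsetJob and JobChainExample are made-up names, the array size and batch count of 64 are arbitrary, and the script only runs inside Unity 2018.1 or newer when attached to a GameObject. Treat it as an illustration under those assumptions, not as the code from my game.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// First job: rotate every local position. Inputs are marked [ReadOnly].
public struct RotateJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<Quaternion> rotations;
    [ReadOnly] public NativeArray<Vector3> localPositions;
    public NativeArray<Vector3> rotated;

    public void Execute(int index)
    {
        rotated[index] = rotations[index] * localPositions[index];
    }
}

// Second job: add an offset to the results of the first job.
public struct OffsetJob : IJobParallelFor
{
    public Vector3 offset;
    public NativeArray<Vector3> positions;

    public void Execute(int index)
    {
        positions[index] = positions[index] + offset;
    }
}

public class JobChainExample : MonoBehaviour
{
    void Start()
    {
        const int count = 1024;
        var rotations = new NativeArray<Quaternion>(count, Allocator.TempJob);
        var locals = new NativeArray<Vector3>(count, Allocator.TempJob);
        var results = new NativeArray<Vector3>(count, Allocator.TempJob);
        JobHandle chain = default(JobHandle);
        try
        {
            for (int i = 0; i < count; i++)
            {
                rotations[i] = Quaternion.identity;
                locals[i] = Vector3.forward;
            }

            var rotate = new RotateJob { rotations = rotations, localPositions = locals, rotated = results };
            var offset = new OffsetJob { offset = Vector3.up, positions = results };

            JobHandle rotateHandle = rotate.Schedule(count, 64);
            // OffsetJob only starts once RotateJob has finished.
            chain = offset.Schedule(count, 64, rotateHandle);

            // ... do other main thread work here while the jobs run ...

            chain.Complete(); // completing the last handle resolves the whole chain
            Debug.Log(results[0]);
        }
        finally
        {
            chain.Complete(); // make sure no job still uses the arrays
            if (rotations.IsCreated) rotations.Dispose();
            if (locals.IsCreated) locals.Dispose();
            if (results.IsCreated) results.Dispose();
        }
    }
}
```

The finally block is there so that the arrays are released both on the normal path and when an exception is thrown before Complete() is reached.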

So what are the main takeaways from these findings:
- Always use the profiler to confirm whether you actually gained or lost performance.
- Copying large amounts of data is expensive. If you have many jobs that have to copy a large amount of data before they can be executed, it can be more efficient to simply run the logic on the main thread.
- Design your game with data oriented design in mind to make use of the Job system even more efficiently.
- Make yourself familiar with the nuances of structs in C# if you are unfamiliar with them.
- Schedule your jobs early in the frame and use the result later in the frame. In the meantime, do some work on the main thread and process the job results afterwards. This way, the main thread does not just wait for the jobs to finish, but can do work of its own until they are done.

What exactly were my results:
- When the hitbox collision resolution function was called 12 times per frame, my old single threaded approach took only about 50% of the time the job system approach took.
- When the hitbox collision resolution function was called 92 times per frame, the difference in execution time was only around 30%, with the single threaded approach still being faster.
- With those results, I can say that the more function iterations you have, the better the code execution scales when using the job system.
- The calculations themselves perform pretty well on the worker threads when using the job system. What holds me back is the data that has to be copied to the jobs before running them. This operation takes over 1 ms for my 92 iterations.
- I scheduled the jobs and then immediately waited for them to complete, so the main thread effectively just waited. You might not want to do this, but often the function executions that follow depend on the results of previous function executions. I'd have to restructure my game in order to use the job system more efficiently.

I hope I could help some of you with the knowledge I gathered from trying the system! If you doubt any of the points mentioned or want to correct me on something, feel free to do so!

Edit:
I finally made a little benchmark comparing the character-character hitbox collision functions inside the Editor, in the Mono standalone build and in the IL2CPP standalone build. It is ridiculous that the standalone builds run the function more than 4 times as fast! This time I ran 82 character-character collision checks per test. I noted 4 sample values for each test:

82 character-character hitbox collision checks, measured in processor ticks:

Inside the editor:
Single threaded: 15057, 14532, 14201, 14490
Multi-threaded jobs: 25662 (2 ms), 28836, 24788, 27680

Mono standalone build (Graphics Quality: Good):
Single threaded: 3144, 3239, 3272, 3166
Multi-threaded jobs: 3619, 3417, 3592, 3418

IL2CPP build (Graphics Quality: Good):
Single threaded: 3094, 3087, 3034, 3098
Multi-threaded jobs: 3063, 4083, 3364, 3033

My job system approach still makes copies of all the hitbox data. I may be able to structure the data differently to avoid copying a large amount of data for each job and save a few ticks in the next iterations.

hippocoder · Mar 18, 2018

Did you test standalone builds? It is pointless to bench jobs in the editor due to the amount of extra checks Unity does for it.

FM-Productions · Mar 18, 2018

hippocoder said: ↑

Did you test standalone builds? It is pointless to bench jobs in the editor due to the amount of extra checks Unity does for it.

Fair point, I only tested it in the Editor. I have now added this to my post. Actually, I don't really know how to benchmark only the critical section of the code (the character collision resolution). I could make a UI button that toggles between the single threaded collision checks and the collision checks with the job system while leaving everything else in the code the same. Then I compare the framerate. Or is there a better way to do it?

hippocoder · Mar 18, 2018

I've no idea, I'm just getting into this myself. I intend to use Jobs exclusively with ECS (they have an incredibly complementary pattern).

FM-Productions · Mar 18, 2018

From this article:

At GDC 2018, Unity will dispel that notion by publicly releasing a preview of its new core engine technology: “We are going to release the Entity Component System, the C# Jobs System and the Burst compiler at GDC,” Ante reveals.

I am also excited about the new ECS. It is possible to design your data format in a way that complements the Job system without the new ECS, but I admit that this is a lot harder with the current systems Unity has. Ideally you only have data containers, with the rendering and other functionality completely separated from the data. Certain systems (like a rendering manager, for example) operate on this data and perform certain actions. Right now I am solving this by using classes that do not use any of the Unity components. I link the GameObjects that relate to the data with an ID and have a function that takes the visual information from the data struct and applies it to the GameObject. I'm only at the start of building this system and it is probably better to wait for the ECS, since it is perfectly suited for this use case.

hippocoder · Mar 18, 2018

It's just worth breaking your rules and getting with Unity on this one. I'm having to do it and the faster than optimised C++ performance is just too logical to pass up. Personal preference goes out of the window and you just embrace the whole gift Unity and team are giving us.

This is so game changing that I expect most studios to start drifting over for various games.

sngdan · Mar 18, 2018

- In a build, you run through the different test cases sequentially (of course a button to click would also work), each having its own GUI text to track the result.
- You can simply use a timer (i.e. a stopwatch) to measure the execution time in ticks / ms and output it to the GUI text.
- I tried both the job system and threads, but contrary to what I indicated in this thread (https://forum.unity.com/threads/tho...tyle-collision-detection.517400/#post-3404431), I did not work on this any further.
- The reason I did not look into it further is that I am also waiting for the ECS release, because the two go hand in hand, and also because it seems the job system needs the Burst compiler and some other features to match / outperform threads at the moment (https://forum.unity.com/threads/job-system-not-as-fast-as-mine-why-not.518374/)
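
A rough sketch of the stopwatch idea, in plain C# so it works the same in the Editor and in a build. TickBenchmark is a hypothetical helper name, and the action you pass in would be whichever variant (single threaded or jobs) you want to time:

```csharp
using System;
using System.Diagnostics;

public static class TickBenchmark
{
    // Runs the action `runs` times and returns the elapsed Stopwatch ticks of each run.
    public static long[] Measure(Action action, int runs)
    {
        var samples = new long[runs];
        var watch = new Stopwatch();

        action(); // warm-up run so one-time costs (e.g. JIT) are not measured

        for (int i = 0; i < runs; i++)
        {
            watch.Reset();
            watch.Start();
            action();
            watch.Stop();
            samples[i] = watch.ElapsedTicks;
        }
        return samples;
    }

    // Stopwatch ticks depend on the machine, so convert via Stopwatch.Frequency.
    public static double TicksToMilliseconds(long ticks)
    {
        return ticks * 1000.0 / Stopwatch.Frequency;
    }
}
```

Usage could look like `long[] samples = TickBenchmark.Measure(RunCollisionChecks, 4);` where RunCollisionChecks stands for your own method, with the samples then written to the GUI text.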

FM-Productions · Mar 18, 2018

@hippocoder
I finally did a little benchmarking and added it to the main post above. In the standalone builds, the job system approach is actually nearly as fast as my single threaded approach.

sngdan · Mar 18, 2018

If I understood your 1st post, you are in a parallel for scenario, right?

It would be interesting what benchmarks you get for System.Threading.Tasks.Parallel.For
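
For reference, a minimal sketch of what such a Parallel.For version could look like. Sphere and ParallelForDemo are made-up names, the overlap test is a plain sphere-sphere check and not the actual collision code from the first post, and in Unity this assumes the .NET 4.x scripting runtime. Both input arrays are only read and every iteration writes to its own index of the result, so nothing has to be copied or locked:

```csharp
using System.Threading.Tasks;

// Hypothetical sphere data, only blittable fields.
public struct Sphere
{
    public float X, Y, Z, Radius;
}

public static class ParallelForDemo
{
    // For every hurtbox, reports whether it overlaps any of the hitboxes.
    public static bool[] AnyOverlap(Sphere[] hitboxes, Sphere[] hurtboxes)
    {
        var hit = new bool[hurtboxes.Length];
        Parallel.For(0, hurtboxes.Length, i =>
        {
            Sphere b = hurtboxes[i];
            for (int j = 0; j < hitboxes.Length; j++)
            {
                Sphere a = hitboxes[j];
                float dx = a.X - b.X, dy = a.Y - b.Y, dz = a.Z - b.Z;
                float r = a.Radius + b.Radius;
                // Compare squared distances to avoid a square root.
                if (dx * dx + dy * dy + dz * dz <= r * r)
                {
                    hit[i] = true; // each iteration only writes its own index
                    break;
                }
            }
        });
        return hit;
    }
}
```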

Edit: To be honest, I did not read your post very carefully (too much text), but it is likely that the issue you are describing (having to copy a hitbox array) is (a) a design issue and (b) goes away if you use tasks, even if you leave it as is (you only read, right?). I am by no means on an advanced level with C# jobs / tasks, but when I played with them the other day, it seemed that tasks are less "managed" and therefore allow you some more stuff at the risk of breaking things.

FM-Productions · Mar 18, 2018

I plan to make an implementation where I use my hitbox arrays in different jobs without copying them (yes, they are read-only, so it might work). My data structure is a little messy right now, but I made sure to provide an appropriate data structure for the jobs: I convert the hitbox/hurtbox data from the structures I use in my game to structures that are efficient for the job system to use. This "conversion" only takes around 0.02 ms per frame.
I am sure that I can get better results than with my single threaded approach if I change the design adequately, but it is not that easy.

LaneFox · Mar 18, 2018

Interesting findings. The perf difference seemed negligible in the end for your use case, but I'm happy to see a new practical application of this workflow and that it can scale well.

Thanks for taking the time to share.

Joachim_Ante · Mar 20, 2018

- Copying large amounts of data is expensive, if you have many jobs that have to copy a large amount of data before they can be executed, it can be more efficient to simply run the logic on the main thread.

Indeed. Copying a lot of data on the main thread to then do small amounts of computation in a job is not the way to use the Job System. The goal should be to scale linearly with the number of cores you have.

This is in fact why we built the Entity Component System, because ultimately if you want to get great performance you have to keep your data in the right data layout so that the C# job system can actually take advantage of it.

The very important point here is also that it's really all about the combination of C# jobs, Burst and Entity Component System. Multithreading, Optimal machine code and data layout.

If you want to get massive performance improvements, those things need to be well balanced. For example, SIMD instructions without perfectly packed data usually result in being bandwidth bound, and SIMD doesn't give you huge benefits.

So I think if you are really interested in exploring how to write high performance code, please check out the Entity Component System documentation and existing sample code.
https://forum.unity.com/forums/entity-component-system-and-c-job-system.147/

And really try to dig into going with the flow of working with IComponentData / JobComponentSystem etc.

Best way to help out is to actually post sample code etc. We'd be happy to give advice on specific examples of how to write great Component System code.
