How to do mesh generation in parallel jobs

pbhogan · Sep 25, 2019

I am procedurally generating a mesh in a job. The job produces a series of vertices and in the single threaded case just adds them to a NativeList. When the job is done the vertices are set on the mesh.

However, the algorithm is parallelizable, and for each parallelizable unit (call to Execute( int index )) a variable number of vertices may be generated.

My question is, how do I most efficiently get these out of a parallel job? Adding to a NativeList is no longer an option because it can't support concurrent writes.

My first approach was to use a preallocated NativeArray and have each thread use an atomic counter to safely increment the indices into which to write in the vertices. However, I ran into the problem that IJobParallelFor restricts writing to the specific index provided. And, as @Joachim_Ante pointed out (https://forum.unity.com/threads/how...index-with-the-jobsystem.519981/#post-3411614), performance would probably suffer due to cache line issues on the shared atomic counter.

I'm considering trying to create a native container that uses the NativeContainerNeedsThreadIndex attribute and keeps a list per thread ID which it can add to safely. Then when it's done, the individual lists can be merged into a single array/list at the cost of some memory and copying.

This seems like something that would be a common problem in general... that is, having a parallel job producing a list of elements concurrently. Has anyone found a good way to handle this?

tertle · Sep 25, 2019

Not sure I'd bother. Unless this is a huge mesh generated by an extremely slow algorithm the overhead might end up being faster in a single thread anyway.

I always just generate 1 mesh per thread.

Anyway the solution is usually to use a native queue or steam with a custom triangle struct which holds 3 points, 4 uvs etc to ensure order is not lost. Then convert to a list in another job.

pbhogan · Sep 25, 2019

tertle said: ↑

Not sure I'd bother.
Click to expand...

Maybe not, and I might just optimize it later rather than prematurely going down a rathole. But, as an experiment, manually splitting the workload into JobsUtility.JobWorkerCount jobs with separate output native lists and then concatenating them before assigning to the mesh yields about a 10-15% performance improvement. So... maybe worth it? ¯\_(ツ)_/¯

I suspect any gains would be lost using queue or stream with the inherent performance loss they have over a list/array, combined with the conversion cost. But I might try them for comparison.

Thanks for the input though. Some of your other posts in various places about mesh generating were very helpful.

DreamingImLatios · Sep 26, 2019

pbhogan said: ↑

I suspect any gains would be lost using queue or stream with the inherent performance loss they have over a list/array, combined with the conversion cost
Click to expand...

They might be slower than writing to an array, but they definitely are not slower than NativeList.Add, at least not by a factor that should encourage you to bend over backwards to avoid. Both NativeQueue and NativeStream have cache-efficient writes and essentially do the same thing you do with the manual NativeLists except a parallel job has load balancing characteristics. As for conversion, it is fast as long as you do it in a job with Burst (a lot of people on these forums forget that part and conclude conversion is slow). And if your data is that massive, you could even write a custom NativeStream converter that runs in a parallel job.

sngdan · Sep 26, 2019

The best solution is to do something independent in parallel. This is something that should be highlighted more. Unless for stress testing a single job, running it parallel can be a later optimization in case threads are not filled up

pbhogan · Sep 26, 2019

DreamingImLatios said: ↑

They might be slower than writing to an array, but they definitely are not slower than NativeList.Add, at least not by a factor that should encourage you to bend over backwards to avoid. Both NativeQueue and NativeStream have cache-efficient writes and essentially do the same thing you do with the manual NativeLists except a parallel job has load balancing characteristics. As for conversion, it is fast as long as you do it in a job with Burst (a lot of people on these forums forget that part and conclude conversion is slow). And if your data is that massive, you could even write a custom NativeStream converter that runs in a parallel job.
Click to expand...

Thanks for the info! I'll give those a go and see how they perform. I try to do everything in a job, of course.

pbhogan · Sep 26, 2019

pbhogan said: ↑

Thanks for the info! I'll give those a go and see how they perform. I try to do everything in a job, of course.
Click to expand...

Some initial findings:

Simply writing to NativeQueue in parallel is about 5x slower than adding to a NativeList in a single job, and that's excluding the conversion step that needs to happen after. So, that's a bust.

I implemented a "NativeBag" container that internally manages an UnsafeList per thread and then copies them all into a list afterwards. This outperforms NativeQueue, but only slightly. I find the slow performance a bit suspicious as it's essentially doing the same thing as my previous experiment splitting the workload manually over multiple jobs. Maybe I'm doing something wrong or there are cache line issues.

I'd like to try NativeStream for comparison, but can't figure out how to use it properly and it's not documented anywhere I can find. When I do try I get a few "NativeStream.Writer must be passed by ref once it is in use" errors and then it seems to be basically failing silently on writes and performance is even slower than NativeQueue.

Does anyone have an example of how I would write to a NativeStream in IJobParallelFor and then read it out after?

pbhogan · Sep 26, 2019

pbhogan said: ↑

Some initial findings:
Click to expand...

My findings previously posted were incorrect due to a bug in my code. Here are the correct findings:

I tested generating ~200k vertices per frame with five approaches. The average CPU time per frame is mentioned in parenthesis. All jobs use Burst. Tests were run on a CPU with 3 workers threads for jobs. I'll try this later on a 6 core machine (11 worker threads) to see how that changes the performance characteristics.

RenderGridSimpleJob (65ms)
This is the original job generating vertices and adding to a native list.

RenderGridSimpleSplitJob (36ms)
This is the manual split of simple job over separate native lists. Once all jobs are completed, the lists are concatenated. This concatenation is not currently done in a job as the number of lists are determined at runtime (a multiple of JobsUtility.JobWorkerCount), but it could be moved to a job by using a static the number of jobs. Performance gains would be likely be negligible for these memory copy operations.

RenderGridParallelStreamJob (80ms)
This is a parallel job writing to a native stream which is then read and converted into a native list by another job. On lower vertex counts (~15k) it seems to be faster than the simple method, and on par with the other methods.

RenderGridParallelQueueJob (44ms)
This is a parallel job adding to a native queue which is then read and converted to a native list by another job. This is, somewhat surprisingly, the fastest of the out-of-the-box solutions.

RenderGridParallelBagJob (40ms)
This is a parallel job adding to my custom native container managing internal native lists per thread, respecting cache lines, etc. and then copied out into a single list by another job. It's only slightly slower than RenderGridSimpleSplitJob while doing about the same thing in a contained way.

RenderGridSimpleSplitJob is the fastest, but a bit more housekeeping. It could probably be sped up even more at the cost of some memory by using preallocated native arrays and a simple custom native count to return the last written index in each array.

I can post my custom native container (NativeBag) code at some point if anyone is interested.

tertle · Sep 26, 2019

pbhogan said: ↑

My findings previously posted were incorrect due to a bug in my code. Here are the correct findings:

I tested generating ~200k vertices per frame with five approaches. The average CPU time per frame is mentioned in parenthesis. All jobs use Burst. Tests were run on a CPU with 3 workers threads for jobs. I'll try this later on a 6 core machine (11 worker threads) to see how that changes the performance characteristics.

RenderGridSimpleJob (65ms)
This is the original job generating vertices and adding to a native list.

RenderGridSimpleSplitJob (36ms)
This is the manual split of simple job over separate native lists. Once all jobs are completed, the lists are concatenated. This concatenation is not currently done in a job as the number of lists are determined at runtime (a multiple of JobsUtility.JobWorkerCount), but it could be moved to a job by using a static the number of jobs. Performance gains would be likely be negligible for these memory copy operations.

RenderGridParallelStreamJob (80ms)
This is a parallel job writing to a native stream which is then read and converted into a native list by another job. On lower vertex counts (~15k) it seems to be faster than the simple method, and on par with the other methods.

RenderGridParallelQueueJob (44ms)
This is a parallel job adding to a native queue which is then read and converted to a native list by another job. This is, somewhat surprisingly, the fastest of the out-of-the-box solutions.

RenderGridParallelBagJob (40ms)
This is a parallel job adding to my custom native container managing internal native lists per thread, respecting cache lines, etc. and then copied out into a single list by another job. It's only slightly slower than RenderGridSimpleSplitJob while doing about the same thing in a contained way.

RenderGridSimpleSplitJob is the fastest, but a bit more housekeeping. It could probably be sped up even more at the cost of some memory by using preallocated native arrays and a simple custom native count to return the last written index in each array.

I can post my custom native container (NativeBag) code at some point if anyone is interested.
Click to expand...

You are forgetting 1 important thing when benchmarking these things though.

Your parallel jobs stop other jobs from executing at the same time. They might be faster independently but slowerooverall for your application.

Your original job it takes 65ms on a single thread to generate a mesh.

Now image you want to generate 4 meshes on a quad core. Schedule them side by side and it's still only taking 65ms.

Take your 'optimized' split job. Since it's split over cores other jobs have to be schedule after. So if you want to generate 4 meshes that now takes 144ms!

If you're not getting close to 1x1 thread to performance improvement, you're parallelism is reducing the theoretical max throughout of your application.

pbhogan · Sep 26, 2019

tertle said: ↑

You are forgetting 1 important thing when benchmarking these things though. Your parallel jobs stop other jobs from executing at the same time. They might be faster independently but slower overall for your application.
Click to expand...

I haven't forgotten.

In my particular use case (which is quite simple, admittedly) there would be no other jobs running concurrently. But I'm fully aware it's all very nuanced with no one-size-fits-all winner.

These benchmarks are just helping me get a feel for the performance characteristics of these various approaches rather than declaring some kind of best general purpose solution. I'm learning my way around the job system code and documenting some results here since information seems to be quite sparse.

Obviously, in a complex application with lots of jobs, and maybe ECS too, it would take looking at the profiler, timeline, etc. and evaluating what is a better fit, and you're right that sometimes a single non-parallelized job is going to win out.

That said, sometimes one of the parallel approaches might do better as they'll load-balance batches around other jobs running concurrently and would scale better if someone runs them on something with a lot more cores even if it runs marginally slower on a low core count. There might even be value in having separate code paths for low and high core counts.

DreamingImLatios · Sep 27, 2019

1) Make sure you are profiling the time it takes to get from desired input to desired output. That means input to your mesh generation to final compacted NativeList.
2) Make sure you have safety checks and leak detection disabled when profiling.
3) If you are still seeing strange results, check the Burst assembly. See what is being vectorized or not vectorized and what is generating unnecessary branches.

tonytopper · Mar 3, 2022

Thanks for this conversation. I am generating about 300 meshes and combining them together to make a tree mesh using an L system. I'm thinking trying of to make the mesh generation multithreaded and this information was insightful.

(Unfortunately, I am using LineRenderer's BakeMesh which means I need to find a purer tapered line algo I can work with on separate threads.)

jbat100 · Sep 14, 2023

Catlike coding's tutorial series using the job system and the new mesh API is full of good ideas

https://catlikecoding.com/unity/tutorials/procedural-meshes/

Search Unity

Unity ID

Useful Searches

How to do mesh generation in parallel jobs