Random access array in a Job

Discussion in 'Entity Component System' started by Ahmed_Al_Asl, Apr 1, 2020.

  1. Ahmed_Al_Asl

    Ahmed_Al_Asl

    Joined:
    Jul 14, 2017
    Posts:
    9
    I'm trying to write a radix sort using the job system, but I found that it's slower than other sorting methods.

    After some testing, I found that random element access inside a job is what makes it slow.
    For example, I have a job that builds a histogram: it reads from a target array and uses that data to decide which element of the histogram array to increment.

    I tried only reading and doing the math on the target array, and that was pretty fast.
    Then I tried writing to a single element of the histogram while reading, and that was also fast.
    But mixing the two together is 10 times slower.

    Code (CSharp):
        [BurstCompile]
        struct BuildHistogramJob : IJobFor
        {
            [ReadOnly]
            public NativeArray<int> array;
            public int digitIndex;
            public NativeArray<int> histogram;

            public void Execute(int index)
            {
                histogram[GetDigit(array[index], digitIndex)]++;
            }
        }
    Why is this happening, and is there any way around it?
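
    For context, a job like this can only be run on a single thread, because every iteration may write to the same element of the shared histogram array. A minimal scheduling sketch (`values` and the histogram size of 256 buckets are assumptions, not from the post; this assumes the job above compiles in a Unity project with the Jobs and Burst packages):

    Code (CSharp):
        // ScheduleParallel would race on `histogram`, so run the job
        // on a single thread via Schedule or Run.
        var histogram = new NativeArray<int>(256, Allocator.TempJob);
        var job = new BuildHistogramJob
        {
            array = values,     // the input NativeArray<int> to sort
            digitIndex = 0,     // which digit of the key to count
            histogram = histogram
        };
        job.Schedule(values.Length, default).Complete();
        // or, to run immediately on the main thread: job.Run(values.Length);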
     
  2. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Random access is tough on the CPU because it can't figure out in advance where the reads and writes are in memory, so it has to wait until the last minute to load the memory into cache. That's slow. Sometimes you can fix it with prefetching. Sometimes you just need to look into the details of your algorithm and figure out if there's a smarter way to access your data.
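
    One common trick along those lines (a hypothetical sketch, not from the thread): the increment `histogram[digit]++` is a load-modify-store on an address that isn't known until the key has been loaded, so consecutive iterations hitting the same bucket form a serial dependency chain. Splitting the histogram into a few lanes and merging them at the end can break that chain. Plain C#, no Unity types; the 8-bit digit extraction is an assumption:

    Code (CSharp):
        static int[] BuildHistogram4Lane(int[] keys, int shift)
        {
            const int Radix = 256;
            const int Lanes = 4;
            // Four independent histograms so that back-to-back increments
            // rarely target the same memory location.
            var lanes = new int[Lanes, Radix];
            for (int i = 0; i < keys.Length; i++)
            {
                int digit = (keys[i] >> shift) & (Radix - 1);
                lanes[i & (Lanes - 1), digit]++;
            }
            // Merge the lanes into the final histogram.
            var histogram = new int[Radix];
            for (int d = 0; d < Radix; d++)
                for (int l = 0; l < Lanes; l++)
                    histogram[d] += lanes[l, d];
            return histogram;
        }

    Whether this actually helps depends on the CPU and on how the compiler schedules the loads and stores, so it's worth benchmarking both ways.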

    I have written a floating point radix sort for Burst and it crushes everything else for datasets with more than 300 elements.
     
  3. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Found this paper comparing sorting functions. What's fascinating is that it reports you only benefit from more threads when you have a much larger data set, e.g. > 4096 elements to sort (see page 44 on the radix sort)?

    Wondering if you have benchmarked a single-threaded version against a multi-threaded version and compared different element quantities?
     
    Last edited: Apr 2, 2020
  4. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Googling around, there are SIMD versions of sorting algorithms that can boost performance, depending on the size of the overhead and the SIMD instruction set available.

    Therefore, could the Jobs/Burst libraries provide a more optimal general sorting algorithm?
     
  5. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    You aren't reading that paper right. It actually suggests the exact opposite, which is consistent with what I have found for pretty much all algorithms and multithreading.
     
  6. Ahmed_Al_Asl

    Ahmed_Al_Asl

    Joined:
    Jul 14, 2017
    Posts:
    9
    I ran multiple tests of different sorting methods, on the managed side and on the native side using jobs and Burst. I also built the player with IL2CPP. Here's what I found:

    Please note that all of these tests are single-threaded using one job (IJob), and it's the same algorithm on each side.
    The LSD radix sort finds the max value, while the MSD version doesn't (if the MSD version also found the max value, it would be the fastest sort on the managed side for this case).

    All figures are average run times over 100 tests, for 1000 elements of type UInt32, with sorted percentage 0/100 and values ranging from 0 to 1000.

    Managed =>
    > SelectionSort: 3.666943 ms
    > InsertionSort: 1.015232 ms
    > BubbleSort: 10.92266 ms
    > CombSort: 0.330589 ms
    > ShellSort: 0.343845 ms
    > HeapSort: 0.4121131 ms
    > MergeSort: 0.282422 ms
    > MergeInsertionSort: 0.203311 ms
    > QuickSort: 0.158291 ms
    > RadixLSDBinarySort: 0.10649 ms
    > RadixMSDSort: 0.125869 ms
    > InPlaceRadixSort: 0.152194 ms

    Native =>
    > SelectionSort: 0.4209231 ms
    > InsertionSort: 0.146739 ms
    > BubbleSort: 0.6081371 ms
    > CombSort: 0.249881 ms
    > ShellSort: 0.07714503 ms
    > HeapSort: 0.067807 ms
    > MergeSort: 0.057691 ms
    > MergeInsertionSort: 0.045217 ms
    > QuickSort: 0.044245 ms
    > RadixLSDBinarySort: 0.08183099 ms
    > RadixMSDInPlaceSort: 0.3571751 ms
    > RadixMSDSort: 0.120616 ms

    In general, native is just a faster version of managed, except for radix.
     
  7. Ahmed_Al_Asl

    Ahmed_Al_Asl

    Joined:
    Jul 14, 2017
    Posts:
    9
    That was my first thought, but some methods like quicksort also swap and read elements in an unpredictable way, and they weren't affected the way radix is.
     
  8. Ahmed_Al_Asl

    Ahmed_Al_Asl

    Joined:
    Jul 14, 2017
    Posts:
    9
    I feel like the answer is hidden in how jobs work internally.
     
  9. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Benchmarks are meaningless without code, as specific implementation details can have surprisingly drastic effects on speed when working at this low level (Burst).

    As I have mentioned before, I use a floating-point LSB 8-bit radix sort which beats every other sort I have tried when there are more than 300 elements and Burst is enabled. The second-fastest sort I have seen, and the fastest at fewer than 300 elements, is Unity's NativeSort, which I believe is primarily a segment sort variant. Neither of these algorithms makes great use of SIMD. And while I believe that is possible, it just doesn't matter. Why? Because unless your payload is extremely small (which is rare in practice), randomly accessing or rearranging the payload is far more expensive than the sort itself.
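
    To illustrate that last point: instead of shuffling a heavy payload struct through every pass of the sort, you can sort lightweight (key, index) pairs and permute the payload exactly once at the end. A hypothetical plain-C# sketch, not DreamingImLatios's actual implementation, and using Array.Sort rather than a radix sort to keep it short:

    Code (CSharp):
        static T[] SortPayloadByKey<T>(uint[] keys, T[] payload)
        {
            // Sort only the indices, ordered by their keys...
            var order = new int[keys.Length];
            for (int i = 0; i < order.Length; i++) order[i] = i;
            System.Array.Sort(order, (a, b) => keys[a].CompareTo(keys[b]));

            // ...then rearrange the heavy payload in a single pass.
            var result = new T[payload.Length];
            for (int i = 0; i < order.Length; i++)
                result[i] = payload[order[i]];
            return result;
        }

    The single gather at the end is still random access, but it happens once rather than once per sorting pass, which is why the payload size tends to dominate the cost.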
     
  10. Ahmed_Al_Asl

    Ahmed_Al_Asl

    Joined:
    Jul 14, 2017
    Posts:
    9
    I agree with you, but what I have is the same code on the native side and on the managed side; the only things that change are the collection type (NativeArray vs List) and the fact that the code is wrapped inside a job (single-threaded sort). What I want to show with the benchmark is how radix changes between native and managed compared to other sorting methods.

    You can check what's going on with NativeSort; the code is open for you to see. They use a combination of quicksort and insertion sort, and you can also see a heap sort, though I don't think the code ever reaches it. When you use a sorting job you can also see a merge sort added to the mix (I think that's what you are referring to as segment sort).

    When you wrote your sort, did you run your tests in the player, and did you build with IL2CPP?
     
  11. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    Well, given there is a performance difference between Mono and Burst for every sort, that suggests the compilers interpret the code very differently. It shouldn't be surprising that certain implementation details can trip up one compiler but not the other. My point is that I wouldn't draw the conclusion that radix sort is slow in Burst just because your particular implementation of radix sort is slow in Burst.

    It's gone through a few iterations. I stopped caring, as its performance only continued to improve, but it never invoked SIMD and has not yet caught up to my radix sort for heavy use cases.

    I use the performance testing framework as well as builds for both Mono and IL2CPP. However, I have really only profiled Burst in all three cases, and the timings remained the same. I don't care much about Mono or IL2CPP performance; they are both way too slow.
     
  12. Ahmed_Al_Asl

    Ahmed_Al_Asl

    Joined:
    Jul 14, 2017
    Posts:
    9
    The only thing you tested is LSB radix, and you are trying to answer the question without testing any other radix methods. For LSB radix I reached the same conclusion you did, but MSD is a different case. You keep saying "it just doesn't matter. Why?", but I do care why this is happening, so please participate if you have something to add.
     
  13. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,271
    My point was that the performance order of algorithms being different between Mono and Burst is expected. I don't care about Mono's performance, only Burst's, because that's what is going to run in release for me.

    I'm not going to write a bunch of different sorts to prove that you suck at writing radix sorts for Burst (I don't believe that, by the way). But if you post the code of your implementation, I can help you identify what might be tripping Burst up.