Pcc performance challenge

dynamicbutter · Jun 11, 2022

I've done everything I can think of to improve the performance of the algorithm here. This is my first foray into this type of optimization, so I'm sure it can be improved upon. If anyone is interested in reviewing that code and coming up with other improvements to the Pcc tiny use case, I'd love to hear about it. You can see the evolution of my optimizations by looking at the PccJobv1 through PccJobv6 variations.

CodeSmile · Jun 11, 2022

What does the algorithm even do? You don't mention it and the code isn't obvious either.

It would help to move each version to a separate file.

dynamicbutter · Jun 12, 2022

Thanks for the feedback, SteffenItterheim. The algorithm computes the Pearson Correlation Coefficient (PCC), which is mentioned near the top of the Readme. I had planned to put several algorithms in there (and still may), but for now PCC is all that it does. Have a look at SerialPccv5Tiny (the baseline) and ParallelPccv6Tiny (the best version I came up with). I have incorporated your feedback into v0.0.6, which is the latest on main as of this moment.
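For readers who have not met the formula, here is a minimal serial sketch of PCC in plain C#, assuming the common single-pass "sums" form of the coefficient. The names PccSketch and Pcc are hypothetical and are not taken from the repository, and this is not the project's baseline code.

```csharp
using System;

// Hypothetical helper, not part of the repository discussed in this thread.
public static class PccSketch
{
    // Pearson correlation coefficient of two equal-length samples,
    // computed from running sums in a single pass over the data.
    public static float Pcc(float[] x, float[] y)
    {
        if (x.Length != y.Length || x.Length == 0)
            throw new ArgumentException("x and y must be non-empty and the same length");

        float n = x.Length;
        float sumX = 0f, sumY = 0f, sumXY = 0f, sumXX = 0f, sumYY = 0f;
        for (int i = 0; i < x.Length; ++i)
        {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }

        // r = (n*Sxy - Sx*Sy) / sqrt(n*Sxx - Sx^2) / sqrt(n*Syy - Sy^2)
        float numerator = n * sumXY - sumX * sumY;
        float denomX = (float)Math.Sqrt(n * sumXX - sumX * sumX);
        float denomY = (float)Math.Sqrt(n * sumYY - sumY * sumY);
        return numerator / denomX / denomY;
    }
}
```

Calling PccSketch.Pcc on two perfectly linearly related samples should give a value close to 1, or close to -1 when one of them is reversed. A constant input makes a denominator zero, which this sketch does not guard against.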

CodeSmile · Jun 12, 2022

OMG those formulas.
I'm out.

dynamicbutter · Jun 12, 2022

Oh no

Well, if it helps, the two C# implementations of the formula used by the performance tests are in Baseline.cs and aren't very complicated. This is the original baseline, and this is a slightly faster baseline I created for a more apples-to-apples comparison with what I did in the parallel version.

Enzi · Jun 12, 2022

This looks pretty optimized already. (Pccv4) Without loading it up myself, the only thing I could think of for this type of algorithm is to make sure Burst is vectorizing the loops.

Code (CSharp):

R[i] =
    (Length * XYSumProd[i] - XResults[0] * YSum[i]) /
    XResults[1] /
    (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));

I'd change it to an inner loop of at least 4 elements. There's no vectorization going on with just 1, and it would be perfect for this kind of data.
Also, I don't see the point of XResults being an array. It's 2 floats, and the array could possibly interfere with the vectorization. If that's the case, change it to 2 parameters instead.

dynamicbutter · Jun 12, 2022

Thanks Enzi! Changing the merge job to IParallelForBatch and adding an inner loop (Pccv7) resulted in another 5% increase. Now it's 134x faster than the single-threaded, non-Burst-compiled baseline. When I first started this project I was targeting a 30x improvement, so this result is fantastic and is fast enough for my intended purpose.

The only thing left to do is make this generic (currently hard-coded to float). But if I understand correctly it isn't practical for this kind of highly optimized code because C# doesn't yet have a way to do numeric operations (like +, -, *, etc) with generics. I'm hoping something like...
public PccJob<T> where T: numeric
will someday exist.

dynamicbutter · Jun 12, 2022

Enzi, I forgot to mention, I had a look at the Burst Inspector and the loops are vectorized.

Enzi · Jun 12, 2022

134x! Pretty awesome!
When you implement a PccJob it gets compiled with that type, so generics shouldn't get in your way. Maybe abstract the arithmetic behind an interface implemented by a struct, so that each arithmetic operation is a method you implement per type. The important thing is to check that Burst doesn't lose vectorization. That can be enforced with Unity.Burst.CompilerServices.Loop.ExpectVectorized(), which needs the define "UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS" -> https://docs.unity3d.com/Packages/com.unity.burst@1.4/manual/docs/OptimizationGuidelines.html
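A minimal sketch of that struct-plus-interface idea in plain C#, assuming the merge expression posted earlier in the thread. Every name here (IArithmetic, FloatOps, DoubleOps, GenericPccSketch, Merge) is hypothetical, and whether Burst keeps the loop vectorized with this kind of abstraction is not verified here and would need checking in the Burst Inspector.

```csharp
using System;

// Hypothetical interface listing the arithmetic the PCC merge step needs.
public interface IArithmetic<T>
{
    T FromInt(int n);
    T Sub(T a, T b);
    T Mul(T a, T b);
    T Div(T a, T b);
    T Sqrt(T a);
}

// One small struct per numeric type implements the operations.
public struct FloatOps : IArithmetic<float>
{
    public float FromInt(int n) { return n; }
    public float Sub(float a, float b) { return a - b; }
    public float Mul(float a, float b) { return a * b; }
    public float Div(float a, float b) { return a / b; }
    public float Sqrt(float a) { return (float)Math.Sqrt(a); }
}

public struct DoubleOps : IArithmetic<double>
{
    public double FromInt(int n) { return n; }
    public double Sub(double a, double b) { return a - b; }
    public double Mul(double a, double b) { return a * b; }
    public double Div(double a, double b) { return a / b; }
    public double Sqrt(double a) { return Math.Sqrt(a); }
}

public static class GenericPccSketch
{
    // Same shape as the merge expression in the thread:
    // (n * xySumProd - xSum * ySum) / xDenom / sqrt(n * yySumProd - ySum^2).
    // xDenom is assumed to be precomputed by the caller.
    public static T Merge<T, TOps>(int length, T xySumProd, T xSum, T ySum, T xDenom, T yySumProd)
        where TOps : struct, IArithmetic<T>
    {
        TOps ops = default(TOps);
        T n = ops.FromInt(length);
        T numerator = ops.Sub(ops.Mul(n, xySumProd), ops.Mul(xSum, ySum));
        T yDenom = ops.Sqrt(ops.Sub(ops.Mul(n, yySumProd), ops.Mul(ySum, ySum)));
        return ops.Div(ops.Div(numerator, xDenom), yDenom);
    }
}
```

A caller picks the numeric type at the call site, for example GenericPccSketch.Merge<float, FloatOps>(...). The struct constraint lets the compiler specialize the method per type instead of dispatching through the interface, but every operator becomes a method call in the source, which makes the expression harder to read.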

apkdev · Jun 13, 2022

dynamicbutter said: ↑

The only thing left to do is make this generic (currently hard-coded to float). But if I understand correctly it isn't practical for this kind of highly optimized code because C# doesn't yet have a way to do numeric operations (like +, -, *, etc) with generics. I'm hoping something like...
public PccJob<T> where T: numeric
will someday exist.

A feature like this is indeed coming: generic math is available as a preview feature in .NET 6. Hopefully Unity.Mathematics supports this eventually: https://devblogs.microsoft.com/dotnet/preview-features-in-net-6-generic-math/
Until then you technically can write your own wrapper structs that implement mathematical operations, but without INumber and the static abstract interfaces feature the code quickly gets somewhat unwieldy.

dynamicbutter · Jun 13, 2022

apkdev said: ↑

but without INumber and the static abstract interfaces feature the code quickly gets somewhat unwieldy.

Thanks for the info @apkdev! I agree about unwieldy and will wait for tools to support it before circling back.

Arowx · Jun 16, 2022

Code (CSharp):

R[i] =
    (Length * XYSumProd[i] - XSum.Value * YSum[i]) /
    XResult.Value /
    (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));

You are accessing the same elements of the YSum array multiple times. Would it not be better to place the value in a local variable, or would that break the vectorisation?

Also, on every iteration you are dividing by XResult.Value, which does not change. It may be faster to compute its reciprocal once and multiply instead, as I believe multiplication is faster than division.*

*It looks like even in SIMD operations multiplication is way faster (roughly 5-10x) than division, according to the Intel® Intrinsics Guide.

dynamicbutter · Jun 16, 2022

Thanks @Arowx! I guess I sort of assumed the compiler was smart enough to do these kinds of things automatically, but tbh I'm not really sure. I'll run some tests and see how the generated code changes.

dynamicbutter · Jun 17, 2022

I got this running on iOS and the results are making me question some things. See this short writeup. Can anyone explain why the baseline is faster on iOS?

cc @Enzi , @apkdev , @Arowx

Enzi · Jun 17, 2022

The Mac version has been compiled with Mono, right?
iOS is AOT so it goes through IL2CPP, hence the faster execution.

dynamicbutter · Jun 17, 2022

Thanks @Enzi! You are onto something. I recompiled Mac using IL2CPP scripting backend for the player and now the Mac baseline is close to the iOS baseline. But even with IL2CPP the baseline is still a little (30%) slower on the Mac hardware. Not sure if there are other settings to tweak or the iOS compiler is somehow smarter?

dynamicbutter · Jun 21, 2022

@Arowx ,

Arowx said: ↑

would it not be better to place this value in a variable or would that break the vectorisation?

I tried it and it made absolutely no difference in the code produced. Good job Burst compiler team

Arowx said: ↑

It may be faster to invert the value and multiply instead as I believe multiplication is faster than division

Tried this too and it reduced the main loop from 13 to 11 vector math instructions, including getting rid of the vrcpps (reciprocal) instruction. Unfortunately, it did not measurably impact any of the performance tests.

Arowx · Jun 22, 2022

dynamicbutter said: ↑

Tried this too and it reduced the main loop from 13 to 11 vector math instructions, including getting rid of the vrcpps (reciprocal) instruction. Unfortunately, it did not measurably impact any of the performance tests.

It sounds weird that fewer instructions run at the same speed as more instructions, but the instructions we give to a modern CPU can be converted to microcode on the chip, so maybe the divide by a constant in the loop was inverted at a lower level?
Also, each instruction has its own latency and cycle timings, which differ from one instruction to another and between CPUs.

dynamicbutter · Jun 24, 2022

Arowx said: ↑

weird that fewer instructions run at the same speed as more instructions

My assumption is that the loop spends a significant portion of its time moving registers around, so removing 2 out of 13 vector math instructions just doesn't make a measurable improvement. I think there is a minuscule improvement, but it can't be measured by my test. I did include your change @Arowx in the latest version.

Here is the assembly, in case I'm wrong about my assumption. If any assembly experts out there are interested in shedding some light on why fewer instructions don't result in faster execution, I'd love more info.

Here is the original version of the loop:

Code (CSharp):

public void Execute(int startIndex, int count)
{
    for (int i = startIndex; i < startIndex + count; ++i) {
        R[i] =
            (Length * XYSumProd[i] - XSum.Value * YSum[i]) /
            XResult.Value /
            (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));
    }
}

and here is the assembly produced from the original version of the loop:

Here is the reciprocal-multiply version of the loop:

Code (CSharp):

public void Execute(int startIndex, int count)
{
    var recipXResult = 1 / XResult.Value;
    for (int i = startIndex; i < startIndex + count; ++i) {
        R[i] =
            recipXResult *
            (Length * XYSumProd[i] - XSum.Value * YSum[i]) /
            (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));
    }
}

and here is the assembly produced from the reciprocal-multiply version of the loop:
