Search Unity

Pcc performance challenge

Discussion in 'Entity Component System' started by dynamicbutter, Jun 11, 2022.

  1. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    I've done everything I can think of to improve the performance of the algorithm here. This is my first foray into this type of optimization so I'm sure it can be improved upon. If anyone is interested in reviewing that code and coming up with other improvements to the Pcc tiny use case I'd love to hear about it. You can see the evolution of my optimizations looking at the PccJobv1 through PccJobv6 variations.
     
  2. CodeSmile

    CodeSmile

    Joined:
    Apr 10, 2014
    Posts:
    6,005
    What does the algorithm even do? You don't mention it and the code isn't obvious either.

    It would help to move each version to a separate file.
     
  3. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Thanks for the feedback SteffenItterheim. The algorithm computes the Pearson Correlation Coefficient (PCC) which is mentioned near the top of the Readme. I had planned to put several algorithms in there (and still may), but for now PCC is all that it does. Have a look at SerialPccv5Tiny (the baseline) and ParallelPccv6Tiny (the best version I came up with). I have incorporated your feedback into v0.0.6, which is main latest as of this moment.
     
  4. CodeSmile

    CodeSmile

    Joined:
    Apr 10, 2014
    Posts:
    6,005
    OMG those formulas. :eek:
    I'm out. :D
     
  5. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Oh no :(

    Well, if it helps, the two c# implementations of the formula used by the performance tests are in Baseline.cs and aren't very complicated. This is the original baseline, and this is a slightly faster baseline I created for a more apples-to-apples comparison with what I did in the parallel version.
     

    Attached Files:

  6. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    967
    This looks pretty optimized already. (Pccv4) Without loading it up myself, the only thing I could think of for this type of algorithm is to make sure Burst is vectorizing the loops.
    Code (CSharp):
    1. R[i] =
    2.                     (Length * XYSumProd[i] - XResults[0] * YSum[i]) /
    3.                     XResults[1] /
    4.                     (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));
    I'd change it to an innerloop of at least 4. There's no vectorization going on with just 1 and it would be perfect for this kind of data.
    Also, I don't see the point of XResults being an array. It's 2 floats, and could possibly screw with the vectorization. If that's the case, change it to 2 parameters instead.
     
  7. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Thanks Enzi! Changing the merge job to IParallelForBatch and adding an inner loop (Pccv7) resulted in another 5% increase. Now its 134x faster than the single-threaded-non-burst-compiled baseline. When I first started this project I was targeting a 30x improvement so this result is fantastic and is fast enough for my intended purpose.

    The only thing left to do is make this generic (currently hard-coded to float). But if I understand correctly it isn't practical for this kind of highly optimized code because C# doesn't yet have a way to do numeric operations (like +, -, *, etc) with generics. I'm hoping something like...
    public PccJob<T> where T: numeric
    will someday exist.
     
    apkdev likes this.
  8. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Enzi, I forgot to mention, I had a look at the burst inspector and the loops are vectorized.
     
  9. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    967
    134x! Pretty awesome!
    When you implement a PccJob it gets compiled with that type so generics shouldn't get in your way. Maybe abstract with a struct and an interface to have easier support where you'd implement one of the arithmetics in a method. Important thing is to check that Burst doesn't lose vectorization. Can be secured with Unity.Burst.CompilerServices.Loop.ExpectVectorized(); which needs a define "UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS" -> https://docs.unity3d.com/Packages/com.unity.burst@1.4/manual/docs/OptimizationGuidelines.html
     
    apkdev likes this.
  10. apkdev

    apkdev

    Joined:
    Dec 12, 2015
    Posts:
    284
    A feature like this is indeed coming to .NET 6. Hopefully Unity.Mathematics supports this eventually: https://devblogs.microsoft.com/dotnet/preview-features-in-net-6-generic-math/
    Until then you technically can write your own wrapper structs that implement mathematical operations, but without INumber and the static abstract interfaces feature the code quickly gets somewhat unwieldy.
     
    bb8_1 and Anthiese like this.
  11. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Thanks for the info @apkdev! I agree about unwieldy and will wait for tools to support it before circling back.
     
  12. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Code (CSharp):
    1. R[i] =
    2.                         (Length * XYSumProd[i] - XSum.Value * YSum[i]) /
    3.                         XResult.Value /
    4.                         (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));
    You are re-accessing elements in an array multiple times YSum would it not be better to place this value in a variable or would that break the vectorisation?

    Also if over multiple runs you are dividing by XResult.Value that does not change. It may be faster to invert the value and multiply instead as I believe multiplication is faster than division.*

    *It looks like even in SIMD operations Multiplication is way faster (roughly 5-10x) than division Intel® Intrinsics Guide
     
    Last edited: Jun 16, 2022
  13. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Thanks @Arowx! I guess I sort of assumed the compiler was smart enough to do these kinds of things automatically, but tbh I'm not really sure. I'll run some tests and see how the generated code changes.
     
    Arowx likes this.
  14. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    I got this running on iOS and the results are making me question some things. See this short writeup. Can anyone can explain why the baseline is faster on iOS?

    cc @Enzi , @apkdev , @Arowx
     
    Last edited: Jun 17, 2022
  15. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    967
    The Mac version has been compiled with Mono, right?
    iOS is AOT so it goes through IL2CPP, hence the faster execution.
     
    dynamicbutter likes this.
  16. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    Thanks @Enzi! You are onto something. I recompiled Mac using IL2CPP scripting backend for the player and now the Mac baseline is close to the iOS baseline. But even with IL2CPP the baseline is still a little (30%) slower on the Mac hardware. Not sure if there are other settings to tweak or the iOS compiler is somehow smarter?
     
  17. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    @Arowx ,

    I tried it and it made absolutely no difference in the code produced. Good job Burst compiler team ;)

    Tried this too and it reduced the main loop from 13 to 11 vector math instructions including getting rid of the
    vrcpps
    (reciprocal) instruction. Unfortunately, it did not measurably impact any of the performance tests.
     
  18. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Sounds weird that less instructions run at the same speed as more instructions but the actual instructions we give to a modern CPU can be converted to microcode on chip so maybe the divide by constant in loop was inverted at a lower level?
    Also each instruction can have latency and CPU cycles timings that differ from each other and between CPUs.
     
  19. dynamicbutter

    dynamicbutter

    Joined:
    Jun 11, 2021
    Posts:
    63
    My assumption is the loop is spending a significant portion of its time moving registers around and removing 2 out of 13 vector math instructions just doesn't make a measurable improvement. I think there is a minuscule improvement, but it can't be measured by my test. I did include your change @Arowx in the latest version ;)

    Here is the assembly in case I'm wrong about my assumption and any assembly experts out there are interested in shedding some light on why fewer instructions don't result in faster execution I'd love more info.

    Here is the original version of the loop:
    Code (CSharp):
    1.  
    2. public void Execute(int startIndex, int count)
    3. {
    4.     for (int i = startIndex; i < startIndex + count; ++i) {
    5.         R[i] =
    6.             (Length * XYSumProd[i] - XSum.Value * YSum[i]) /
    7.             XResult.Value /
    8.             (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));
    9.     }
    10. }
    11.  
    and here is the assembly produced from the original version of the loop:
    inspector.png


    Here is the reciprocal-multiply version of the loop:
    Code (CSharp):
    1.  
    2. public void Execute(int startIndex, int count)
    3. {
    4.     var recipXResult = 1 / XResult.Value;
    5.     for (int i = startIndex; i < startIndex + count; ++i) {
    6.         R[i] =
    7.             recipXResult *
    8.             (Length * XYSumProd[i] - XSum.Value * YSum[i]) /
    9.             (math.sqrt(Length * YYSumProd[i] - YSum[i] * YSum[i]));
    10.     }
    11. }
    12.  
    and here is the assembly produced from the reciprocal-multiply version of the loop:
    inspector2.png