Lot of String.memcpy4 calls when using Matrix4x4. Managed DLL performance problem?

Discussion in 'Scripting' started by Krisztian-Leicht, Mar 28, 2017.

  1. Krisztian-Leicht

    Joined:
    Sep 10, 2015
    Posts:
    6
    tl;dr: Use Mono xbuild instead of MSBuild on Windows when you compile your project to a managed DLL.

    Recently, while optimizing our shared codebase (a DLL), I found that one of my optimizations backfired for no apparent reason.

    The (simplified) code I was trying to optimize:

    Code (CSharp):
    public static float MatrixParamTest(Matrix4x4 input)
    {
        Matrix4x4 dummyMatrix = input;
        return Mathf.Max((dummyMatrix * new Vector3(1.0f, 0.0f, 0.0f)).magnitude, (dummyMatrix * new Vector3(0.0f, 1.0f, 0.0f)).magnitude);
    }
    After optimization:

    Code (CSharp):
    public static float MatrixParamTest(Matrix4x4 input)
    {
        Matrix4x4 dummyMatrix = input;
        var matrix00 = dummyMatrix.m00;
        var matrix10 = dummyMatrix.m10;
        var matrix20 = dummyMatrix.m20;
        var matrix30 = dummyMatrix.m30;
        var matrix01 = dummyMatrix.m01;
        var matrix11 = dummyMatrix.m11;
        var matrix21 = dummyMatrix.m21;
        var matrix31 = dummyMatrix.m31;

        var left = matrix00 * matrix00 + matrix10 * matrix10 + matrix20 * matrix20 + matrix30 * matrix30;
        var right = matrix01 * matrix01 + matrix11 * matrix11 + matrix21 * matrix21 + matrix31 * matrix31;

        return Mathf.Sqrt(left > right ? left : right);
    }
    As you can see, this saves the matrix multiplications, the implicit casts, and one of the sqrt calls, yet after compiling the new method I found that the original actually ran faster.
    A lot faster, in fact. For 10000 iterations:
    - Unoptimized version: 88.50 ms
    - Optimized version: 151.21 ms
    In other words, the "optimized" version ran about 1.7 times slower than the original.
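    For reference, a timing like this could be taken with a harness along the following lines. This is only a minimal sketch, not the code from the attached project: `Mat4` is a hypothetical stand-in for `UnityEngine.Matrix4x4` (so the snippet compiles without Unity), and `ColumnBench`, `MaxColumnLength`, and `TimeMs` are illustrative names.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for UnityEngine.Matrix4x4: 16 floats (64 bytes).
// Field names follow Unity's mRC convention (row R, column C).
public struct Mat4
{
    public float m00, m10, m20, m30;
    public float m01, m11, m21, m31;
    public float m02, m12, m22, m32;
    public float m03, m13, m23, m33;
}

public static class ColumnBench
{
    // Same arithmetic as the "optimized" method above:
    // the larger of the lengths of the first two columns.
    public static float MaxColumnLength(Mat4 m)
    {
        float left = m.m00 * m.m00 + m.m10 * m.m10 + m.m20 * m.m20 + m.m30 * m.m30;
        float right = m.m01 * m.m01 + m.m11 * m.m11 + m.m21 * m.m21 + m.m31 * m.m31;
        return (float)Math.Sqrt(left > right ? left : right);
    }

    // Calls the method `iterations` times and returns the elapsed milliseconds.
    // Results are accumulated into `sink` so the calls cannot be discarded.
    public static double TimeMs(Mat4 m, int iterations, out float sink)
    {
        sink = 0f;
        MaxColumnLength(m); // warm-up call so JIT compilation is not timed
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += MaxColumnLength(m);
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

    Calling `ColumnBench.TimeMs(matrix, 10000, out sink)` for each variant gives numbers that can be compared with each other; the absolute values depend on the runtime and on whether a profiler is attached.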

    I fired up the Unity Deep Profiler to see what was going on.
    As expected, the first version showed a lot of Vector3/Matrix4x4 operations, yet the method that took the most time was String.memcpy(): 15 ms, 40,000 calls.


    The optimized version consisted mostly of String.memcpy() calls, which took 83 ms over 220,000 calls.


    That was very interesting. At first I thought it was because our DLL was compiled in Debug mode, but when I checked, it was actually a Release build. I switched it to Debug anyway to see if it made any difference.
    To my surprise, it did. The profiler showed the same picture as the Release version, but with only 100,000 memcpy calls. The method itself finished in 77 ms, so it was a bit faster than the original.


    A bit faster, but I was still expecting a much better result from this optimization, and nothing explained those memcpy calls yet. So the next step was to move the code from the DLL into a script in Unity.
    I ran the same test: 28 ms overall, 20,000 memcpy calls.


    Now we're getting somewhere: the problem is specific to our DLL version.
    Next I opened ILSpy to see if anything stood out. I'm no IL expert, so I'm just going to write down my observations and leave the proper explanation of what happened here to someone more knowledgeable. (I would like to know it too. :) )

    IL (VS Release):

    .method public hidebysig static
    float32 MatrixParamTest (
    valuetype [UnityEngine]UnityEngine.Matrix4x4 input
    ) cil managed
    {
    // Method begins at RVA 0x2050
    // Code size 120 (0x78)
    .maxstack 3
    .locals init (
    [0] float32,
    [1] float32,
    [2] float32,
    [3] float32,
    [4] float32,
    [5] float32,
    [6] float32,
    [7] float32,
    [8] float32,
    [9] float32
    )

    IL_0000: ldarg.0
    IL_0001: dup
    IL_0002: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m00
    IL_0007: stloc.0
    IL_0008: dup
    IL_0009: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m10
    IL_000e: stloc.1
    IL_000f: dup
    IL_0010: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m20
    IL_0015: stloc.2
    IL_0016: dup
    IL_0017: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m30
    IL_001c: stloc.3
    IL_001d: dup
    IL_001e: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m01
    IL_0023: stloc.s 4
    IL_0025: dup
    IL_0026: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m11
    IL_002b: stloc.s 5
    IL_002d: dup
    IL_002e: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m21
    IL_0033: stloc.s 6
    IL_0035: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m31
    IL_003a: stloc.s 7
    IL_003c: ldloc.0
    IL_003d: ldloc.0
    IL_003e: mul
    IL_003f: ldloc.1
    IL_0040: ldloc.1
    IL_0041: mul
    IL_0042: add
    IL_0043: ldloc.2
    IL_0044: ldloc.2
    IL_0045: mul
    IL_0046: add
    IL_0047: ldloc.3
    IL_0048: ldloc.3
    IL_0049: mul
    IL_004a: add
    IL_004b: stloc.s 8
    IL_004d: ldloc.s 4
    IL_004f: ldloc.s 4
    IL_0051: mul
    IL_0052: ldloc.s 5
    IL_0054: ldloc.s 5
    IL_0056: mul
    IL_0057: add
    IL_0058: ldloc.s 6
    IL_005a: ldloc.s 6
    IL_005c: mul
    IL_005d: add
    IL_005e: ldloc.s 7
    IL_0060: ldloc.s 7
    IL_0062: mul
    IL_0063: add
    IL_0064: stloc.s 9
    IL_0066: ldloc.s 8
    IL_0068: ldloc.s 9
    IL_006a: bgt.s IL_0070

    IL_006c: ldloc.s 9
    IL_006e: br.s IL_0072

    IL_0070: ldloc.s 8

    IL_0072: call float32 [UnityEngine]UnityEngine.Mathf::Sqrt(float32)
    IL_0077: ret
    } // end of method MatrixExtension::MatrixParamTest

    IL (VS Debug):

    .method public hidebysig static
    float32 MatrixParamTest (
    valuetype [UnityEngine]UnityEngine.Matrix4x4 input
    ) cil managed
    {
    // Method begins at RVA 0x2050
    // Code size 132 (0x84)
    .maxstack 3
    .locals init (
    [0] valuetype [UnityEngine]UnityEngine.Matrix4x4,
    [1] float32,
    [2] float32,
    [3] float32,
    [4] float32,
    [5] float32,
    [6] float32,
    [7] float32,
    [8] float32,
    [9] float32,
    [10] float32,
    [11] float32
    )

    IL_0000: nop
    IL_0001: ldarg.0
    IL_0002: stloc.0
    IL_0003: ldloc.0
    IL_0004: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m00
    IL_0009: stloc.1
    IL_000a: ldloc.0
    IL_000b: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m10
    IL_0010: stloc.2
    IL_0011: ldloc.0
    IL_0012: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m20
    IL_0017: stloc.3
    IL_0018: ldloc.0
    IL_0019: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m30
    IL_001e: stloc.s 4
    IL_0020: ldloc.0
    IL_0021: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m01
    IL_0026: stloc.s 5
    IL_0028: ldloc.0
    IL_0029: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m11
    IL_002e: stloc.s 6
    IL_0030: ldloc.0
    IL_0031: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m21
    IL_0036: stloc.s 7
    IL_0038: ldloc.0
    IL_0039: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m31
    IL_003e: stloc.s 8
    IL_0040: ldloc.1
    IL_0041: ldloc.1
    IL_0042: mul
    IL_0043: ldloc.2
    IL_0044: ldloc.2
    IL_0045: mul
    IL_0046: add
    IL_0047: ldloc.3
    IL_0048: ldloc.3
    IL_0049: mul
    IL_004a: add
    IL_004b: ldloc.s 4
    IL_004d: ldloc.s 4
    IL_004f: mul
    IL_0050: add
    IL_0051: stloc.s 9
    IL_0053: ldloc.s 5
    IL_0055: ldloc.s 5
    IL_0057: mul
    IL_0058: ldloc.s 6
    IL_005a: ldloc.s 6
    IL_005c: mul
    IL_005d: add
    IL_005e: ldloc.s 7
    IL_0060: ldloc.s 7
    IL_0062: mul
    IL_0063: add
    IL_0064: ldloc.s 8
    IL_0066: ldloc.s 8
    IL_0068: mul
    IL_0069: add
    IL_006a: stloc.s 10
    IL_006c: ldloc.s 9
    IL_006e: ldloc.s 10
    IL_0070: bgt.s IL_0076

    IL_0072: ldloc.s 10
    IL_0074: br.s IL_0078

    IL_0076: ldloc.s 9

    IL_0078: call float32 [UnityEngine]UnityEngine.Mathf::Sqrt(float32)
    IL_007d: stloc.s 11
    IL_007f: br.s IL_0081

    IL_0081: ldloc.s 11
    IL_0083: ret
    } // end of method MatrixExtension::MatrixParamTest

    IL (Unity, Assembly-CSharp):

    .method public hidebysig static
    float32 MatrixParamTest (
    valuetype [UnityEngine]UnityEngine.Matrix4x4 input
    ) cil managed
    {
    // Method begins at RVA 0x2058
    // Code size 149 (0x95)
    .maxstack 3
    .locals init (
    [0] valuetype [UnityEngine]UnityEngine.Matrix4x4,
    [1] float32,
    [2] float32,
    [3] float32,
    [4] float32,
    [5] float32,
    [6] float32,
    [7] float32,
    [8] float32,
    [9] float32,
    [10] float32,
    [11] float32
    )

    IL_0000: nop
    IL_0001: ldarg.0
    IL_0002: stloc.0
    IL_0003: ldloca.s 0
    IL_0005: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m00
    IL_000a: stloc.1
    IL_000b: ldloca.s 0
    IL_000d: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m10
    IL_0012: stloc.2
    IL_0013: ldloca.s 0
    IL_0015: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m20
    IL_001a: stloc.3
    IL_001b: ldloca.s 0
    IL_001d: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m30
    IL_0022: stloc.s 4
    IL_0024: ldloca.s 0
    IL_0026: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m01
    IL_002b: stloc.s 5
    IL_002d: ldloca.s 0
    IL_002f: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m11
    IL_0034: stloc.s 6
    IL_0036: ldloca.s 0
    IL_0038: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m21
    IL_003d: stloc.s 7
    IL_003f: ldloca.s 0
    IL_0041: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m31
    IL_0046: stloc.s 8
    IL_0048: ldloc.1
    IL_0049: ldloc.1
    IL_004a: mul
    IL_004b: ldloc.2
    IL_004c: ldloc.2
    IL_004d: mul
    IL_004e: add
    IL_004f: ldloc.3
    IL_0050: ldloc.3
    IL_0051: mul
    IL_0052: add
    IL_0053: ldloc.s 4
    IL_0055: ldloc.s 4
    IL_0057: mul
    IL_0058: add
    IL_0059: stloc.s 9
    IL_005b: ldloc.s 5
    IL_005d: ldloc.s 5
    IL_005f: mul
    IL_0060: ldloc.s 6
    IL_0062: ldloc.s 6
    IL_0064: mul
    IL_0065: add
    IL_0066: ldloc.s 7
    IL_0068: ldloc.s 7
    IL_006a: mul
    IL_006b: add
    IL_006c: ldloc.s 8
    IL_006e: ldloc.s 8
    IL_0070: mul
    IL_0071: add
    IL_0072: stloc.s 10
    IL_0074: ldloc.s 9
    IL_0076: ldloc.s 10
    IL_0078: ble.un IL_0084

    IL_007d: ldloc.s 9
    IL_007f: br IL_0086

    IL_0084: ldloc.s 10

    IL_0086: call float32 [UnityEngine]UnityEngine.Mathf::Sqrt(float32)
    IL_008b: stloc.s 11
    IL_008d: br IL_0092

    IL_0092: ldloc.s 11
    IL_0094: ret
    } // end of method MatrixExtension::MatrixParamTest

    As you can see, the main difference between these versions is the opcodes used to load the matrix:
    - The VS Release version uses OpCodes.Dup.
    - The VS Debug version uses a combination of OpCodes.Stloc and OpCodes.Ldloc.
    - The Unity version uses a combination of OpCodes.Stloc and OpCodes.Ldloca.

    The next thing I wanted to check was whether the problem still occurs when using the Mono compiler (xbuild) instead of VS (MSBuild). Some information about the versions:

    Visual Studio Version:
    ----------------------------
    Microsoft Visual Studio Community 2015
    Version 14.0.25431.01 Update 3

    Mono Version:
    ------------------
    XBuild Engine Version 14.0
    Mono, Version 4.8.0.0

    Mono Debug result:


    Mono Release result:


    IL (Mono Debug):

    .method public hidebysig static
    float32 MatrixParamTest (
    valuetype [UnityEngine]UnityEngine.Matrix4x4 input
    ) cil managed
    {
    // Method begins at RVA 0x2050
    // Code size 132 (0x84)
    .maxstack 3
    .locals init (
    [0] valuetype [UnityEngine]UnityEngine.Matrix4x4,
    [1] float32,
    [2] float32,
    [3] float32,
    [4] float32,
    [5] float32,
    [6] float32,
    [7] float32,
    [8] float32,
    [9] float32,
    [10] float32,
    [11] float32
    )

    IL_0000: nop
    IL_0001: ldarg.0
    IL_0002: stloc.0
    IL_0003: ldloc.0
    IL_0004: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m00
    IL_0009: stloc.1
    IL_000a: ldloc.0
    IL_000b: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m10
    IL_0010: stloc.2
    IL_0011: ldloc.0
    IL_0012: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m20
    IL_0017: stloc.3
    IL_0018: ldloc.0
    IL_0019: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m30
    IL_001e: stloc.s 4
    IL_0020: ldloc.0
    IL_0021: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m01
    IL_0026: stloc.s 5
    IL_0028: ldloc.0
    IL_0029: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m11
    IL_002e: stloc.s 6
    IL_0030: ldloc.0
    IL_0031: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m21
    IL_0036: stloc.s 7
    IL_0038: ldloc.0
    IL_0039: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m31
    IL_003e: stloc.s 8
    IL_0040: ldloc.1
    IL_0041: ldloc.1
    IL_0042: mul
    IL_0043: ldloc.2
    IL_0044: ldloc.2
    IL_0045: mul
    IL_0046: add
    IL_0047: ldloc.3
    IL_0048: ldloc.3
    IL_0049: mul
    IL_004a: add
    IL_004b: ldloc.s 4
    IL_004d: ldloc.s 4
    IL_004f: mul
    IL_0050: add
    IL_0051: stloc.s 9
    IL_0053: ldloc.s 5
    IL_0055: ldloc.s 5
    IL_0057: mul
    IL_0058: ldloc.s 6
    IL_005a: ldloc.s 6
    IL_005c: mul
    IL_005d: add
    IL_005e: ldloc.s 7
    IL_0060: ldloc.s 7
    IL_0062: mul
    IL_0063: add
    IL_0064: ldloc.s 8
    IL_0066: ldloc.s 8
    IL_0068: mul
    IL_0069: add
    IL_006a: stloc.s 10
    IL_006c: ldloc.s 9
    IL_006e: ldloc.s 10
    IL_0070: bgt.s IL_0076

    IL_0072: ldloc.s 10
    IL_0074: br.s IL_0078

    IL_0076: ldloc.s 9

    IL_0078: call float32 [UnityEngine]UnityEngine.Mathf::Sqrt(float32)
    IL_007d: stloc.s 11
    IL_007f: br.s IL_0081

    IL_0081: ldloc.s 11
    IL_0083: ret
    } // end of method MatrixExtension::MatrixParamTest

    IL (Mono Release):

    .method public hidebysig static
    float32 MatrixParamTest (
    valuetype [UnityEngine]UnityEngine.Matrix4x4 input
    ) cil managed
    {
    // Method begins at RVA 0x2058
    // Code size 139 (0x8b)
    .maxstack 3
    .locals init (
    [0] valuetype [UnityEngine]UnityEngine.Matrix4x4,
    [1] float32,
    [2] float32,
    [3] float32,
    [4] float32,
    [5] float32,
    [6] float32,
    [7] float32,
    [8] float32,
    [9] float32,
    [10] float32
    )

    IL_0000: ldarg.0
    IL_0001: stloc.0
    IL_0002: ldloca.s 0
    IL_0004: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m00
    IL_0009: stloc.1
    IL_000a: ldloca.s 0
    IL_000c: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m10
    IL_0011: stloc.2
    IL_0012: ldloca.s 0
    IL_0014: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m20
    IL_0019: stloc.3
    IL_001a: ldloca.s 0
    IL_001c: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m30
    IL_0021: stloc.s 4
    IL_0023: ldloca.s 0
    IL_0025: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m01
    IL_002a: stloc.s 5
    IL_002c: ldloca.s 0
    IL_002e: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m11
    IL_0033: stloc.s 6
    IL_0035: ldloca.s 0
    IL_0037: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m21
    IL_003c: stloc.s 7
    IL_003e: ldloca.s 0
    IL_0040: ldfld float32 [UnityEngine]UnityEngine.Matrix4x4::m31
    IL_0045: stloc.s 8
    IL_0047: ldloc.1
    IL_0048: ldloc.1
    IL_0049: mul
    IL_004a: ldloc.2
    IL_004b: ldloc.2
    IL_004c: mul
    IL_004d: add
    IL_004e: ldloc.3
    IL_004f: ldloc.3
    IL_0050: mul
    IL_0051: add
    IL_0052: ldloc.s 4
    IL_0054: ldloc.s 4
    IL_0056: mul
    IL_0057: add
    IL_0058: stloc.s 9
    IL_005a: ldloc.s 5
    IL_005c: ldloc.s 5
    IL_005e: mul
    IL_005f: ldloc.s 6
    IL_0061: ldloc.s 6
    IL_0063: mul
    IL_0064: add
    IL_0065: ldloc.s 7
    IL_0067: ldloc.s 7
    IL_0069: mul
    IL_006a: add
    IL_006b: ldloc.s 8
    IL_006d: ldloc.s 8
    IL_006f: mul
    IL_0070: add
    IL_0071: stloc.s 10
    IL_0073: ldloc.s 9
    IL_0075: ldloc.s 10
    IL_0077: ble.un IL_0083

    IL_007c: ldloc.s 9
    IL_007e: br IL_0085

    IL_0083: ldloc.s 10

    IL_0085: call float32 [UnityEngine]UnityEngine.Mathf::Sqrt(float32)
    IL_008a: ret
    } // end of method MatrixExtension::MatrixParamTest

    As you can see, Mono.Debug was the same as VS.Debug, and Mono.Release was the same as the Unity-compiled (Assembly-CSharp) version. I also tried the mcs.exe that ships with Unity; it had the same performance as the Mono.Release build.

    Next, I wanted to check whether these differences also show up in an actual build. Since we mostly do mobile development, I tested this on Android with both Mono and IL2CPP.
    Because IL2CPP was so much faster, I had to increase the iteration count to 1,000,000 to get meaningful values.

    Android.Mono (1,000,000 iterations):

    As you can see, the Mono build shows basically the same two groups, except that VS.Release somehow got much better.

    Android.IL2CPP (1,000,000 iterations):

    With IL2CPP, however, there was no such problem: every version performed the same.

    Editor End results:

    -----------------


    My final conclusion is that those memcpy calls are actually matrix copies, as suspected here. I'm still not sure what causes the differences, or whether it's just an isolated problem, so if anyone has any thoughts, please share them. I attached the sample project I was using in case anyone wants to give it a try.
     


  2. Dave-Carlile

    Joined:
    Sep 16, 2012
    Posts:
    967
    Why are you making a copy of the input parameter?

    Code (csharp):
    Matrix4x4 dummyMatrix = input;
    Why not just operate off of input? That'll save a copy of the full structure.

    Another optimization would be to pass the matrix as a ref parameter which will save another copy.

    Code (csharp):
    public static float MatrixParamTest(ref Matrix4x4 input)
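    A rough sketch of the two suggestions combined: read the fields straight from the parameter and take it by `ref`, so only a reference is passed instead of the 64-byte struct. `Mat4` is a hypothetical stand-in for `UnityEngine.Matrix4x4` so the snippet compiles outside Unity, and the class and method names are illustrative only.

```csharp
using System;

// Hypothetical stand-in for UnityEngine.Matrix4x4: 16 floats (64 bytes).
public struct Mat4
{
    public float m00, m10, m20, m30;
    public float m01, m11, m21, m31;
    public float m02, m12, m22, m32;
    public float m03, m13, m23, m33;
}

public static class MatrixParam
{
    // By value with an extra local: the caller's matrix is copied into the
    // parameter, then copied again into dummyMatrix.
    public static float ByValue(Mat4 input)
    {
        Mat4 dummyMatrix = input;
        float left = dummyMatrix.m00 * dummyMatrix.m00 + dummyMatrix.m10 * dummyMatrix.m10
                   + dummyMatrix.m20 * dummyMatrix.m20 + dummyMatrix.m30 * dummyMatrix.m30;
        float right = dummyMatrix.m01 * dummyMatrix.m01 + dummyMatrix.m11 * dummyMatrix.m11
                    + dummyMatrix.m21 * dummyMatrix.m21 + dummyMatrix.m31 * dummyMatrix.m31;
        return (float)Math.Sqrt(left > right ? left : right);
    }

    // By ref without a local: the fields are read through the reference,
    // so the source code itself asks for no struct copy.
    public static float ByRef(ref Mat4 input)
    {
        float left = input.m00 * input.m00 + input.m10 * input.m10
                   + input.m20 * input.m20 + input.m30 * input.m30;
        float right = input.m01 * input.m01 + input.m11 * input.m11
                    + input.m21 * input.m21 + input.m31 * input.m31;
        return (float)Math.Sqrt(left > right ? left : right);
    }
}
```

    Both methods return the same value; the call site changes from `ByValue(m)` to `ByRef(ref m)`. Note that `ref` needs a variable, so a property result would have to be stored in a local first.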
     
  3. Krisztian-Leicht

    Joined:
    Sep 10, 2015
    Posts:
    6
    You are totally right about the second one.

    The first one was needed to mimic the original behaviour.
    Let's say the matrix comes from a Unity property that always returns a new copy:

    Code (CSharp):
    public static float Calculate(Transform someTransform)
    {
        Matrix4x4 dummyMatrix = someTransform.localToWorldMatrix;
        // Rest of the code
    }
    Anyway, thanks to your answer it makes much more sense now why the VS-compiled versions made so many calls.
    Looking at the IL code, Ldloc probably made a whole new copy each time I read one of the matrix's fields.
     
  4. Dave-Carlile

    Joined:
    Sep 16, 2012
    Posts:
    967
    But you're returning a float, not a matrix. Doesn't make sense to make that copy. Even if you were returning a matrix, these are structs (not classes) so you're always returning copies.
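    To illustrate the struct point, here is a small sketch using a made-up one-field struct (`Mat4Value` and `CopyDemo` are hypothetical names, not Unity types). Returning a struct hands the caller its own copy, so changing that copy leaves the stored value alone.

```csharp
using System;

// Hypothetical one-field struct; a real matrix struct behaves the same way.
public struct Mat4Value
{
    public float m00;
}

public static class CopyDemo
{
    private static Mat4Value stored = new Mat4Value { m00 = 1f };

    // Returning a struct returns a copy of the stored value.
    public static Mat4Value GetStored()
    {
        return stored;
    }

    // Modifies the returned copy, then reads the stored value back.
    public static float MutateCopyAndReadStored()
    {
        Mat4Value copy = GetStored();
        copy.m00 = 42f;    // changes only the local copy
        return stored.m00; // the stored value is unchanged
    }
}
```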
     
  5. Krisztian-Leicht

    Joined:
    Sep 10, 2015
    Posts:
    6
    I think you misunderstood what I meant last time.
    As I mentioned the original code looked more like this:

    Code (CSharp):
    public static float Calculate(Transform someTransform)
    {
        Matrix4x4 dummyMatrix = someTransform.localToWorldMatrix; // 1. <- cache this so it isn't fetched multiple times below; each access returns a new Matrix4x4.
        float matrix00 = dummyMatrix.m00; // 2. <- simple float field access, yet the VS-compiled version produces a matrix copy here while the Mono version does not.
        float matrix10 = dummyMatrix.m10;
        float matrix20 = dummyMatrix.m20;
        float matrix30 = dummyMatrix.m30;
        float matrix01 = dummyMatrix.m01;
        float matrix11 = dummyMatrix.m11;
        float matrix21 = dummyMatrix.m21;
        float matrix31 = dummyMatrix.m31;

        float left = matrix00 * matrix00 + matrix10 * matrix10 + matrix20 * matrix20 + matrix30 * matrix30;
        float right = matrix01 * matrix01 + matrix11 * matrix11 + matrix21 * matrix21 + matrix31 * matrix31;

        return Mathf.Sqrt(left > right ? left : right);
    }
    Still, the point I was trying to make is that, if you look at the numbers in the last picture, the same code resulted in different performance/behaviour between the Mono and VS compilers (as mentioned in point 2 above).
    I was curious whether it's some sort of project setting that I'm missing or just a compiler difference.