math.select's performance vs. inline-if

Discussion in 'Entity Component System' started by 5argon, May 20, 2018.

  1. 5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    Talks have repeatedly highlighted that math.select is faster than an if conditional when Burst compiled. But the source code on GitHub shows that `select` simply maps to an inline-if.

    Screenshot 2018-05-20 17.17.15.png

    So, if I use an inline-if instead of `if`, would I get the same maximum performance? Personally I think inline-if is easier to read. (I am wondering whether the attribute on `select` has any effect that makes using `select` preferable.)
     
  2. Freddx

    Joined:
    Mar 31, 2017
    Posts:
    5
    Actually, [MethodImpl(MethodImplOptions.AggressiveInlining)] improves performance by telling the compiler that it should inline the method.

    But well, I think you can use Stopwatch to test the execution time.
     
  3. rz_0lento

    Joined:
    Oct 8, 2013
    Posts:
    2,361
    You could implement both in a minimal class, examine the generated assembly for each with the Burst Inspector, and diff the two yourself.
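
A minimal sketch of such a comparison, assuming a Unity project with the Burst, Collections and Mathematics packages installed. The job names `SelectJob` and `TernaryJob` and their fields are made up for illustration; the idea is only that both jobs show up in the Burst Inspector so their assembly can be compared.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job: picks the larger value per element through math.select.
[BurstCompile]
public struct SelectJob : IJob
{
    [ReadOnly] public NativeArray<float> A;
    [ReadOnly] public NativeArray<float> B;
    public NativeArray<float> Result;

    public void Execute()
    {
        for (int i = 0; i < Result.Length; i++)
        {
            // math.select(a, b, c) returns b when c is true, otherwise a.
            Result[i] = math.select(A[i], B[i], A[i] < B[i]);
        }
    }
}

// Hypothetical job: the same logic written as an inline-if.
[BurstCompile]
public struct TernaryJob : IJob
{
    [ReadOnly] public NativeArray<float> A;
    [ReadOnly] public NativeArray<float> B;
    public NativeArray<float> Result;

    public void Execute()
    {
        for (int i = 0; i < Result.Length; i++)
        {
            Result[i] = A[i] < B[i] ? B[i] : A[i];
        }
    }
}
```

Whether the two listings come out identical is exactly what the diff is meant to answer, so no result is claimed here.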
     
  4. LennartJohansen

    Joined:
    Dec 1, 2014
    Posts:
    2,394
    If you write the inline-if directly in the function, isn't it inlined automatically?
    Doesn't [MethodImpl(MethodImplOptions.AggressiveInlining)] just make sure that the math.select function itself is inlined?
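
To make the question concrete, here is a small sketch in plain C#. `SelectSketch.Select` is a hypothetical stand-in with the shape the thread describes for `math.select` (a one-line wrapper around the conditional operator); it is not the Unity.Mathematics source.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class SelectSketch
{
    // Hypothetical wrapper: returns b when c is true, otherwise a.
    // The attribute asks the compiler to inline this method at call sites.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float Select(float a, float b, bool c)
    {
        return c ? b : a;
    }

    // Goes through the wrapper; if the call is inlined, only the
    // conditional expression is left in the caller.
    public static float ClampLowViaSelect(float x, float min)
    {
        return Select(x, min, x < min);
    }

    // The same conditional written at the call site; there is no call to inline.
    public static float ClampLowInline(float x, float min)
    {
        return x < min ? min : x;
    }
}
```

Both methods return the same values for the same inputs; the attribute only concerns whether the wrapper call itself disappears.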
     
  5. andrzej_cadp

    Joined:
    Jan 27, 2017
    Posts:
    18
    Some architectures have a SIMD select instruction, so if you put your code in a job, Burst can vectorize it and avoid branching (even on architectures without one, its behavior can be emulated with a set of vectorized instructions). It's probably easier for Burst to recognize an explicit "select" than to work out the intent of arbitrary code. The convention also helps us avoid mistakes that lead to poor performance. I'm not aware of any concrete examples, but I can easily imagine a scenario where an overcomplicated inline-if creates branches, while a "select" makes sure this won't happen.
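
The emulation mentioned above usually comes down to bitwise masking. The sketch below shows the idea on a single float in plain C#; `MaskSelect` and its methods are illustrative names, and this is a model of the technique, not what Burst or LLVM actually emit. It assumes a runtime that provides `BitConverter.SingleToInt32Bits`.

```csharp
using System;

public static class MaskSelect
{
    // Branch-free select on the raw bits of two floats.
    // mask must be all ones (pick b) or all zeros (pick a).
    public static float Select(float a, float b, uint mask)
    {
        uint aBits = (uint)BitConverter.SingleToInt32Bits(a);
        uint bBits = (uint)BitConverter.SingleToInt32Bits(b);
        // Keep a's bits where the mask is 0 and b's bits where it is 1.
        uint picked = (aBits & ~mask) | (bBits & mask);
        return BitConverter.Int32BitsToSingle((int)picked);
    }

    // Builds an all-ones or all-zeros mask from a condition. On SIMD
    // hardware a vector compare produces such a mask per lane directly;
    // this helper only stands in for that step.
    public static uint MaskFrom(bool condition)
    {
        return condition ? 0xFFFFFFFFu : 0u;
    }
}
```

A SIMD version applies the same and/and-not/or pattern to every lane of a register at once.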
     
  6. xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    A single math.select(float, float, bool) compared to a similar if/else should generate similar code.

    The main interest in using select is on SIMD types, and more specifically on float4, as it maps more naturally to a SIMD register. It will typically result in dedicated instructions being used instead of a conditional move on each component.

    For example, a math.select(float4, float4, bool4) without SIMD would generate the following scalar code (note that you currently can't get Burst to generate this code from math.select, as it will always generate the SIMD version shown after it instead):

    Code (x86-64 assembly):
    push    rsi
    cmp     byte ptr [r8], 0
    lea     r9, [rdx + 4]
    lea     r11, [rdx + 8]
    lea     r10, [rdx + 12]
    cmove   rdx, rcx
    lea     rax, [rcx + 4]
    lea     rsi, [rcx + 8]
    lea     rcx, [rcx + 12]
    movss   xmm0, dword ptr [rdx]
    cmp     byte ptr [r8 + 1], 0
    cmovne  rax, r9
    movss   xmm1, dword ptr [rax]
    cmp     byte ptr [r8 + 2], 0
    cmove   r11, rsi
    movss   xmm2, dword ptr [r11]
    cmp     byte ptr [r8 + 3], 0
    cmove   r10, rcx
    fld     dword ptr [r10]
    pop     rsi
    ret
    whereas with float4 as a SIMD type (the default in Burst) the following code is generated:

    Code (x86-64 assembly):
    movups  xmm2, xmmword ptr [rcx]
    movups  xmm1, xmmword ptr [rdx]
    pmovzxbd        xmm3, dword ptr [r8]
    movabs  rax, .LCPI1_0
    pand    xmm3, xmmword ptr [rax]
    movabs  rax, .LCPI1_1
    pand    xmm3, xmmword ptr [rax]
    pxor    xmm0, xmm0
    pcmpeqd xmm0, xmm3
    blendvps        xmm1, xmm2, xmm0
    movaps  xmm0, xmm1
    ret
    But as you can see in the scalar version, it still uses cmove/cmovne for the if/else, so a single select on a scalar will not be more efficient than a traditional if/else.
     
  7. 5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    It is clear now. Thank you so much! I will use the Burst debugger more to see into my jobs from now on.
     
  8. deplinenoise

    Unity Technologies

    Joined:
    Dec 20, 2017
    Posts:
    33
    It's worth pointing out that `bool4` performs much worse if you store it in memory somewhere. Normally when you use a select, the mask is available in the correct format for SSE/AVX masking (all ones) and LLVM will replace the entire select with a single `blendv` instruction.
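
A short sketch of the two situations, assuming the Unity.Mathematics package is available. `MaskUsage` and its method names are hypothetical; only `math.select`, `float4` and `bool4` come from the library.

```csharp
using Unity.Mathematics;

public static class MaskUsage
{
    // The mask comes straight from a comparison and is consumed right away,
    // which is the case described above as friendly to a single blend.
    public static float4 MaxInline(float4 a, float4 b)
    {
        return math.select(a, b, a < b);
    }

    // The mask arrives as a bool4 that was stored somewhere earlier, so it
    // first has to be loaded and converted back into a lane mask.
    public static float4 PickStored(float4 a, float4 b, bool4 stored)
    {
        return math.select(a, b, stored);
    }
}
```

Both methods return the same lanes for the same mask; the difference, if any, would show up in the generated assembly, which can be checked in the Burst Inspector.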
     
  9. rz_0lento

    Joined:
    Oct 8, 2013
    Posts:
    2,361
    @deplinenoise, since you mentioned SSE/AVX, does Burst compile both into the same executable, or is there a compile-time switch for it? Obviously we'd want the more optimized AVX instructions on most PCs, for example, but there are still a bunch of CPUs that don't support them and would need a fallback. So my real question is, how does Burst handle this?
     
  10. xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Yes, the goal is to have a dynamic switch at runtime based on the CPU. This feature is not yet available in our work on Burst AOT, but it will come soon after the first preview for 2018.2.
     
  11. sebas77

    Joined:
    Nov 4, 2011
    Posts:
    1,642
    Jumping into this thread to ask whether [MethodImpl(MethodImplOptions.AggressiveInlining)] is actually supposed to work with the current Unity compiler. My tests suggest it doesn't. Is it maybe a hint for IL2CPP and Burst only?
     
  12. 5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    I think it is for Burst only and not for IL2CPP. Burst takes IL (straight?) to optimized assembly code for the target platform. IL2CPP takes IL to C++, which is then compiled into (less optimized?) assembly on whatever platform can build C++. If both take IL as input, then there should be no overlap in the pipeline?