Correct use of UNITY_BRANCH

Discussion in 'Shaders' started by Deleted User, Jun 15, 2017.

  1. Deleted User (Guest)

    I've read quite a bit of discussion about whether or not branching is faster or slower.

    But that's not what this question is about.

    For this question, I just want to know: can I put UNITY_BRANCH before an #if? Or only before a regular if? And what about before a TERNARY operator?

    For example:

    UNITY_BRANCH
    #if defined(blahblah)
    //do 30+ cool commands on a large block of pixels.
    #else
    //do 30+ different, but equally cool commands.
    #endif

    or...

    UNITY_BRANCH
    if (howthisiscool > 0.1)
    {
    //do 30+ cool commands on a large block of pixels.
    }
    else
    {
    //do 30+ different, but equally cool commands.
    }

    OR..

    UNITY_BRANCH
    float AmICool = CoolStance > 1.0 ? YES() : NO(); //<-- Ternary operator that calls two functions.
     
  2. Deleted User (Guest)

    The reason I ask is that I saw discussions about using it before regular if () statements, but I didn't see anyone talk about using it before #if statements.

    In my specific case, I have a #if defined (UNITY_PASS_SHADOWCASTER), which would obviously be true or false for the entire pass. But I want to make sure the compiler compiles it into a branch. And I'm using #if for this check, so I want to know if I can or should put UNITY_BRANCH before the #if.
     
  3. bgolus

    Joined: Dec 7, 2012
    Posts: 12,343

    Only before an if, not an #if.

    UNITY_BRANCH is for flow control, that is to say "make this conditional work like you expect in a programming language" where only the code that's needed is run for each pixel. The counter to UNITY_BRANCH is UNITY_FLATTEN which makes the shader run both sides of the conditional all of the time and choose a result after doing the work. Those two macros are really just [branch] and [flatten] for HLSL, and do nothing on other platforms.
    https://msdn.microsoft.com/en-us/library/windows/desktop/bb509610(v=vs.85).aspx
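
    As a rough sketch of what that means in practice, the macros can be thought of as thin wrappers around the HLSL attributes. The MY_BRANCH / MY_FLATTEN names and the MY_TARGET_IS_HLSL define below are made up for illustration; Unity's real definitions live in its own include files and may differ in detail.

    ```hlsl
    // Hypothetical stand-ins for UNITY_BRANCH / UNITY_FLATTEN; names are illustrative only.
    #if defined(MY_TARGET_IS_HLSL)      // assumed define, not a real Unity symbol
        #define MY_BRANCH  [branch]
        #define MY_FLATTEN [flatten]
    #else
        #define MY_BRANCH               // expands to nothing on other targets
        #define MY_FLATTEN
    #endif

    float4 PickColor(float x, float4 a, float4 b)
    {
        float4 result = b;
        MY_BRANCH                       // same effect as writing [branch] directly
        if (x < 0.5)
            result = a;
        return result;
    }
    ```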

    Examples:
    UNITY_BRANCH
    if (screenPos.x < 0.5) {
    // run expensive code A
    } else {
    // run expensive code B
    }
    * On one side of the screen only the code for A will run, and on the other only code B will run.

    UNITY_FLATTEN
    if (screenPos.x < 0.5) {
    // run expensive code A
    } else {
    // run expensive code B
    }
    * Both sides of the screen will run both A and B all of the time, effectively making the shader twice as expensive.

    The reasons for choosing one over the other come down to some fairly low level GPU stuff. Most of the time you'll want to branch, but for especially cheap calculations the branch itself might be more expensive than just doing both. On modern AMD "GCN" GPUs (any from the last 6 years), a dynamic branch costs 6 instructions in itself, so choosing between two values with a dynamic branch could potentially be more expensive than lerping between the two! Modern shader compilers are pretty good about knowing which option to use. OpenGL doesn't even have a way to force one option over the other; it's assumed the compiler will make the correct choice for you (basically it'll always branch), and if you want it flattened you write it out flattened yourself. Generally speaking you don't need to use UNITY_BRANCH at all.
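
    To make the "cheap calculation" case concrete, here is a minimal sketch of the two ways to pick between two already-computed values. The function names and the colorA, colorB and mask inputs are hypothetical, and mask is assumed to be either 0 or 1. Which version is faster depends on the GPU and would need profiling.

    ```hlsl
    // Version 1: a real dynamic branch around a trivial assignment.
    float4 SelectWithBranch(float4 colorA, float4 colorB, float mask)
    {
        float4 result = colorB;
        [branch]                        // asks the Direct3D compiler for flow control
        if (mask > 0.5)
            result = colorA;
        return result;
    }

    // Version 2: no flow control at all.
    float4 SelectWithLerp(float4 colorA, float4 colorB, float mask)
    {
        // With mask at 0 this returns colorB, with mask at 1 it returns colorA.
        return lerp(colorB, colorA, mask);
    }
    ```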


    Now #if does something completely different. This isn't a shader compiler thing, but a shader preprocessor conditional directive. In lay terms this means: before compiling the shader, include or remove this code. Many programming languages have this, including C#, where you can do stuff like #if UNITY_EDITOR to only include and compile a block of code if it's running in the editor. UNITY_BRANCH does nothing for #if; in fact it may just end up confusing the compiler. I would expect warnings or even errors from the shader compiler.

    Example:
    fixed4 frag(v2f i) : SV_Target {
    #if defined(SHADER_API_MOBILE)
    // run cheap mobile code
    #else
    // run expensive desktop code
    #endif
    }

    If you build the project for mobile, only the code for mobile will be included in the shaders; the desktop code won't even exist. If you build the project for PC, or run in the editor, only the expensive code will be included in the shader. Effectively you'll have two completely different shaders that look like:

    fixed4 frag(v2f i) : SV_Target {
    // run cheap mobile code
    }

    and

    fixed4 frag(v2f i) : SV_Target {
    // run expensive desktop code
    }

    If you use an #if with a matching #pragma multi_compile then it's effectively making multiple shaders that can be swapped between by setting keywords on the material. Unlike an if, this is something that will be enabled or disabled for the entire draw call and cannot be changed per pixel.

    Example:
    #pragma multi_compile _ MYDEFINE

    fixed4 frag(v2f i) : SV_Target {
    #if defined(MYDEFINE)
    // do code A
    #else
    // do code B
    #endif
    }

    That will create two completely separate shaders with either code A or code B, and you can enable code A by using myMaterial.EnableKeyword("MYDEFINE").

    In the case of stuff like UNITY_PASS_SHADOWCASTER, that's a #define that's placed in the generated shader code for the shadowcaster pass of surface shaders, and all other passes will completely omit that code as they won't have that #define.
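
    For the shadowcaster case specifically, a sketch of what that looks like. The ApplyAlpha helper and the _Cutoff property are placeholders, not taken from the original shader.

    ```hlsl
    float _Cutoff;   // assumed material property

    // Hypothetical helper shared by several passes.
    void ApplyAlpha(float alpha)
    {
    #if defined(UNITY_PASS_SHADOWCASTER)
        // This line only exists in the shadowcaster pass: cut holes in the shadow.
        clip(alpha - _Cutoff);
    #else
        // This side only exists in the other passes; nothing to do in this sketch.
    #endif
    }
    ```

    No UNITY_BRANCH is needed here, since the unused side is removed before the shader compiler ever sees it.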



    edit - some other random notes:
    Older mobile devices which only support OpenGL ES 2.0 always "flatten", as they do not support dynamic branching at all. This was true for early desktop hardware as well, which is why there's a stigma against using if conditions in shaders.

    --------------------------------------------

    There is one case where you might want to use UNITY_BRANCH before an #if, and that's when there's an if after that #if, like:

    UNITY_BRANCH
    #if defined(TEST_GREATER)
    if (x > 0.5) {
    #else
    if (x < 0.5) {
    #endif
    // do code
    }

    which will generate a shader with either:

    UNITY_BRANCH
    if (x > 0.5) {
    // do code
    }

    or:

    UNITY_BRANCH
    if (x < 0.5) {
    // do code
    }

    --------------------------------------------

    A case where UNITY_FLATTEN is useful is when derivatives are involved, because derivative values (how much a value changes from one pixel to the one next to it) can't be calculated within a dynamic branch. A simple example would be:

    fixed4 color = fixed4(0,0,0,0);
    UNITY_BRANCH
    if (screenPos.x > 0.5) {

    float2 uv = screenPos * 2.0;
    color = tex2D(_MyTex, uv);
    }

    That'll cause an error as tex2D uses the derivatives of the uvs to determine the mip map to display, but since the uv is only calculated in the if statement it can't guarantee it can calculate the derivative. Using UNITY_FLATTEN would fix this, as would simply writing the above code like this:

    fixed4 color = fixed4(0,0,0,0);
    float2 uv = screenPos * 2.0;
    if (screenPos.x > 0.5) {

    color = tex2D(_MyTex, uv);
    }

    The bottom code has the benefit of not calling tex2D at all if it's not needed, where using UNITY_FLATTEN still would, though this is kind of a terrible example since UNITY_FLATTEN would likely end up faster on most modern hardware.
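
    Another way to keep a real branch around the sample is to compute the derivatives outside the branch and pass them in explicitly. This is only a sketch: it reuses the _MyTex and screenPos names from the example above, assumes screenPos is a float2, and wraps everything in a made-up SampleRightHalf function. Whether it is actually faster would need profiling.

    ```hlsl
    sampler2D _MyTex;

    fixed4 SampleRightHalf(float2 screenPos)
    {
        fixed4 color = fixed4(0,0,0,0);

        // Derivatives are taken outside the branch, where every pixel runs the same code.
        float2 uv = screenPos * 2.0;
        float2 uvDx = ddx(uv);
        float2 uvDy = ddy(uv);

        UNITY_BRANCH
        if (screenPos.x > 0.5) {
            // tex2Dgrad takes the derivatives as arguments instead of computing them itself.
            color = tex2Dgrad(_MyTex, uv, uvDx, uvDy);
        }
        return color;
    }
    ```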
     
    Last edited: Jun 15, 2017
  4. Deleted User (Guest)

    Wow thanks for this reply. Really helpful and answered all my questions.
     
  5. joshuacwilde

    Joined: Feb 4, 2018
    Posts: 727

    Does UNITY_BRANCH still work? It seems to get flattened to a ternary operator regardless, just throwing out the results if the if statement fails. This is on iOS Metal. I would prefer it not do that... I can't make optimizations using if, if it's just gonna do both sides anyway. And iOS Metal should be more than capable of handling if branching properly.
     
  6. bgolus

    Joined: Dec 7, 2012
    Posts: 12,343

    Yes, but ...

    As mentioned above, UNITY_BRANCH is a macro for HLSL's [branch]. Direct3D compilers use that as a hint for whether to branch or flatten a conditional or loop. Metal's shading language has no such hint, and will compile the code into whatever form the compiler chooses.
     
  7. joshuacwilde

    Joined: Feb 4, 2018
    Posts: 727

    It's weird though, because Unity does the compiling right? I am fairly certain Xcode just shows whatever the shader is before Xcode's internal shader compiling. So that would mean that Unity is forcing the flattened if, not Xcode.

    Is there a unity graphics/shader dev that can weigh in here?
     
    Last edited: Apr 23, 2021
  8. bgolus

    Joined: Dec 7, 2012
    Posts: 12,343

    Yes and no. Unity sends the shader through Microsoft's shader compiler, then converts that shader into other forms. At least that's what it does for OpenGL and Vulkan. Not entirely sure what it does for Metal. You can try using [branch] instead of the macro and see if that changes anything.
     
  9. aleksandrk

    Unity Technologies
    Joined: Jul 3, 2017
    Posts: 3,014

    Same as Vulkan and OpenGL.
     
  10. bgolus

    Joined: Dec 7, 2012
    Posts: 12,343

    So I did a little test locally to see what Unity spits out.
    Code (ShaderLab):
    Shader "Unlit/BranchVsFlattenTest"
    {
        SubShader
        {
            Tags { "RenderType"="Opaque" }
            LOD 100

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                #pragma multi_compile _ DO_BRANCH DO_FLATTEN

                #include "UnityCG.cginc"

                float4 vert (float4 vertex : POSITION) : SV_POSITION
                {
                    return UnityObjectToClipPos(vertex);
                }

                bool _test;
                float4 _ColorA;
                float4 _ColorB;

                fixed4 frag () : SV_Target
                {
                    #if defined(DO_BRANCH)
                    UNITY_BRANCH
                    #elif defined(DO_FLATTEN)
                    UNITY_FLATTEN
                    #endif
                    if (_test)
                        return _ColorA;
                    else
                        return _ColorB;
                }
                ENDCG
            }
        }
    }
    In Unity I set the "Compile and show code" button to output Metal (I'm on Windows). It spit out the three variants (no hint, branch hint, and flatten hint), which produced this code:

    Code (Metal):
    // UNITY_BRANCH and no hint produced this

    fragment Mtl_FragmentOut xlatMtlMain(
        constant FGlobals_Type& FGlobals [[ buffer(0) ]])
    {
        Mtl_FragmentOut output;
        if((uint(FGlobals._test))!=uint(0)){
            output.SV_Target0 = FGlobals._ColorA;
            return output;
        } else {
            output.SV_Target0 = FGlobals._ColorB;
            return output;
        }
        return output;
    }
    Code (Metal):
    // UNITY_FLATTEN produced this

    fragment Mtl_FragmentOut xlatMtlMain(
        constant FGlobals_Type& FGlobals [[ buffer(0) ]])
    {
        Mtl_FragmentOut output;
        bool u_xlatb0;
        u_xlatb0 = FGlobals._test!=0x0;
        output.SV_Target0 = FGlobals._ColorA;
        if(u_xlatb0){return output;}
        output.SV_Target0 = FGlobals._ColorB;
        return output;
    }
    So the macro is definitely doing something, and is returning a "real branch", or as close as you can get to one in Metal. So my guess is what you're seeing in Xcode is the shader compiled from that code.
     
  11. joshuacwilde

    Joined: Feb 4, 2018
    Posts: 727

    Hey thanks for doing that test, it seems like it was an odd scenario where it didn't work. After looking through my other shaders it appears to be working correctly there. And I have a different solution anyway for the problematic shader. Thanks for your help!!
     
  12. Azeew

    Joined: Jul 11, 2021
    Posts: 49

    Sorry to reply to this old post, but this really piqued my interest. Why do you say UNITY_FLATTEN would likely end up faster? Wouldn't you be sampling the texture unnecessarily that way, leading to an extra texture sample of overhead (for half the screen)? The more I read about branching, the more confused I get hahaha.
     
  13. Invertex

    Joined: Nov 7, 2013
    Posts: 1,550

    Modern GPU hardware has dedicated texture-mapping units (TMUs). The compiler should be able to see in that situation that the sampling of the texture does not rely on other dependent reads, and can queue the texture sampling before the frag program even begins. The sampling is not performed by the parts of the hardware that are processing regular calculations in the fragment code.
    So by ensuring flattening, you avoid the instruction overhead of a branch and aren't really increasing work since all you were avoiding with the branch was a single non-dependent sample. Now, if your shader already had a dozen other texture samples going on and you were managing to overload the TMUs, then it might end up worse.
     
  14. Azeew

    Joined: Jul 11, 2021
    Posts: 49

    Oh wow, that's very interesting! So that kinda means that the first few texture samples of a shader are "free"? Also, by modern GPU hardware do you mean that I can assume most people nowadays are gonna be in this category, or does it mean specifically "high end GPUs"?
    Thanks a lot for the info!
     
  15. Invertex

    Joined: Nov 7, 2013
    Posts: 1,550

    You are still putting some load on the system by having texture samples in the shaders: you're consuming bandwidth and cache. In a very simple example like this it could mostly be free, though the derivatives to determine LOD still have to be calculated since tex2D is being used; with tex2Dlod it would be much closer to "free". But mostly the point is just that it could be slightly cheaper than the cost of a dynamic branch.
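
    For reference, a minimal sketch of the tex2Dlod form mentioned here. The _MyTex texture and the SampleMip0 name are hypothetical, and always using mip 0 is just a choice made for the example.

    ```hlsl
    sampler2D _MyTex;   // hypothetical texture

    fixed4 SampleMip0(float2 uv)
    {
        // tex2Dlod reads the mip level from the w component of its coordinate,
        // so no derivatives have to be calculated for the sample.
        return tex2Dlod(_MyTex, float4(uv, 0, 0));
    }
    ```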

    And yes basically any GPU someone would be using has some dedicated texture mapping stage. In more recent architectures it's a little more complicated now with the work moving to more generalized parts of the core. But the way non-dependent reads are sampled is very optimized. The bigger issue to watch out for is putting too much load on the cache and overall memory bandwidth from lots of texture sampling.
     
  16. bgolus

    Joined: Dec 7, 2012
    Posts: 12,343

    The difference between tex2D and tex2Dlod on modern GPUs can often be so small as to be unmeasurable. I've done microsecond profiling on shaders and the difference was less than the frame to frame variance.

    But as @Invertex alluded to, this can depend heavily on what else the shader is doing.

    The very simple question is what takes more time: sampling a texture unnecessarily, or the cost of the branch? The branch itself is not free and can add several cycles to the shader. Sampling a texture is also not entirely free. But compare sampling the texture across the entire screen against sampling it across only half the screen while paying for the branch across the entire screen: most likely the branch will cost more overall.

    With completely made up numbers, let's assume a branch takes 6 cycles, and a texture sample takes 8 cycles, and the rest of the shader is equal. The shader takes a constant 8 cycles (plus the rest of the shader) in the case of always sampling the texture and throwing out the result. The shader takes either 6 cycles or 14 cycles (plus the rest of the shader) when using a branch. So that's an average of 8 cycles for no branch vs 10 cycles for a branch! And that ignores a lot of additional complexity in how shaders work that may push things even more in the branch-less shader's favor.
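
    Written out with the same made up numbers, and assuming the branch is taken on exactly half of the pixels (which is where the "average" comes from):

    ```latex
    \begin{aligned}
    \text{no branch:} &\quad 8 \text{ cycles on every pixel} \\
    \text{branch:}    &\quad \tfrac{1}{2}\,(6) + \tfrac{1}{2}\,(6 + 8) = 3 + 7 = 10 \text{ cycles on average}
    \end{aligned}
    ```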

    For example, modern GPUs (as in anything made in the last 15 years) do a lot of things behind the scenes to try to make texture sampling nearly "free". One of the methods they use is moving all texture samples to the start of the shader, or as early as possible. The GPU can then work on sampling the texture while the shader continues to calculate other things. If the sample is behind a branch, the shader has to do the work to calculate the values needed for the branch and do the branch itself, and only then sample the texture. So the above comparison may work out in favor of the branch-less shader even more so!

    It gets even more complex because different GPUs will handle texture reads inside a branch differently, and where the texture sample sits in the code will not actually determine when the texture is sampled. The shader compiler will make its own choices about how it should be done, and then the GPU's drivers may make additional modifications beyond that when translating the "compiled" shader into machine code for the GPU. So on some GPUs it may be moot and it'll be sampling the texture all of the time anyway. It's kind of impossible to know without profiling on hardware directly and looking at the shader assembly being used.
     
  17. Azeew

    Joined: Jul 11, 2021
    Posts: 49

    @Invertex @bgolus Thanks a lot for the replies. That actually completely changes the way I look at shaders hahahah. I had the impression that texture sampling was significantly more expensive than a branch, or any other ALU work for that matter. Like 100~200 cycles, but apparently that's not really how it goes. I guess I'll start being less conservative with texture sampling haha. Thank you!!