Branches - how expensive are they?

Discussion in 'Shaders' started by pvloon, Sep 25, 2012.

  1. pvloon
    Joined: Oct 5, 2011
    Posts: 586
    Hi,

    As you might know from the showcase, I'm doing GPU particles, with potentially millions of particles on screen. I need performance to be as high as possible, but I frequently end up with the same question: to branch or not to branch? Take the following:


    Code (csharp):
    if (someValue != 0)
    {
        // complicated calculation and such
    }
    else
    {
        // simplified calculation for this scenario
        // (let's say more than 40% of branches take this path)
    }

    On the CPU I'm sure this would be faster; on the GPU, however, people seem to stress a lot over branches. Would a branch be slower than just doing the actual calculation? In my case, an example would be:

    Code (csharp):
    // shortened and simplified a bit
    float3 dif = somePos - someOtherPos;

    if (rotated) // let's say this happens between 40-60% of the time
    {
        // express dif in the rotated axes
        dif = float3(dot(dif, axisX), dot(dif, axisY), dot(dif, axisZ));
    }

    Would this be slower or faster without the branch?

    Thanks!
    -Arthur


    (Oh, and I'm talking about Cg.)
     
    Last edited: Sep 25, 2012
  2. Aras
    Unity Technologies
    Joined: Nov 7, 2005
    Posts: 4,536
    TL;DR: Doing a branch to avoid several dot products is almost certainly going to be slower.

    Longer answer:

    There are various forms of branching. "Static branching" is when the same branch is always taken for the whole draw call, e.g. anything that is based on shader properties and not on per-vertex/per-pixel data. These branches are usually "quite ok"; there's some overhead in the branch itself, but otherwise the GPU can happily run exactly the same code for all vertices / pixels.

    "Dynamic branching" is when you branch on something that can be different for each vertex or pixel. Now, GPUs process quite a lot of vertices / pixels at once, and one of the things that makes GPUs fast is that they run exactly the same code for a whole bunch of those parallel "work items". If all of them happen to take the same side of a branch, good. If, however, some of them take a different side, then the GPU runs both sides of the branch for that whole bunch and just discards some of the results.

    So, for a dynamic branch to be a win, it has to be quite "coherent" (worst case: a different branch for each pixel; good case: the same branch for large portions of the screen), and the code under the branch has to be quite big. A rule of thumb is probably "30+ instructions". For example, a branch that detects whether a pixel is outside of a light's range and skips the whole calculation of BRDF attenuation would make sense: it's both coherent on the screen and skips quite a lot of computation.
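
    To put rough numbers on that rule of thumb, here is a back-of-the-envelope cost model. It is plain Python running on the CPU, not shader code, and every number in it (the group size of 32, the branch overhead of 4, the instruction counts) is a made-up assumption rather than a measurement from any real GPU. The "else" side is assumed to be free, as in the question's second snippet.

```python
# Back-of-the-envelope cost model for a dynamic branch.
# All numbers (group size, instruction counts, branch overhead) are
# made-up assumptions, not measurements from any real GPU.

GROUP_SIZE = 32        # work items assumed to execute in lockstep
BRANCH_OVERHEAD = 4    # assumed fixed cost of having the branch at all

def group_cost(conditions, if_cost, else_cost):
    """Cost in 'instructions' for one group to get through the if/else."""
    cost = BRANCH_OVERHEAD
    if any(conditions):        # someone needs the 'if' side: all pay for it
        cost += if_cost
    if not all(conditions):    # someone needs the 'else' side: all pay for it
        cost += else_cost
    return cost

def total_cost(all_conditions, if_cost, else_cost):
    """Split the items into groups and add up the per-group costs."""
    groups = [all_conditions[i:i + GROUP_SIZE]
              for i in range(0, len(all_conditions), GROUP_SIZE)]
    return sum(group_cost(g, if_cost, else_cost) for g in groups)

n = 1024
# Coherent: the first half of the items take the 'if' side, the rest skip it.
coherent = [i < n // 2 for i in range(n)]
# Incoherent: neighbouring items alternate between the two sides.
incoherent = [i % 2 == 0 for i in range(n)]

for if_cost in (3, 60):   # a few dot products vs. a big lighting calculation
    branchless = (n // GROUP_SIZE) * if_cost   # always run the expensive code
    print(if_cost, branchless,
          total_cost(coherent, if_cost, 0),
          total_cost(incoherent, if_cost, 0))
```

    Under these assumptions the 3-instruction branch loses even when it is perfectly coherent (176 vs. 96 without the branch), while the 60-instruction branch wins when coherent (1088 vs. 1920) and loses when incoherent (2048).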

    To everything above, add a note that some older GPUs / shader models don't really support dynamic branches. E.g. for shader model 2.0 (the default in Unity), the compiler will "flatten" the branches, so you don't actually have them in the end.
     
  3. Martin-Kraus
    Joined: Feb 18, 2011
    Posts: 617
    In my limited understanding, the answer depends very much on your GPU. Old GPUs always computed all branches of an if-else block; more recent GPUs don't (always) do this. This means that in the worst case, your first solution shows the same performance as the second. But there are GPUs that should execute the first solution faster than the second.
     
  4. Aras

    Recent GPUs still pretty much do this, except that the "block" that executes the same code got a bit smaller (from 1000+ pixels/vertices down to 32 or 64 or so).
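
    A toy model of such a lockstep "block" may make this more concrete. It is written in Python rather than Cg so it can be run anywhere; the block size of 4 and both helper functions are hypothetical and chosen only for illustration. Real hardware does this with per-lane execution masks, not Python lists.

```python
# Toy model of one GPU "block" of work items running an if/else in lockstep.
# The block size and both helper functions are hypothetical, for illustration.

def expensive_path(x):
    return x * x + 1.0   # stands in for the "complicated" side

def cheap_path(x):
    return x             # stands in for the "simplified" side

def run_block(values, conditions):
    """Return (results, sides_executed) for one block of work items."""
    take_if = any(conditions)          # at least one item wants the 'if' side
    take_else = not all(conditions)    # at least one item wants the 'else' side

    # Each side that anyone needs is executed for the WHOLE block...
    if_results = [expensive_path(v) for v in values] if take_if else None
    else_results = [cheap_path(v) for v in values] if take_else else None

    # ...and every item keeps only the result of the side it actually took.
    results = [
        if_results[i] if cond else else_results[i]
        for i, cond in enumerate(conditions)
    ]
    return results, int(take_if) + int(take_else)

values = [1.0, 2.0, 3.0, 4.0]
coherent = run_block(values, [True, True, True, True])
divergent = run_block(values, [True, False, True, False])
print(coherent)    # one side executed
print(divergent)   # both sides executed, half of each side's work discarded
```

    The smaller the block, the more likely it is that all of its items agree and only one side has to run.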

    This talk - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf - from the excellent Beyond Programmable Shading course (http://bps12.idav.ucdavis.edu/) is a great illustration of how GPUs work, and why they work that way.
     
  5. Martin-Kraus

    If I understand the description by Imagination Technologies correctly, the "block" size is 1 fragment for PowerVR's SGX GPU: "The branching granularity on SGX is one fragment or one vertex. This means you don’t have to worry a lot about making branching coherent for an area of fragments." from page 22 in http://www.imgtec.com/powervr/insid...Development Recommendations.1.8f.External.pdf

    EDIT: OK, I got curious. Here is what NVIDIA writes: "GeForce 8800 Series GPUs are designed to process complex DX10 shaders. Programmers will enjoy as fine 16-pixel branching granularity up to 32 pixels in some cases. Compared to the ATI X1900 series, which uses 48 pixel granularity, the GeForce 8800 architecture is far more efficient with 32 pixel granularity for pixel shader programs." from page 32 in http://www.cse.ohio-state.edu/~agrawal/788-su08/Papers/week2/GPU.pdf

    It also says that the GeForce 7 Series had a granularity of 880 fragments.
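
    To get a feeling for what these granularities mean, here is a small Python experiment (a CPU-side sketch, not shader code). It counts how many groups of neighbouring fragments contain both branch outcomes for the group sizes quoted above. The test pattern, a single row of 5280 fragments whose branch condition flips every 100 fragments, is an arbitrary assumption; real screens are 2D and real conditions are messier.

```python
# How often does a group of N neighbouring fragments contain both branch
# outcomes? The striped test pattern below is an arbitrary assumption.

def divergent_fraction(conditions, granularity):
    """Fraction of groups whose items do not all take the same side."""
    groups = [conditions[i:i + granularity]
              for i in range(0, len(conditions), granularity)]
    divergent = sum(1 for g in groups if any(g) and not all(g))
    return divergent / len(groups)

# 5280 fragments in a row; the branch condition flips every 100 fragments.
conditions = [(i // 100) % 2 == 0 for i in range(5280)]

# Granularities quoted in this thread: SGX, GeForce 8800 (16 and 32),
# ATI X1900, GeForce 7 Series.
for granularity in (1, 16, 32, 48, 880):
    print(granularity, round(divergent_fraction(conditions, granularity), 3))
```

    With a granularity of 1 no group can ever diverge, and with 880 every group in this pattern does; the values in between grow with the group size.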
     
    Last edited: Sep 25, 2012
  6. Aras

    Right, PowerVR is kind of a special beast ;) My guess, however, is that going forward the branching granularity will increase (get coarser), as they will be adding more shader execution units to make more powerful GPUs...
     
  7. pvloon

    Thank you for the extensive answer, Aras! That clarifies a lot. I read the paper with great interest.

    So, I will definitely try to remove some branches in the shader I use to draw the particles.

    However, the small snippet comes from a compute shader. I think those couple of dot products will be too little no matter what, but it raises some questions.

    That code is in a (dynamic) for loop. If rotated is false, all particles skip that part of the loop. Is that considered static? Also, in the context of compute shaders, does coherency mean coherency within a thread group?

    Also, dynamic for loops work like branches, afaik? Even if the number of iterations is globally static?
     
  8. Martin-Kraus

    Maybe; I guess we will know as soon as PowerVR updates its documentation for the PowerVR Series 6 later this year or next year. (I always assumed that the shader engine in Series 5 already has multiple shader execution units. In that case I don't see a reason why adding more of them would make a difference. But since Series 6 appears to include major changes, anything can happen. If it doesn't happen with Series 6, then we probably won't see increasing granularity for a couple of years, as they usually increase the number of GPU cores for faster GPUs.)
     
  9. Farfarer
    Joined: Aug 17, 2010
    Posts: 2,249
    Out of curiosity, how does the compiler "flatten" the branch?

    Is it essentially the same thing as the GPU would do (compute both paths, then result1 * step(condition) + result2 * (1 - step(condition)))?
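
    Spelled out, the idea in that question looks roughly like the sketch below. It is Python rather than Cg so it can be run as is; step() mimics Cg's two-argument step(a, x), and path_a / path_b are hypothetical stand-ins for the two sides of the branch. This only illustrates the blend formula; it is not a claim about the instructions a particular compiler emits, and compilers may well use a select-style instruction instead of the multiply-add.

```python
# Sketch of "flattening": compute both sides, then blend with a 0/1 mask.
# step() mimics Cg's step(a, x), which returns 1 when x >= a and 0 otherwise.
# path_a and path_b are hypothetical stand-ins for the two sides of the branch.

def step(a, x):
    return 1.0 if x >= a else 0.0

def path_a(x):
    return 2.0 * x + 1.0

def path_b(x):
    return -x

def branched(x, threshold):
    # Only one of the two paths is evaluated.
    if x >= threshold:
        return path_a(x)
    return path_b(x)

def flattened(x, threshold):
    mask = step(threshold, x)
    # Both paths are always evaluated; the mask picks one result.
    return path_a(x) * mask + path_b(x) * (1.0 - mask)

for x in (-2.0, 0.25, 0.5, 3.0):
    print(x, branched(x, 0.5), flattened(x, 0.5))
```

    One caveat of the multiply-based blend: if the unused path produces an infinity or NaN, multiplying it by zero does not make it go away, which is one reason a select is the safer form.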
     
  10. Martin-Kraus

    Yes.

    If you can read ARB vertex and fragment programs (documented here: http://www.opengl.org/registry/specs/ARB/vertex_program.txt and here: http://www.opengl.org/registry/specs/ARB/fragment_program.txt ), look at the compiled shader for this code:
    Code (csharp):
    Shader "Custom/NewShader" {
       SubShader {
          Pass {
             CGPROGRAM

             #pragma vertex vert
             #pragma fragment frag

             float4 vert(float4 vertexPos : POSITION) : SV_POSITION
             {
                // dynamic branch: the condition depends on per-vertex data
                if (vertexPos.z > 0)
                {
                   return mul(UNITY_MATRIX_MVP, vertexPos);
                }
                else
                {
                   return mul(UNITY_MATRIX_P, -vertexPos);
                }
             }

             float4 frag(void) : COLOR
             {
                // constant red, so only the vertex program is of interest
                return float4(1.0, 0.0, 0.0, 1.0);
             }

             ENDCG
          }
       }
    }
     
  11. Martin-Kraus

    You are right about coarser granularity for Series 6 PowerVR GPUs, see page 24 of the updated PowerVR performance recommendations:
    http://www.imgtec.com/downloadconfi...le=PowerVR.Performance Recommendations.1.0.28