Branches - how expensive are they?

pvloon · Sep 25, 2012

Hi,

As you might know from the showcase, I'm doing GPU particles, with potentially millions of particles on screen. I need performance to be as high as possible, but I frequently end up with the same question: to branch or not to branch? Take the following:

Code (csharp):

if (someValue != 0)
{
    // complicated calculation and such
}
else
{
    // simplified calculation for this scenario (let's say more than 40% of branches take this path)
}

On the CPU I'm sure this would be faster; on the GPU, however, people seem to stress a lot over branches. Would the branch be slower than just doing the actual calculation? In my case, an example would be:

Code (csharp):

// shortened and simplified a bit
float3 dif = somePos - someOtherPos;
if (rotated) // let's say this happens between 40-60% of the time
{
    dif = float3(dot(dif, axisX), dot(dif, axisY), dot(dif, axisZ));
}

Would this be slower or faster without the branch?

Thanks!
-Arthur

(Oh, and I'm talking about Cg.)

Aras · Sep 25, 2012

TL;DR: Doing a branch to avoid several dot products is almost certainly going to be slower.

Longer answer:

There are various forms of branching. "static branching" is when the same branch is always taken for the whole draw call - e.g. anything that is based on shader properties and not per-vertex/per-pixel data. These branches are usually "quite ok"; there's some overhead in the branch itself, but otherwise the GPU can happily run exactly the same code for all vertices / pixels.

"Dynamic branching" is when you branch on something that can be different for each vertex or pixel. Now, GPUs process quite a lot of vertices / pixels at once, and one of the things that makes GPUs fast is that they run exactly the same code for a bunch of those parallel "work items". If all of them happen to take the same side of a branch - good. If, however, some of them take a different side, then the GPU runs both sides of the branch and just discards some of the results.
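To make the "runs both sides" point concrete, here is a minimal CPU-side sketch in Python (not shader code). It assumes a group of work items executing in lockstep; the group size of 32 and the instruction counts 40 and 5 are made-up numbers for illustration, and the overhead of the branch instruction itself is ignored.

```python
def group_cost(takes_if, cost_if, cost_else):
    """Instruction slots spent by one lockstep group of work items.

    takes_if: one boolean per work item, True if that item takes the `if` side.
    A side is paid for in full as soon as at least one item needs it.
    """
    cost = 0
    if any(takes_if):        # somebody needs the `if` side
        cost += cost_if
    if not all(takes_if):    # somebody needs the `else` side
        cost += cost_else
    return cost


# Hypothetical group of 32 work items, 40-instruction `if`, 5-instruction `else`.
coherent = [True] * 32
divergent = [True] * 31 + [False]

print(group_cost(coherent, 40, 5))   # only the `if` side is executed
print(group_cost(divergent, 40, 5))  # both sides are executed
```

In this model a single work item going the other way is enough to make the whole group pay for both sides.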

So, for a dynamic branch to be a win, it has to be quite "coherent" (worst case: different branch for each pixel; good case: same branch for large portions of the screen), and the code that is under the branch has to be quite big. A rule of thumb is probably "30+ instructions". For example, a branch to detect if a pixel is outside of a light's range and to skip the whole calculation of BRDF attenuation would make sense - it's both coherent on the screen and skips quite a lot of computation.

To everything above, add a note that some older GPUs / shader models don't really support dynamic branches. E.g. for shader model 2.0 (default in Unity), the compiler will "flatten" the branches, so you're not actually having them in the end.

Martin-Kraus · Sep 25, 2012

In my limited understanding the answer depends very much on your GPU. Old GPUs always computed all branches of an if-else block, more recent GPUs don't (always) do this. This means that in the worst case, your first solution shows the same performance as the second. But there are GPUs which should execute the first solution faster than the second.

Aras · Sep 25, 2012

Martin Kraus said: ↑

In my limited understanding the answer depends very much on your GPU. Old GPUs always computed all branches of an if-else block, more recent GPUs don't (always) do this.

They still pretty much do this, except the "block" that executes the same code did get a bit smaller (from 1000+ pixels/vertices down to 32 or 64 or something).

This talk - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf - from the excellent Beyond Programmable Shading course (http://bps12.idav.ucdavis.edu/) is a great illustration of how GPUs work, and why they work that way.

Martin-Kraus · Sep 25, 2012

Aras said: ↑

They still pretty much do this, except the "block" that executes the same code did get a bit smaller (from 1000+ pixels/vertices down to 32 or 64 or something).

If I understand the description by Imagination Technologies correctly, the "block" size is 1 fragment for PowerVR's SGX GPU: "The branching granularity on SGX is one fragment or one vertex. This means you don’t have to worry a lot about making branching coherent for an area of fragments." from page 22 in http://www.imgtec.com/powervr/insid...Development Recommendations.1.8f.External.pdf

EDIT: OK, I got curious. Here is what NVIDIA writes: "GeForce 8800 Series GPUs are designed to process complex DX10 shaders. Programmers will enjoy as fine 16-pixel branching granularity up to 32 pixels in some cases. Compared to the ATI X1900 series, which uses 48 pixel granularity, the GeForce 8800 architecture is far more efficient with 32 pixel granularity for pixel shader programs." from page 32 in http://www.cse.ohio-state.edu/~agrawal/788-su08/Papers/week2/GPU.pdf

It also says that Series 7 had a granularity of 880 fragments.
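To put these granularity figures next to the "40-60% of the time" case from the original question, here is a rough Python sketch. It assumes, pessimistically, that every work item picks its side of the branch independently with probability p; real particle or pixel data is usually more correlated than that, so treat the numbers as a worst-case illustration rather than a measurement.

```python
def divergence_probability(p, group_size):
    """Chance that a group of `group_size` items contains both branch outcomes.

    Assumes each item independently takes the branch with probability p.
    The group does not diverge only if all items agree:
    all take it (p ** n) or none take it ((1 - p) ** n).
    """
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


# Granularities mentioned in this thread: 1, 16, 32, 48 and 880 work items.
for size in (1, 16, 32, 48, 880):
    print(size, divergence_probability(0.5, size))
```

With a granularity of 1 the result is exactly 0, which matches the PowerVR SGX statement above; for the larger sizes, under this independence assumption, practically every group ends up executing both sides.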

Aras · Sep 25, 2012

Right, PowerVR is kind of a special beast. My guess, however, is that going forward the branching granularity will increase, as they will be adding more shader execution units to make more powerful GPUs...

pvloon · Sep 25, 2012

Thank you for the extensive answer, Aras! That clarifies a lot. I read the paper with great interest.

So, I will definitely try to remove some branches in the shader I use to draw the particles.

However, the small snippet comes from a compute shader. I think those couple of dot products will be too little no matter what, but it raises some questions.

That code is in a (dynamic) for loop; if rotated is false, all particles skip that part of the loop. Is that considered to be static? Also, in the context of compute shaders, does coherency refer to a thread group?

Also, dynamic for loops work like branches, afaik? Even if the number of iterations is globally static?

Martin-Kraus · Sep 25, 2012

Aras said: ↑

Right, PowerVR is kind of a special beast. My guess, however, is that going forward the branching granularity will increase, as they will be adding more shader execution units to make more powerful GPUs...

Maybe; I guess we will know as soon as PowerVR updates its documentation for the PowerVR Series 6 later this year or next year. (I always assumed that the shader engine in Series 5 already has multiple shader execution units. In that case I don't see a reason why adding more of them would make a difference. But since Series 6 appears to include major changes, anything can happen. If it doesn't happen with Series 6, then we probably won't see increasing granularity for a couple of years, as they usually increase the number of GPU cores for faster GPUs.)

Farfarer · Sep 25, 2012

Out of curiosity, how does the compiler "flatten" the branch?

Is it essentially the same thing as the GPU would do (compute both paths, then result1 * step(condition) + result2 * (1 - step(condition)))?

Martin-Kraus · Sep 25, 2012

Farfarer said: ↑

Out of curiosity, how does the compiler "flatten" the branch?

Is it essentially the same thing as the GPU would do (compute both paths, then result1 * step(condition) + result2 * (1 - step(condition)))?

Yes.
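As a minimal sketch of that blend, here is a CPU-side Python model (not shader code). The `step` helper mimics the two-argument Cg `step(a, x)`, which returns 1 when x >= a; the names `branched` and `flattened` and the sample values are hypothetical. A real compiler may well emit a dedicated compare/select instruction instead of multiplies, so this only illustrates the idea.

```python
def step(edge, x):
    """Cg-style step(a, x): 1.0 if x >= a, otherwise 0.0."""
    return 1.0 if x >= edge else 0.0


def branched(z, result_if, result_else):
    # The version with a real branch: only one result is needed.
    return result_if if z > 0 else result_else


def flattened(z, result_if, result_else):
    # Both results are already computed; pick one arithmetically.
    # step(z, 0.0) is 1 when z <= 0, so the mask is 1 exactly when z > 0.
    mask = 1.0 - step(z, 0.0)
    return result_if * mask + result_else * (1.0 - mask)


for z in (-2.0, 0.0, 3.0):
    print(z, branched(z, 10.0, -4.0), flattened(z, 10.0, -4.0))
```

One caveat with the purely arithmetic form: if the discarded result is NaN or infinite, multiplying it by 0 does not give 0, so it is not always an exact replacement for a true select.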

If you can read ARB vertex and fragment programs (documented here: http://www.opengl.org/registry/specs/ARB/vertex_program.txt and here: http://www.opengl.org/registry/specs/ARB/fragment_program.txt ), look at the compiled shader for this code:

Code (csharp):

Shader "Custom/NewShader" {
    SubShader {
        Pass {
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag

            float4 vert(float4 vertexPos : POSITION) : SV_POSITION
            {
                if (vertexPos.z > 0)
                {
                    return mul(UNITY_MATRIX_MVP, vertexPos);
                }
                else
                {
                    return mul(UNITY_MATRIX_P, -vertexPos);
                }
            }

            float4 frag(void) : COLOR
            {
                return float4(1.0, 0.0, 0.0, 1.0);
            }
            ENDCG
        }
    }
}

Martin-Kraus · Oct 10, 2012

Aras said: ↑

Right, PowerVR is kind of a special beast. My guess, however, is that going forward the branching granularity will increase, as they will be adding more shader execution units to make more powerful GPUs...

You are right about coarser granularity for Series 6 PowerVR GPUs, see page 24 of the updated PowerVR performance recommendations:
http://www.imgtec.com/downloadconfi...le=PowerVR.Performance Recommendations.1.0.28
