Question W-component in opaque fragment shader

Discussion in 'Shaders' started by YegorStepanov, Jan 11, 2024.

  1. YegorStepanov

    YegorStepanov

    Joined:
    Oct 10, 2017
    Posts:
    15
    Does it make sense to leave the fourth component not equal to 1 in the opaque shaders?

    half4 frag(v2f i)
    {
    half4 tex = tex2D(_MainTex, i.uv);
    return tex;
    }

    If we change it to this:
    half4 frag(v2f i)
    {
    half4 tex = tex2D(_MainTex, i.uv);
    return half4(tex, 1);
    }

    The compiler will change the type of `tex` from half4 to half3. This could be a noticeable bonus in more complex shaders, because it would ideally remove a quarter of the ALU calculations.

    I didn't find any information about this on the Internet.
     
  2. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,339
    You didn’t find anything about it because for this very simple case, the second example is slower.

    Why?

    Sampling a texture always returns a float4 value. Always. Even if the texture being sampled doesn’t have 4 channels. Even if the variable you’re setting isn’t a float4. On PC the conversion from float4 to half4 is free… because they’re the same thing. On mobile, the conversions have a minor cost, but they’re handled efficiently for texture sample values and the performance gains from using half precision in any math you do in the shader outweigh the conversion cost.

    Having the shader return a “half4” with the w overridden to 1 requires an additional instruction to combine the xyz values from tex with a 1 in the w. Otherwise the shader just returns the vector as is. So it’s at least one more instruction than the first example, possibly more depending on how the GPU handles it or if you use a value that isn’t 1.0. (A handful of values, like 1.0, are special and can be used for free in a shader. Other values have to be “created” in another instruction.)

    Lastly, the fixed function code that actually makes use of the output of the shader has no way of knowing the w is hardcoded. It’ll do any math it needs to do with the output values regardless of what they are. It being a 1.0 because it was hardcoded vs a 1.0 sampled from a texture is irrelevant.
     
  3. YegorStepanov

    YegorStepanov

    Joined:
    Oct 10, 2017
    Posts:
    15
    Thank you very much for your answer, bgolus.

    What is the approximate speed of a float-to-half conversion? Looking at the C++ implementation, I would say it's like 10-20 additions. But GPUs probably have super-fast and not entirely accurate intrinsics for such an operation.

    Do I understand correctly that the example below will be faster by two operations if we load 1.0 to the alpha? We will remove 3 multiplication operations, but add one instruction to load 1.0 into the w component.

    I'm guessing that a single-component instruction is 4 times faster than a vector4 instruction. I am calling a one-component `sub-instruction` an operation.

    Also, I assume that a multiplication instruction takes more cycles than a simple instruction like a load.

    Code (c):
    uniform half Value1;
    uniform half Value2;
    uniform half Value3;

    half4 frag(v2f i)
    {
        half4 tex = tex2D(_MainTex, i.uv);
        tex *= Value1;
        tex *= Value2;
        tex *= Value3;
        return half4(tex, 1);
    }
     
  4. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,339
    For desktop GPUs, it's free because there is no conversion being done. All floating point values are handled as full single precision floats. The "half" designation is a lie, because HLSL doesn't have a real half type.

    https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-scalar
    Similarly, the `fixed` type is something that doesn't exist in HLSL at all, and is something Unity added in their code. It's `#define fixed half`, meaning it's also just a single precision float.

    What HLSL does have is a `min16float` type. This means "at least 16 bit precision", which on most desktop hardware will be a full single precision float, because that fulfills the requirement of "at least" 16 bits. The last two generations of desktop GPUs have had actual support for 16 bit floating point math, but it's so buggy and inconsistently implemented that it's off by default. On mobile, many GPUs do have a real half type, so these will end up as those types.
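    As a rough illustration of the `min16float` type, here is a minimal sketch of a fragment function that requests minimum precision for its color math. The names `_MainTex`, `uv` and the 0.5 scale factor are assumptions for the example, not something from the question.

```hlsl
// Assumed texture binding, using the same tex2D style as the question.
sampler2D _MainTex;

float4 frag(float2 uv : TEXCOORD0) : SV_Target
{
    // The sample itself comes back as a float4; the cast only says
    // "at least 16 bits is enough for this value".
    min16float3 col = (min16float3)tex2D(_MainTex, uv).rgb;

    // Math on min16float values may run at 16 or 32 bits,
    // depending on what the hardware and driver decide.
    col *= (min16float)0.5;

    // Convert back explicitly for the render target output.
    return float4((float3)col, 1.0);
}
```

    On hardware without real 16 bit support this should behave exactly like the plain float version, which is the "at least" part of the type's contract.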

    As for the conversion cost on mobile, the actual cost depends on the hardware. There's not necessarily an intrinsic (which, humorously, HLSL does have), but rather the conversion is handled in the hardware nearly for free. It's of course not actually free, but for values returned from a texture sample it'll be converted before the shader ever sees it, and texture sampling is always variable cost, so it's effectively free. Conversions within the shader may cost an additional cycle or two, but you'd have to benchmark your target hardware to find out.
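    For reference, the conversion intrinsics HLSL exposes (presumably the ones meant above) are `f32tof16` and `f16tof32`, available from Shader Model 5. They are mostly used for manual packing rather than for everyday half math. A minimal sketch, where the helper names `PackHalf2` and `UnpackHalf2` are hypothetical:

```hlsl
// Pack two floats into one uint as a pair of 16 bit halves.
// f32tof16 returns the half bits in the low 16 bits of a uint.
uint PackHalf2(float2 v)
{
    return f32tof16(v.x) | (f32tof16(v.y) << 16);
}

// Unpack the pair again. f16tof32 reads the low 16 bits of its input.
float2 UnpackHalf2(uint packed)
{
    return float2(f16tof32(packed & 0xFFFF), f16tof32(packed >> 16));
}
```

    The round trip loses precision, since each value only keeps the bits a half can hold.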

    To the question of whether hardcoding the alpha is faster in that example: it can be, yes, if the GPU it's running on uses a scalar architecture. AMD's RDNA architecture is scalar. GCN is scalar-vector, and previous generations were vector. Intel is generally scalar AFAIK. Nvidia's last several generations are assumed to be scalar, but Nvidia doesn't like telling anyone for sure. Mobile is all over the place.

    If it's a scalar architecture, then yes, hardcoding the output to be a 1 is faster than not. Also your original example didn't do any additional math to the color, which is why it was always worse.
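    To make the operation counting concrete, here is a sketch of the two variants with the per-channel work written out in comments. The function names are hypothetical, and the counts are only the conceptual scalar multiplies; real instruction counts depend on the compiler and the hardware.

```hlsl
// Uniforms mirroring Value1..Value3 from the question.
half Value1;
half Value2;
half Value3;

// Variant A: alpha is returned, so all four channels are live.
// Conceptually 4 scalar multiplies per line, 12 in total.
half4 ScaleKeepAlpha(half4 tex)
{
    tex *= Value1;
    tex *= Value2;
    tex *= Value3;
    return tex;
}

// Variant B: alpha is overwritten with a constant, so the three
// multiplies on w are dead code the compiler can strip.
// Conceptually 3 scalar multiplies per line, 9 in total.
half4 ScaleOpaque(half4 tex)
{
    tex *= Value1;
    tex *= Value2;
    tex *= Value3;
    return half4(tex.rgb, 1.0);
}
```

    On a vector architecture both variants would likely issue the same number of multiply instructions, which is why the saving only shows up on scalar hardware.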

    However, I should note that in both examples where you override the w, that code won't compile.
    Code (csharp):
    half4 tex = ...
    return half4(tex, 1.0);
    That'll give you the error "incorrect number of arguments to numeric-type constructor", because you're passing 5 values to the `half4()` constructor: the 4 from `tex`, and the 1. You want:
    Code (csharp):
    half4(tex.rgb, 1.0);
    You are however correct that you don't need to make the `tex` variable a `half3` for the performance gain, as the compiler will realize the w component is never used and strip out any operations done to it from the compiled shader. However, I'd say it's best practice to use a half3 to begin with if you know from the start you're never going to use the alpha from the texture.

    So instead of:
    Code (csharp):
    half4 tex = tex2D(_MainTex, i.uv);
    tex *= foo;
    tex *= bar;
    return half4(tex.rgb, 1); // need to be explicit about only using the rgb or xyz components
    It's better to be clear from the start with:
    Code (csharp):
    half3 tex = tex2D(_MainTex, i.uv); // conversion from 4 to 3 components is implicit
    tex *= foo;
    tex *= bar;
    return half4(tex, 1);
    These will perform identically, but there will be no chance for error later if you explicitly never want the alpha from that texture to be used.
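    For completeness, a sketch of how the `half3` version might look as a full fragment function. The `v2f` layout, the `SV_Target` semantic and the declarations of `foo` and `bar` are assumptions filled in for the example. The explicit `.rgb` swizzle is also an addition: it should compile to the same thing as the implicit conversion, but avoids the compiler's implicit truncation warning.

```hlsl
// Assumed declarations; in a real shader these come from the
// material properties and the vertex stage.
sampler2D _MainTex;
half foo;
half bar;

struct v2f
{
    float4 pos : SV_POSITION;
    float2 uv  : TEXCOORD0;
};

half4 frag(v2f i) : SV_Target
{
    // Only the color channels are ever read from the texture.
    half3 col = tex2D(_MainTex, i.uv).rgb;
    col *= foo;
    col *= bar;

    // Opaque output: alpha is a constant, never the texture's alpha.
    return half4(col, 1.0);
}
```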

    Lastly, the ALU performance on modern GPUs is so insanely high these days, even on mobile, that micro-optimizations like this rarely result in any meaningful performance gains. They're good to be aware of, for sure. But these are hyper-optimizations, like grouping float and vector math operations together rather than interleaving them. Usually the big performance impacts come from broader strokes.