Use array or texture in my shader?

Discussion in 'Shaders' started by sylon, Mar 19, 2019.

  1. sylon

    Joined: Mar 5, 2017 | Posts: 242
    I am working on a surface shader that I'm sending data to. It processes this data in the surf function.
    I was wondering which would be more efficient:

    Sending a Vector4 array to my shader (in Update) and reading that float4[] in a loop,
    or doing the same thing with a small texture?
    I suspect that doing N texture lookups per pixel would be heavier?
     
  2. bgolus

    Joined: Dec 7, 2012 | Posts: 6,846
    The texture is a little heavier, yes, especially if you plan on iterating over all of that data for every pixel being rendered. For a handful of samples, there may be no real difference.
     
  3. jbooth

    Joined: Jan 6, 2014 | Posts: 3,534
    There's kind of a crossover point, though. In MegaSplat I have the problem of needing to store many per-texture properties for 256 textures. Besides there being no good way to store this data with the material other than a texture (Unity doesn't support array properties), setting them on the material and looking them up seemed to be slower than using a texture. My guess is that at some point fetching from wherever that data is stored is slower than looking it up in a small texture, which is heavily cached.

    I've also found really oddball things with using arrays in shaders. For instance, in MicroSplat, which at the time only supported 16 textures (now 32), I thought I'd switch back to properties. Originally I thought, oh, maybe I'll store each value as 4 Vector4s, and do some macro foo to select the proper value from them via index. That way I don't need to manage this texture for looking up the properties. So basically each value is:

    half4 Value0;
    half4 Value1;
    half4 Value2;
    half4 Value3;

    and using some macro foo, I can do something like

    GET(Value, 13);

    which turns into Value3.y;

    It worked, but was unbelievably slow compared to the texture approach.
     
  4. bgolus

    That's a more unusual case, though. If you were actually using an array, it'd be significantly faster. Selecting a single component at runtime from a handful of float4 uniforms is going to be slower no matter how you swing it; you're going to be adding a lot of extra instructions constructing arrays, doing equality comparisons, or dot products. Either way you're forcing the shader to access the data of all of the uniforms, compared to directly accessing an existing array index.

    The cheapest I could think to do this is around 16 instructions, vs 1 for an array access.

    Code (csharp):
    // _Value0 .. _Value3 are the four float4 uniforms set on the material.
    float GetValue(int index)
    {
      float4 values[4] = {_Value0, _Value1, _Value2, _Value3};
      int indexUniform = index / 4;    // which float4 (0..3)
      int indexComponent = index % 4;  // which component (0 = x .. 3 = w)
      return values[indexUniform][indexComponent]; // this is actually several instructions, including a dot product
    }
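
To sanity-check the index arithmetic outside the shader, here is a minimal CPU-side sketch in Python (it is a model, not shader code). The names `split_index` and `get_value` and the sample data are hypothetical; the sketch only mirrors the index / 4 and index % 4 split and confirms that flat index 13 lands on the y component of the fourth uniform, matching the GET(Value, 13) example earlier in the thread.

```python
# CPU-side model of the lookup above; not shader code.
# Each inner list stands in for one float4 uniform (_Value0 .. _Value3).
COMPONENTS = "xyzw"

def split_index(index):
    """Return (uniform number, component letter) for a flat index 0..15."""
    return index // 4, COMPONENTS[index % 4]

def get_value(values, index):
    """Fetch one scalar from four 4-component vectors by flat index."""
    uniform, component = index // 4, index % 4
    return values[uniform][component]

# Hypothetical data: value k is stored at flat index k.
values = [[float(4 * u + c) for c in range(4)] for u in range(4)]

print(split_index(13))        # (3, 'y'), i.e. Value3.y
print(get_value(values, 13))  # 13.0
```

On the CPU this is two integer operations and an indexed load; the point made above is that a GPU cannot index dynamically into separate uniforms or vector components that cheaply.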
     
    Last edited: Mar 20, 2019
  5. sylon

    Thanks guys for your insight.
    I'll try to profile the difference.

    Since I am iterating in the surf function, I understand it would execute for every pixel.
    I am now trying to filter conditions first in the vertex function:
    test there whether the affected pixels are close enough to the vertex position, and pass that (1/0) to the surf function.
    I haven't got that working yet, though.
    And I can't think of any other way to skip my loop than with an if statement.
    But surely, in this case, an if statement is cheaper than looping?

    Really like this stuff though :) hard to stop experimenting.
     
  6. bgolus

    This is a fool's errand outside of very specific and limited cases. The data the fragment shader receives (and thus the surf function, which is just called from the fragment shader function) is interpolated between 3 vertices, so you can't just pass a single index or set of values out of the vertex shader and get anything useful in the surf function. There are ways to reconstruct a limited set of data, like a single index per vertex, but it requires significant mesh processing beforehand. @jbooth's MegaSplat does this, and it works great, but it only needs to reconstruct the 3 texture array indices that it'll blend between.

    Really what you'd want to do is more akin to something like tiled or clustered lighting systems where lights are pre-binned into a grid that the shader then looks up into and iterates over. You could conceivably do this per triangle and use the SV_PrimitiveID (triangle index) to access the data, but you can't get the primitive ID in a surface shader. You'd have to rely on local, world, or UV position.

    I'm not entirely sure what you're looking to do, but there's a ton of literature on tiled and clustered lighting. Depending on your specific use case these may be overkill.
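
The binning idea can be sketched on the CPU. This is a rough Python model, not a real tiled or clustered lighting implementation: the grid size, the helper names `cell_of`, `bin_points` and `points_near`, and the choice of UV space are all assumptions for illustration. Each effect point is registered in every grid cell its radius can reach, so a pixel only iterates over the short list stored for its own cell.

```python
# CPU-side sketch of pre-binning effect points into a uniform grid over UV space.
# GRID, cell_of, bin_points and points_near are hypothetical names, not Unity APIs.
GRID = 8  # cells per axis, assuming UVs in [0, 1]

def cell_of(u, v):
    """Grid cell containing a UV coordinate, clamped to the grid."""
    cx = min(max(int(u * GRID), 0), GRID - 1)
    cy = min(max(int(v * GRID), 0), GRID - 1)
    return cx, cy

def bin_points(points, radius):
    """Map each cell to the indices of points whose radius can reach it."""
    bins = {}
    for i, (u, v) in enumerate(points):
        x0, y0 = cell_of(u - radius, v - radius)
        x1, y1 = cell_of(u + radius, v + radius)
        for cx in range(x0, x1 + 1):
            for cy in range(y0, y1 + 1):
                bins.setdefault((cx, cy), []).append(i)
    return bins

def points_near(bins, u, v):
    """Candidates a pixel at (u, v) would loop over instead of every point."""
    return bins.get(cell_of(u, v), [])

points = [(0.10, 0.10), (0.90, 0.90), (0.12, 0.15)]
bins = bin_points(points, radius=0.05)
print(points_near(bins, 0.11, 0.12))  # only the two nearby points
print(points_near(bins, 0.50, 0.50))  # empty: the loop has nothing to do
```

In a shader the per-cell lists would have to live in a texture or buffer, which is where the complexity (and the "may be overkill" caveat) comes from.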

    Whether the if is cheaper depends on how the shader compiler (and GPU) decides to handle it. In a lot of cases an if in a shader isn't actually a branch as you would expect from traditional CPU-centric languages. An example:
    Code (csharp):
    float dist = 0;
    if (_ValueCount > 0)
    {
        for (int i = 0; i < _ValueCount; i++)
            dist = max(dist, length(pos.xyz - _Values[i].xyz));
    }
    In C#, the expectation would be that the loop never runs when _ValueCount is zero. (With a count of zero the if is unnecessary, since the loop would be skipped anyway, but we'll ignore that for now.) The problem is that the compiled shader may do that, or it may choose to reorder the code to look more like this:
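    Code (csharp):
    float dist = 0;
    float tempDist = 0;
    // The loop always runs; accumulate into tempDist so the max carries across iterations.
    for (int i = 0; i < _ValueCount; i++)
        tempDist = max(tempDist, length(pos.xyz - _Values[i].xyz));
    // The "if" becomes a select on the result instead of a jump around the loop.
    dist = _ValueCount > 0 ? tempDist : dist;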
    In fact, the loop might not even be a loop, but "flattened", like:
    Code (csharp):
    tempDist = max(dist, length(pos.xyz - _Values[0].xyz));
    dist = 0 < _ValueCount ? tempDist : dist;

    tempDist = max(dist, length(pos.xyz - _Values[1].xyz));
    dist = 1 < _ValueCount ? tempDist : dist;

    tempDist = max(dist, length(pos.xyz - _Values[2].xyz));
    dist = 2 < _ValueCount ? tempDist : dist;

    // etc to some arbitrary max count the shader compiler decides.
    // this means if the compiler decides to do 100 iterations, and your _ValueCount is 5
    // it's still doing 100 iterations!
    To be fair, that last outcome is what you should expect from an OpenGL ES 2.0 or DirectX 9 device rather than a modern GL 3.0, GLES 3.0 or DX11 GPU, which is less likely to unroll loops but may still skip a "real" branch if the compiler doesn't think the code is expensive enough to warrant one.
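
To make the cost difference concrete, here is a small CPU-side model in Python (not shader code). `MAX_COUNT` is a hypothetical unroll limit and both function names are made up for illustration; what a real shader compiler emits will vary by device. The sketch shows that the branchy and the flattened forms return the same distance, but the flattened one always pays for the maximum number of iterations.

```python
# CPU-side model of the branchy and flattened code shapes above; not shader code.
import math

MAX_COUNT = 8  # hypothetical unroll limit picked by the "compiler"

def dist_branchy(pos, values, count):
    """Loop guarded by a real branch: runs exactly `count` iterations."""
    dist, iterations = 0.0, 0
    if count > 0:
        for i in range(count):
            dist = max(dist, math.dist(pos, values[i]))
            iterations += 1
    return dist, iterations

def dist_flattened(pos, values, count):
    """Unrolled and predicated: always does MAX_COUNT iterations of work."""
    dist, iterations = 0.0, 0
    for i in range(MAX_COUNT):
        temp = max(dist, math.dist(pos, values[i]))
        dist = temp if i < count else dist  # select, not a jump
        iterations += 1
    return dist, iterations

pos = (0.0, 0.0, 0.0)
values = [(float(i), 0.0, 0.0) for i in range(MAX_COUNT)]
print(dist_branchy(pos, values, 3))    # (2.0, 3)
print(dist_flattened(pos, values, 3))  # (2.0, 8)
```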
     
  7. jbooth

    The real problem with branching is texture access. You can use the gradient or LOD samplers instead of the regular ones so you can branch and still access textures, but if you're not careful, you can break the quad pixel optimization the GPU does to share texture samples between neighboring pixels, which can really slow things down. However, executing both sides of a branch on simple code is usually not that big of a deal. Modern CPUs will do that as well sometimes, because quite frankly calculation is rarely the bottleneck (memory access is). It can still be a bottleneck in shaders, but modern GPUs are pretty beastly things.

    As an example, one of the first things MicroSplat does is a giant loop like this:

    Code (CSharp):
    int i = 0;
    for (i = 0; i < TEXCOUNT; ++i)
    {
       fixed w = splats[i];
       if (w >= weights[0])
       {
          weights[3] = weights[2];
          indexes[3] = indexes[2];
          weights[2] = weights[1];
          indexes[2] = indexes[1];
          weights[1] = weights[0];
          indexes[1] = indexes[0];
          weights[0] = w;
          indexes[0] = i;
       }
       else if (w >= weights[1])
       {
          weights[3] = weights[2];
          indexes[3] = indexes[2];
          weights[2] = weights[1];
          indexes[2] = indexes[1];
          weights[1] = w;
          indexes[1] = i;
       }
       else if (w >= weights[2])
       {
          weights[3] = weights[2];
          indexes[3] = indexes[2];
          weights[2] = w;
          indexes[2] = i;
       }
       else if (w >= weights[3])
       {
          weights[3] = w;
          indexes[3] = i;
       }
    }
    You'd think this would be horrible on a GPU given the common lore, but if TEXCOUNT is 16, for instance, this can save you dozens of texture samples. Yet it's a ton of branches.
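
For readers who want to check what the loop computes, here is a CPU-side Python model of the same top-4 selection (not the MicroSplat source). The function name `top4` and the zero-initialised `weights` and `indexes` are assumptions, since the snippet above does not show how they start out. Each weight is inserted into the first slot it is greater than or equal to, and the lower slots shift down by one.

```python
# CPU-side model of the top-4 selection loop above; not shader code.
def top4(splats):
    """Keep the four largest weights and their indices in descending order."""
    weights = [0.0] * 4   # assumed initial state; the original does not show it
    indexes = [0] * 4
    for i, w in enumerate(splats):
        for slot in range(4):
            if w >= weights[slot]:
                # Shift lower entries down one slot, then insert at `slot`.
                weights[slot + 1:] = weights[slot:3]
                indexes[slot + 1:] = indexes[slot:3]
                weights[slot] = w
                indexes[slot] = i
                break
    return weights, indexes

splats = [0.1, 0.5, 0.05, 0.3, 0.02, 0.4]
print(top4(splats))  # ([0.5, 0.4, 0.3, 0.1], [1, 5, 3, 0])
```

After this pass only the four texture layers in `indexes` need sampling, which is where the saving in texture fetches comes from.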

    @bgolus if you can think of a neat trick to do that faster, let me know. It has to finish before MicroSplat can really start its shading, so on low-end devices it can be a bottleneck; it's basically just sorting the weights and indexes into the texture arrays so I can sample only the top N of them.
     
    Last edited: Mar 20, 2019
  8. sylon

    I don't really understand this.
    Perhaps I didn't explain myself correctly.

    I now have my "filter idea" working, at least functionally.
    In my fragment function, I check the incoming data, let's say an xy UV coordinate, against the UV coordinates I have available.
    I now do the same distance check in my vertex program.
    If that distance is smaller than a threshold, I pass a 1 to my fragment function (in this case I just 'hacked' it into color.a). If not, a zero.
    Code (CSharp):
    i.color.a*=max(sign(dist-0.005), 0.0);
    Then I use an if statement in my fragment function to check that color.a value.
    Code (CSharp):
    if(i.color.a==1)
    {
        for (int count = 0; count < 16; count++)
        {
    ...
    So the idea is to only execute that loop if the effect is within range of the vertices, which would be in perhaps 1% of the checks done.
    But that only makes sense if the loop really is skipped.

    Having trouble profiling. Seems to be a weakness within me :)
     
  9. jbooth

    It's unlikely that loop is going to be skipped. You can force it with UNITY_BRANCH, but that doesn't mean it's an optimization to do so.
     
  10. bgolus

    With that code, the loop will only execute inside a triangle whose 3 vertices are all in range. That's the only time i.color.a will be == 1, and even then it won't actually be for many of the pixels due to interpolation and floating point error, so you'd need to test for >= 0.9999. If you've got big polygons, or the effect area is small such that only one or two vertices are "within range", the value will be < 1 everywhere in the fragment shader.
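
A quick CPU-side model in Python shows why (it is not shader code). The `interpolate` helper and the sample barycentric weights are hypothetical, and the model uses plain linear interpolation, ignoring the perspective correction a real rasterizer applies.

```python
# CPU-side model of how a per-vertex 0/1 flag reaches the fragment shader.
def interpolate(flags, bary):
    """Barycentric blend of three per-vertex values (linear, no perspective)."""
    return sum(f * b for f, b in zip(flags, bary))

center = (1 / 3, 1 / 3, 1 / 3)   # a pixel in the middle of the triangle
near_v0 = (0.98, 0.01, 0.01)     # a pixel right next to vertex 0

print(interpolate((1, 1, 0), center))   # about 0.667: `== 1` fails
print(interpolate((1, 0, 0), near_v0))  # about 0.98: fails even next to a flagged vertex
print(interpolate((1, 1, 1), center))   # close to 1, but not guaranteed to be exactly 1.0
```

Only the all-flagged triangle gets near 1, which is why a tolerance such as >= 0.9999 is safer than an exact comparison.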
     
  11. sylon

    OK, but I tried it and, like I said, at least functionally it works.
    I have added an else statement which makes the Albedo red, and the result is a few patches which are not red :)

    But UNITY_BRANCH only has an effect for HLSL shaders, right?

    I looked at the compiled code of the shader (OpenGL3), and with or without UNITY_BRANCH, I see a branch there which to my non-shader eye looks good. There is an if statement and the loop is inside it.
    But as I now understand, there is no guarantee the expensive part won't run on the device.
     
  12. bgolus

    Yeah, UNITY_BRANCH is a macro for HLSL's [branch] which forces a real branch. There is no equivalent in OpenGL in any version, and the choice whether or not to do the branch is up to the shader compiler on the device you're running on. There's no way to know if the device will actually do a branch or not unless you look at the OpenGL shader binary produced by that device's shader compiler.