Use array or texture in my shader?

sylon · Mar 19, 2019

I am working on a surface shader that I'm sending data to. It processes this data in the surf function.
I was wondering which would be more efficient:

Sending a Vector4 array to my shader (in Update) and reading that float4[] in a loop.
Or doing the same thing, but with a small texture?
I would suspect that having x texture lookups per pixel would be heavier?

bgolus · Mar 20, 2019

The texture is a little heavier, yes, especially if you plan on iterating over all of that data for every pixel being rendered. For a handful of samples, there may be no real difference.
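
To make the two options concrete, here is a minimal sketch of both read paths in a Unity shader. All names (_Points, _PointCount, _PointTex, MAX_POINTS) are made up for illustration and are not from the original post. The array is assumed to be uploaded from script with Material.SetVectorArray, and the texture is assumed to be a point-filtered, mip-free MAX_POINTS x 1 float texture holding the same values.

```hlsl
// Hypothetical names throughout; nothing here is from the original posts.
#define MAX_POINTS 16

float4 _Points[MAX_POINTS];   // assumed to be set with Material.SetVectorArray
int _PointCount;              // number of valid entries
sampler2D _PointTex;          // assumed MAX_POINTS x 1 float texture with the same data
float4 _PointTex_TexelSize;   // filled in by Unity: x = 1 / width

// Option A: read the uniform array directly.
float NearestFromArray(float3 pos)
{
    float nearest = 1e10;
    for (int i = 0; i < _PointCount; i++)
        nearest = min(nearest, distance(pos, _Points[i].xyz));
    return nearest;
}

// Option B: one explicit-LOD texture fetch per entry.
float NearestFromTexture(float3 pos)
{
    float nearest = 1e10;
    for (int i = 0; i < _PointCount; i++)
    {
        // Sample the centre of texel i; tex2Dlod needs no derivatives,
        // so it can be used inside a loop with a dynamic count.
        float4 uv = float4((i + 0.5) * _PointTex_TexelSize.x, 0.5, 0, 0);
        nearest = min(nearest, distance(pos, tex2Dlod(_PointTex, uv).xyz));
    }
    return nearest;
}
```

Either function could be called from surf with a local or world position. Which one is faster depends on the entry count and the target GPU, so profiling both remains the only reliable answer.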

jbooth · Mar 20, 2019

There's kind of a crossover point, though. In MegaSplat I have the problem of needing to store many per-texture properties for 256 textures. Besides there being no good way to store this data with the material other than a texture (Unity doesn't support array properties), setting them on the material and looking them up seemed to be slower than using a texture. My guess is that at some point fetching from wherever that data is stored becomes slower than reading from a small texture, which is heavily cached.

I've also found really oddball things with using arrays in shaders. For instance, in MicroSplat, which at the time only supported 16 textures (now 32), I thought I'd switch back to properties. Originally I thought, oh, maybe I'll store each value as four Vector4s and do some macro foo to select the proper value via index. That way I don't need to manage this texture for looking up the properties. So basically each value is:

half4 Value0;
half4 Value1;
half4 Value2;
half4 Value3;

and using some macro foo, I can do something like

GET(Value, 13);

which turns into Value3.y;

It worked, but was unbelievably slow compared to the texture approach.

bgolus · Mar 20, 2019

That's a more unique case though. If you were actually using an array, it'd be significantly faster. Selecting a single component at runtime from a handful of float4 uniforms is going to be slower no matter how you swing it; you're going to be adding a lot of extra instructions constructing arrays, doing equal comparisons, or dot products. Either way you're forcing the shader to access the data of all of the uniforms compared to directly accessing an existing array index.

The cheapest I could think to do this is around 16 instructions, vs 1 for an array access.

Code (csharp):

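float GetValue(int index)
{
    // Pack the four uniforms into a temporary array so they can be indexed at runtime.
    float4 values[4] = {_Value0, _Value1, _Value2, _Value3};
    int indexUniform = index / 4;   // which float4 holds the value
    int indexComponent = index % 4; // which component (x, y, z, w) within it
    return values[indexUniform][indexComponent]; // this is actually several instructions, including a dot product
}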
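
For comparison, the plain array version referred to above might look like the sketch below. _ValueArray and GetValueFromArray are hypothetical names, and filling the array with Material.SetFloatArray is an assumption about the script side.

```hlsl
// Hypothetical 16-entry array replacing _Value0.._Value3.
// Assumed to be filled from script with Material.SetFloatArray.
float _ValueArray[16];

float GetValueFromArray(int index)
{
    // A single indexed constant read; no component selection needed.
    return _ValueArray[index];
}
```

One trade-off to keep in mind: under HLSL constant buffer packing rules each element of a float array takes its own 16-byte slot, so this spends more constant space in exchange for the cheaper lookup.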

sylon · Mar 20, 2019

Thanks guys for your insight.
I'll try to profile the difference.

Since I am iterating in the surf function, I understand it would execute for every pixel.
I am now trying to filter conditions first in the vertex function:
test there whether the affected pixels are close enough to the vertex position, and pass that (1/0) to the surf function.
Haven't got that working yet though.
And I can't think of any other way to skip my loop than with an if statement.
But surely, in this case, an if statement is cheaper than looping?

Really like this stuff though; it's hard to stop experimenting.

bgolus · Mar 20, 2019

sylon said: ↑

filter conditions first in the vertex function

This is a fool's errand outside of very specific and limited cases. The data the fragment shader receives (and thus the surf function, which is just called from the fragment shader function) is interpolated between 3 vertices, so you can't just pass a single index or set of values out of the vertex shader and get anything useful in the surf function. There are ways to reconstruct a limited set of data, like a single index per vertex, but it requires significant mesh processing beforehand. @jbooth's MegaSplat does this, and it works great, but it only needs to reconstruct the 3 texture array indices that it'll blend between.

Really what you'd want to do is more akin to something like tiled or clustered lighting systems where lights are pre-binned into a grid that the shader then looks up into and iterates over. You could conceivably do this per triangle and use the SV_PrimitiveID (triangle index) to access the data, but you can't get the primitive ID in a surface shader. You'd have to rely on local, world, or UV position.

I'm not entirely sure what you're looking to do, but there's a ton of literature on tiled and clustered lighting. Depending on your specific use case these may be overkill.

sylon said: ↑

But surely, in this case, an if statement is cheaper than looping?

Depends on how the shader compiler (and GPU) decides to handle the "if". In a lot of cases an if in a shader isn't actually a branch as you would expect from traditional CPU-centric languages. An example:

Code (csharp):

float dist = 0;
if (_ValueCount > 0)
{
    for (int i = 0; i < _ValueCount; i++)
        dist = max(dist, length(pos.xyz - _Values[i].xyz));
}

In C#, the expectation would be that the loop never runs. Though, if the count was zero, it's unnecessary code as it would be skipped anyway... we'll ignore that for now. The problem is the compiled shader may do that, or it may choose to reorder the code to look more like this:

Code (csharp):

float dist = 0;
float tempDist = 0;
for (int i = 0; i < _ValueCount; i++)
    tempDist = max(tempDist, length(pos.xyz - _Values[i].xyz)); // accumulate in tempDist; dist stays 0 until the select below
dist = _ValueCount > 0 ? tempDist : dist;

In fact, the loop might not even be a loop, but "flattened", like:

Code (csharp):

tempDist = max(dist, length(pos.xyz - _Values[0].xyz));
dist = 0 < _ValueCount ? tempDist : dist;
tempDist = max(dist, length(pos.xyz - _Values[1].xyz));
dist = 1 < _ValueCount ? tempDist : dist;
tempDist = max(dist, length(pos.xyz - _Values[2].xyz));
dist = 2 < _ValueCount ? tempDist : dist;
// etc. to some arbitrary max count the shader compiler decides.
// This means if the compiler decides to do 100 iterations, and your _ValueCount is 5,
// it's still doing 100 iterations!

To be fair, that last outcome is what you should expect from an OpenGL ES 2.0 or DirectX 9 device rather than a modern GL 3.0, GLES 3.0 or DX11 GPU, which is less likely to unroll loops, but may still not do a "real" branch if the compiler doesn't think the skipped code is expensive enough to warrant one.

jbooth · Mar 20, 2019

The real problem with branching is with texture access. You can use the gradient or LOD samplers instead of the regular ones so you can branch and access textures, but if you're not careful, you can break the quad pixel optimization the GPU does to share texture samples between neighboring pixels, which can really slow things down. However, executing both sides of the branch on simple code is usually not that big of a deal. Modern CPUs will do that as well sometimes, because quite frankly calculation is rarely the bottleneck (memory access is). It can still be a bottleneck in shaders, but modern GPUs are pretty beastly things.

As an example, one of the first things MicroSplat does is a giant loop like this:

Code (CSharp):

int i = 0;
for (i = 0; i < TEXCOUNT; ++i)
{
    fixed w = splats[i];
    if (w >= weights[0])
    {
        weights[3] = weights[2];
        indexes[3] = indexes[2];
        weights[2] = weights[1];
        indexes[2] = indexes[1];
        weights[1] = weights[0];
        indexes[1] = indexes[0];
        weights[0] = w;
        indexes[0] = i;
    }
    else if (w >= weights[1])
    {
        weights[3] = weights[2];
        indexes[3] = indexes[2];
        weights[2] = weights[1];
        indexes[2] = indexes[1];
        weights[1] = w;
        indexes[1] = i;
    }
    else if (w >= weights[2])
    {
        weights[3] = weights[2];
        indexes[3] = indexes[2];
        weights[2] = w;
        indexes[2] = i;
    }
    else if (w >= weights[3])
    {
        weights[3] = w;
        indexes[3] = i;
    }
}

You'd think this would be horrible on a GPU given the common lore, but if TEXCOUNT is 16, for instance, this can save you dozens of texture samples. Yet it's a ton of branches.

@bgolus if you can think of a neat trick to do that faster, let me know. It has to be finished before MicroSplat can really start its shading, so on low-end devices it can be a bottleneck; it's basically just sorting the weights and indexes into the texture arrays so I can just sample the top N of them.

sylon · Mar 21, 2019

bgolus said: ↑

This is a fool's errand outside of very specific and limited cases. The data the fragment shader receives (and thus the surf function, which is just called from the fragment shader function) is interpolated between 3 vertices, so you can't just pass a single index or set of values out of the vertex shader and get anything useful in the surf function.

I don't really understand this.
Perhaps I didn't explain myself correctly.

I now have my "filter idea" working (at least functionally).
In my fragment function, I check the incoming data, let's say an xy UV coordinate, against the UV coordinates I have available.
I now do the same distance check in my vertex program.
If that distance is smaller than a threshold, I pass a 1 to my fragment function (in this case I just 'hacked' it into color.a). If not, a zero:

Code (CSharp):

i.color.a *= max(sign(0.005 - dist), 0.0); // 1 when dist < 0.005, otherwise 0

Then I use an if statement in my fragment function to check that color.a value.

Code (CSharp):

if (i.color.a == 1)
{
    for (int count = 0; count < 16; count++)
    {
        ...

So the idea is to only execute that loop if the effect is within range of the vertices. Which would be perhaps in 1% of the checks done.
But then that only makes sense if that loop really is skipped.

Having trouble profiling. Seems to be a weakness within me

jbooth · Mar 21, 2019

sylon said: ↑

So the idea is to only execute that loop if the effect is within range of the vertices. Which would be perhaps in 1% of the checks done.
But then that only makes sense if that loop really is skipped.

Having trouble profiling. Seems to be a weakness within me

It's unlikely that loop is going to be skipped. You can force it with UNITY_BRANCH, but that doesn't mean it's an optimization to do so.

bgolus · Mar 21, 2019

sylon said: ↑

So the idea is to only execute that loop if the effect is within range of the vertices. Which would be perhaps in 1% of the checks done.
But then that only makes sense if that loop really is skipped.

With that code, the loop will only execute for pixels inside a triangle whose 3 vertices are all in range. That's the only time i.color.a will be == 1, and even then it won't actually be for many of the pixels due to interpolation and floating point error, so you'd need to be testing for >= 0.9999. If you've got big polygons, or the effect area is small such that only one or two vertices are "within range", the value will be < 1 everywhere in the fragment shader.
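
The reason is that the fragment shader sees a weighted blend of the three vertex values, with weights that are positive inside the triangle and sum to 1. A 0/1 flag therefore only reaches 1 when all three vertices carry a 1, and is somewhere between 0 and 1 when only some do. A minimal sketch of the two possible tests, assuming the flag arrives in i.color.a as in the post above (the helper names are hypothetical):

```hlsl
// inRange is the interpolated per-vertex 0/1 flag (i.color.a in the post above).
// Both helper names are made up for illustration.

// True only inside triangles where all three vertices were flagged,
// with a small tolerance for interpolation and floating point error.
bool AllVerticesInRange(float inRange)
{
    return inRange >= 0.9999;
}

// True inside triangles where at least one vertex was flagged.
bool AnyVertexInRange(float inRange)
{
    return inRange > 0.0;
}
```

Neither test helps when the effect sits in the middle of a large triangle with no vertex in range, since the flag is then 0 at all three vertices and 0 everywhere in between.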

sylon · Mar 22, 2019

bgolus said: ↑

With that code, the loop will only execute for pixels inside a triangle whose 3 vertices are all in range. That's the only time i.color.a will be == 1, and even then it won't actually be for many of the pixels due to interpolation and floating point error, so you'd need to be testing for >= 0.9999. If you've got big polygons, or the effect area is small such that only one or two vertices are "within range", the value will be < 1 everywhere in the fragment shader.

OK, but I tried it and, like I said, at least functionally it works.
I have added an else statement which makes the Albedo red, and the result is a few patches which are not red.

jbooth said: ↑

It's unlikely that loop is going to be skipped. You can force it with UNITY_BRANCH, but that doesn't mean it's an optimization to do so.

But UNITY_BRANCH only has an effect for HLSL shaders, right?

I looked at the compiled code of the shader (OpenGL3), and with or without UNITY_BRANCH, I see a branch there which to my non-shader eye looks good. There is an if statement and the loop is inside it.
But as I now understand, there is no guarantee the expensive part won't run on the device.

bgolus · Mar 22, 2019

Yeah, UNITY_BRANCH is a macro for HLSL's [branch] which forces a real branch. There is no equivalent in OpenGL in any version, and the choice whether or not to do the branch is up to the shader compiler on the device you're running on. There's no way to know if the device will actually do a branch or not unless you look at the OpenGL shader binary produced by that device's shader compiler.
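
As a sketch of what that looks like in source, here is the distance loop from earlier in the thread with the hints applied. MAX_VALUES and MaxDistance are hypothetical names. UNITY_BRANCH and UNITY_LOOP are the macros from Unity's HLSLSupport.cginc that expand to HLSL's [branch] and [loop] attributes; on OpenGL targets they are only a request at best, as described above.

```hlsl
#define MAX_VALUES 16          // hypothetical upper bound
float4 _Values[MAX_VALUES];
int _ValueCount;

float MaxDistance(float3 pos)
{
    float dist = 0;

    UNITY_BRANCH               // ask for a real branch instead of a select
    if (_ValueCount > 0)
    {
        UNITY_LOOP             // ask the compiler not to unroll the loop
        for (int i = 0; i < _ValueCount; i++)
            dist = max(dist, length(pos - _Values[i].xyz));
    }

    return dist;
}
```

Whether the hints actually pay off still has to be measured on the target device, since a forced branch is not automatically faster than the flattened version.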
