Optimizing tex2D performance

Discussion in 'Shaders' started by exltus, Jan 21, 2021.

  1. exltus

    I am trying to optimize my tex2D calls by packing my single-channel textures into one packed texture. Let's say that I have shader code like this:

    VARIANT 1
    Code (CSharp):
    // Four different textures; I am sampling only one channel from each
    half val1 = tex2D(_Tex1, i.uv1).r;
    half val2 = tex2D(_Tex2, i.uv2).r;
    half val3 = tex2D(_Tex3, i.uv3).r;
    half val4 = tex2D(_Tex4, i.uv4).r;

    VARIANT 2
    Code (CSharp):
    // Sampling the same texture multiple times (the UV is always different, which is why
    // I can't sample it just once); the texture holds the desired data in specific channels
    half val1 = tex2D(_TexPacked, i.uv1).r;
    half val2 = tex2D(_TexPacked, i.uv2).g;
    half val3 = tex2D(_TexPacked, i.uv3).b;
    half val4 = tex2D(_TexPacked, i.uv4).a;

    My question is: will VARIANT 2 be faster than VARIANT 1? Or is there any other way to optimize tex2D calls?
     
  2. bgolus

    Either of VARIANT 1 and VARIANT 2 can be faster or slower than the other.

    There are a few ways to think about optimizing texture sampling.

    Texture compression and/or reduced resolution
    A compressed texture is faster to read from than an uncompressed texture. Or more specifically, it's faster to load from the GPU's RAM into the texture unit's cache. The texture unit, the bit of hardware that actually samples the texture, has some amount of local memory that needs to be filled from the GPU's main memory with a small chunk of the texture being sampled. That takes time, which slows things down. On PC there is a single channel compression format, BC4, but it is only a 2:1 compression ratio over an uncompressed R8. A single RGBA DXT5, ASTC 6x6, or ETC2 texture will be better than four R8 textures, and can be better than four BC4 textures, but can also be worse.

    Single Sampler vs. Unique Samplers
    As mentioned above, GPUs have hardware texture units that do the actual sampling of a texture. These are roughly tied to individual samplers within the shader. So sampling a single sampler2D 4 times can be slower than sampling 4 different sampler2Ds. This is because the single texture may be limited to running in serial, one sample at a time, whereas multiple textures can be run in parallel. Note that in the multiple sampler2D case, all 4 texture properties can be the same texture asset, and doing so may be a performance benefit. If you swap to using DX11 style Texture2D and inline SamplerStates you can simplify it a bit so you only have one texture property for your material and still use separate texture units for each sample. But whether or not this is faster is highly dependent on the hardware. For example, some GPUs may be able to reuse the cached chunk between multiple texture units, meaning it doesn't need to be loaded for each unit individually, but that's only if the UV position on the texture being sampled is within the same chunk for multiple texture units. This is also where the earlier comment that four BC4 textures can be better than a single DXT5 comes from. It also seems like some GPUs don't actually map a unique sampler to a unique texture unit, so there may not actually be a difference between sampling one texture multiple times vs. multiple samplers on that hardware.
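
    As a rough illustration of the Texture2D plus inline SamplerState idea, here is a minimal sketch. It assumes Unity's convention of deriving filter and wrap modes from the sampler's name; the sampler names with their _a to _d suffixes and the SamplePacked helper are made up for this example. Whether the four sampler states actually end up on separate texture units depends on the hardware and the shader compiler.

```hlsl
// One texture object, so the material needs only one texture property.
Texture2D _TexPacked;

// Inline sampler states (assumption: Unity parses "linear" and "repeat"
// out of the names; the _a.._d suffixes are purely illustrative).
SamplerState my_linear_repeat_sampler_a;
SamplerState my_linear_repeat_sampler_b;
SamplerState my_linear_repeat_sampler_c;
SamplerState my_linear_repeat_sampler_d;

// Hypothetical helper: samples one channel per UV set, each through
// its own sampler state, and returns the four values packed together.
half4 SamplePacked(float2 uv1, float2 uv2, float2 uv3, float2 uv4)
{
    half4 result;
    result.r = _TexPacked.Sample(my_linear_repeat_sampler_a, uv1).r;
    result.g = _TexPacked.Sample(my_linear_repeat_sampler_b, uv2).g;
    result.b = _TexPacked.Sample(my_linear_repeat_sampler_c, uv3).b;
    result.a = _TexPacked.Sample(my_linear_repeat_sampler_d, uv4).a;
    return result;
}
```

    In a fragment shader this would be called as SamplePacked(i.uv1, i.uv2, i.uv3, i.uv4). Profile it against the plain tex2D version on the target device before keeping it.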

    DXT5 (aka BC3) and BC4 are block compression style formats, as are most GPU compression formats. DXT5 & BC4 use 4x4 pixel blocks, with DXT5 using 16 bytes per block and BC4 using 8 bytes per block. If your UV positions are close together, the GPU may only need to load 1 block for all 4 samples, meaning it only needs to load between 16 and 64 bytes for DXT5 (assuming bilinear filtering, if you're on a position between the blocks you need to also load the adjacent blocks). For BC4 this is between 32 and 128 bytes, since each sample needs blocks from its own unique texture. If the samples are not close together or the GPU's texture units don't share a cache, the BC4 example doesn't change, but the DXT5 case now uses between 64 and 256 bytes (16 bytes per block x 4 unique sample locations that don't overlap x between 1 and 4 adjacent blocks), so now DXT5 is slower. Most of this is moot on mobile platforms since they don't have a single channel compression format, so just stick with ETC2 or ASTC.
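
    The byte counts above can be written out as follows, where n is a symbol introduced here for the number of 4x4 blocks a single bilinear sample touches (between 1 and 4) and B is the total bytes loaded for all 4 samples:

```latex
\begin{align*}
B_{\text{DXT5, shared blocks}}   &= 16\,n          &&\Rightarrow\ 16 \le B \le 64 \\
B_{\text{BC4, four textures}}    &= 4 \times 8\,n  &&\Rightarrow\ 32 \le B \le 128 \\
B_{\text{DXT5, separate blocks}} &= 4 \times 16\,n &&\Rightarrow\ 64 \le B \le 256
\end{align*}
```

    So the packed DXT5 texture only wins when the 4 samples land in the same blocks and the hardware can share them.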

    Load / Point sampling
    On some low-end hardware, having your texture set to use point filtering vs bilinear (or especially vs trilinear or anisotropic) will be faster. Using DX11 style Texture2D for your textures you can also use the Load() function which is potentially even faster. But this is only true if you don't need bilinear filtering for your samples. You could also potentially use the Gather() functions to get 4 texels of a single color channel and do the filtering in the shader, but I doubt that'd be faster. Load() and Gather() also complicate things if you're using mip maps. I'll also note when I've tried these options I could not measure a performance difference between point sampling a texture with no mip maps and Load(), even though I've seen several other people comment they saw significant performance improvements when switching to it. Could be the difference in GPU, or the exact workload, or user error (on their part or mine).
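
    For reference, a minimal sketch of what the Load() and Gather() paths might look like. It assumes a texture without mip maps, Unity's _TexelSize convention (1/width, 1/height, width, height), and an inline sampler state whose modes come from its name; LoadRed, GatherBilinearRed and my_point_clamp_sampler are names invented for this example. Texel selection exactly on texel boundaries can differ slightly between GPUs, so treat the manual filter as an approximation of hardware bilinear.

```hlsl
// GatherRed needs shader model 5.0 (#pragma target 5.0 in a Unity shader).
Texture2D _TexPacked;
float4 _TexPacked_TexelSize;          // (1/width, 1/height, width, height)
SamplerState my_point_clamp_sampler;  // illustrative inline sampler name

// Unfiltered fetch of the red channel from mip 0 via Load().
half LoadRed(float2 uv)
{
    int2 size = int2(_TexPacked_TexelSize.zw);
    // Load() takes integer texel coordinates and ignores wrap modes,
    // so repeat wrapping is emulated here with frac().
    int2 texel = min(int2(frac(uv) * _TexPacked_TexelSize.zw), size - 1);
    return _TexPacked.Load(int3(texel, 0)).r;   // z component = mip level
}

// Manual bilinear filter of the red channel from a single GatherRed().
half GatherBilinearRed(float2 uv)
{
    // Gather order within the 2x2 footprint:
    // x = (0,1), y = (1,1), z = (1,0), w = (0,0).
    float4 t = _TexPacked.GatherRed(my_point_clamp_sampler, uv);
    // Fractional position between the texel centers of the footprint.
    float2 f = frac(uv * _TexPacked_TexelSize.zw - 0.5);
    return lerp(lerp(t.w, t.z, f.x), lerp(t.x, t.y, f.x), f.y);
}
```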


    The TL;DR version is: try it and see. Just be sure to try it on the platform you're going to use it on, because the result might be different than in the editor.
     