Question: Some more advanced shader questions

Discussion in 'Shaders' started by BOXOPHOBIC, Dec 27, 2021.

  1. BOXOPHOBIC

    Hi there! I have a few questions regarding shaders, and I hope some of you with more experience can help me out. So without further ado, here they are:

    1. Branching
    When using branching, if each 2x2 pixel block takes the same path we are good.
    - What happens if that 2x2 block is discarded?
    - What happens if it is overlapped by another opaque or alpha-cutout mesh (like some grass)?
    Do we lose that optimization? Are those pixels shaded anyway even if overlapped?

    2. Branching on high-ish-end mobile
    In my tests, branching on mobile with uniform floats always seems to have a major impact on performance, so I have always avoided it. Unity seems to do the same in BiRP/URP. Tested on a Xiaomi Mi Mix 2 (Snapdragon 835). Any ideas?
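    For reference, here's the kind of uniform-float branch I'm testing, as a minimal sketch (the property and texture names are just placeholders):

    Code (CSharp):
    // _UseDetail is a material float set to 0 or 1, so every pixel takes the same path
    UNITY_BRANCH
    if (_UseDetail > 0.5)
    {
        half4 detail = tex2D(_DetailTex, uv * _DetailTiling);
        albedo.rgb *= detail.rgb;
    }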

    3. 2x2
    Regarding branching, I got 2x2 from Jason Booth's article: https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bfc83694f2. Some articles say 8x8, some 32x32, some 64x64. Is this related to the architecture? Until today I only heard about 32 and 64.

    4. Interpolators
    In my shaders I often do many calculations per vertex, and because of how the shaders are set up, I use quite a few interpolators, maybe 5 to 7. Let's consider some simple math, like calculating a scale and offset for UVs and then passing it to the pixel shader. Can this interpolator be more expensive than the actual code?
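    To make the question concrete, a minimal sketch of what I mean, assuming the usual appdata/v2f structs (_UvScaleOffset is a placeholder float4 property):

    Code (CSharp):
    // per-vertex: do the UV math once here and pay for a TEXCOORD interpolator
    v2f vert (appdata v)
    {
        v2f o;
        o.pos = UnityObjectToClipPos(v.vertex);
        o.uv  = v.uv * _UvScaleOffset.xy + _UvScaleOffset.zw; // passed down to the pixel shader
        return o;
    }

    The alternative is passing the raw UV down and doing the same multiply-add in the pixel shader instead.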

    5. Samplers
    Let's say I have 6 textures used for blending and the texture limit is not exceeded. What would you suggest in terms of performance, and why:
    a. Use 6 samplers, each texture with its own
    b. Use 2 samplers, one shared by each Albedo/Normal/Mask set
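    Roughly, the two options in Unity macro terms, as a sketch (texture names are placeholders):

    Code (CSharp):
    // a. each texture samples with its own sampler state
    half4 albedoA = SAMPLE_TEXTURE2D(_AlbedoA, sampler_AlbedoA, uv);
    half4 normalA = SAMPLE_TEXTURE2D(_NormalA, sampler_NormalA, uv);
    // b. one sampler shared by the whole Albedo/Normal/Mask set
    half4 albedoB = SAMPLE_TEXTURE2D(_AlbedoB, sampler_AlbedoA, uv);
    half4 normalB = SAMPLE_TEXTURE2D(_NormalB, sampler_AlbedoA, uv);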

    6. Noise texture vs math
    Classic question. Using an uncompressed 256×256 noise texture sampled in world space vs a 3D noise (approx. 80 math instructions when checked with Unity's compile-and-show-code), I get the same fps on mobile in a relatively complex scene. Some posts say sampling the texture is cheaper, but I don't see any difference in fps. Is sampling a texture that expensive, or is the simple math used for noise that cheap?

    7. Texture sampling size
    Is sampling 32x32 pixels from a 32x32 texture as fast as sampling 32x32 pixels from a 4K texture?

    8. Alpha Test
    Why is clip() so expensive that HDRP has an option to bypass it in the Forward or Deferred pass and only perform it in the Depth pass, where it works with the early-Z optimization? The difference is big if it is not performed in the extra passes: I see up to a 10 fps increase when the bypass is enabled. Or is it all about the early-Z optimization?

    999. Should I care that much?
    In most cases what I've noticed is that shaders are just super fast and GPUs just too complex. I can throw a bunch of features in there and in most cases I don't see much difference in performance. I mostly optimize shaders by ear: avoiding too much texture sampling, using textures for UV manipulation, trying to move as much as possible to the vertex shader, and generally being careful with what I do and how I combine different features to get the most out of my shaders. What is your approach, and how do you debug performance?

    Thanks. More will come for sure :)
     
  2. bgolus

    Those 2x2 blocks are called pixel quads. Taking different branches within the same pixel quad isn’t the end of the world. It just means all 4 fragments cost the same as all of the taken code paths.

    The same rules apply. The shader code is still running before the discard. Think of discard as a special case branch. If all fragments end up hitting the same discard, great. If not, then those pixel quads are a bit more expensive as it’s running both the discard code path and the other code path for all 4 fragments.

    If a single pixel is visible when rendered, the 4 pixel fragments of the pixel quad run. No matter what. Doesn’t matter if it’s another fully opaque surface or one using alpha testing / discard.

    Mostly fine, especially when doing branches on material properties. But mobile devices aren’t as powerful as desktop GPUs, and branching isn’t entirely free, so the costs are more obvious.

    The “same branch on all fragments in the pixel quad” rule is more accurately “in the warp/wave”. GPUs are SIMD processors that run a certain number of threads in parallel. How many depends on the architecture. AMD’s latest RDNA 2.0 for example can actually switch between 32 or 64 threads. So 8x8 groups of pixels all need to run the same branch, or they pay the cost of all code paths taken.

    On desktop & consoles this is absolutely true. If you can recalculate a value in the fragment shader and pass less data it’s almost always a win. Mobile … is supposed to be similar but it’s less obviously true. I’ve seen massive perf improvements doing per vertex calculations and passing more data between the vertex and fragment on relatively recent mobile devices (Quest).

    Use the one that’s easier for you to write. On Nvidia there can be a benefit to having unique samplers. On older AMD supposedly there can be a benefit to reusing one sampler for everything.

    But it’s complicated, and generally I’d suggest not worrying about it unless you run out of samplers. On mobile it matters even less since that usually doesn’t even support separate sampler states.

    Texture sampling can be a lot cheaper, especially if you're looking to use a high quality noise. The old frac(sin(dot())) noise is stupendously cheap (but also really not very good). Perlin noise can be expensive to calculate if you use a lot of noise octaves, cheapish if you limit it to only 2 or 3. If you want blue noise you'll want to use a texture.
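    For reference, the frac(sin(dot())) noise mentioned above is the widely copied one-liner (its exact origin is murky):

    Code (CSharp):
    // classic cheap 2D hash noise; very fast, but low quality and precision-sensitive
    float hashNoise(float2 uv)
    {
        return frac(sin(dot(uv, float2(12.9898, 78.233))) * 43758.5453);
    }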

    Should be about the same.

    Early Z reduces the cost of over shading. Over shading is when an object's pixels are rendered, but do not appear in the final image because something else later renders in front of it. Grass and foliage are common cases where over shading can be a large part of the cost. A depth prepass can fill in the depth buffer with a very inexpensive fragment shader that has a much lower over shading cost. Then rendering the real shader later, even for "cheap" deferred shaders, means there are fewer fragments being rendered that don't appear in the final image. The 2x2 thing still exists though, so it's not 100% perfect.
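    For context on the clip() question, alpha testing is just a clip()/discard in the fragment shader, e.g. this minimal sketch using Unity's usual _MainTex/_Cutoff naming:

    Code (CSharp):
    half4 frag (v2f i) : SV_Target
    {
        half4 col = tex2D(_MainTex, i.uv);
        clip(col.a - _Cutoff); // discards the fragment when alpha is below the cutoff
        return col;
    }

    A depth prepass can run a shader like this to fill the depth buffer, and the later passes can then skip the clip() entirely and rely on early-Z against the prepass depth, which is presumably what the HDRP bypass option is doing.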

    Both of those things are very much "best practices" … for GPUs that mostly don't exist anymore, at least on desktop. Dependent texture reads (sampling a texture using the results of another texture sample) were a big no-no for the first generation or two of GPUs with programmable shader support. It basically stopped being a concern starting with later era DirectX 9 class GPUs nearly 15 years ago.

    But they’re also not a bad thing to be mindful of for mobile, and good questions to ask in general.
     
  3. BOXOPHOBIC

    Thanks Ben for the detailed answers!
     
  4. BOXOPHOBIC

    9. Branching at Unity
    I'm wondering why Unity almost never uses branches in their shaders. If I check BiRP, I can only find 2-3 UNITY_BRANCH usages, mainly for cubemap blending and shadows. There are about 4 UNITY_BRANCH usages in URP, and some more in HDRP. Why is Unity so conservative when it comes to branches?

    10. Branching in HDRP
    HDRP is supposed to run on modern hardware, so I was expecting more branching, but instead they use tons of plain ifs. Ifs without a branch attribute will run both paths, right?
     
  5. bgolus

    Unity's BiRP shaders were originally written some 7 years ago and needed to support a very different landscape of GPUs. Unity didn't even have full DirectX 11 support when many of the shaders were written, so Direct3D 9 and OpenGL ES 2.0 were the assumed targets. Neither of those has very robust (or even existing) branching support.

    Today's Direct3D 11 & 12 class desktop GPUs and OpenGL ES 3.1+ mobile GPUs are far better at branching, though it's still not always great on mobile, hence its quite conservative usage in URP (which still has to support mobile and the Nintendo Switch). HDRP assumes Direct3D 11.1 or better, hence its far more frequent usage.

    Just because there's no UNITY_BRANCH or [branch] (which is what that macro is) in the shader doesn't mean there aren't branches. Any if statement or for loop can be a branch if the shader compiler decides to make it one. And indeed any for loop that doesn't have a fixed count will be a branch on a modern GPU; on older graphics APIs those would have been compiler warnings or even errors on certain platforms! URP's light loop for example has a fixed light count on low end mobile, but is dynamic for other platforms. An if statement will be a branch if the compiler thinks it'll be more efficient as one; [branch] just tells the shader compiler you really, really want this part to be a dynamic branch no matter what. AFAIK that only works for Direct3D. GLSL has no option to let you force a branch or not, so it's always up to the compiler. I'm not sure how Metal or Vulkan handle things. I believe with Vulkan you have to be explicit about whether you want a branch or not, but Unity generates shaders for that target by converting the output from the Direct3D shader compiler into SPIR-V (the shader format Vulkan uses), so you'll likely get whatever decisions that compiler made in Vulkan. Metal may be the same, but I don't know.

    And as a final mind-funk ... using functions like step() or inline conditionals like foo > 1.0 ? bar : baz can still compile into branches if the compiler decides to.
     
  6. BOXOPHOBIC

    Alright, this explains a lot. I always thought you needed to explicitly use UNITY_BRANCH for an if to become an actual branch. I was wondering why adding it or not behaved the same way, and this explains it. It was just a quick test and I didn't actually check the compiled code. Thanks again for the explanations!
     
  7. BOXOPHOBIC

    10. Mipmaps
    If no mipmaps are used for a texture, is using tex2Dbias with a bias of 0 more optimized, or does it not really matter:

    finalColor = tex2Dbias( _Albedo, float4( uv_Albedo, 0, 0.0) );
     
  8. bgolus

    This likely depends on the hardware. This is definitely something I've seen recommended frequently in the past, but I've never actually been able to measure a perf difference when using a texture without mips in the use cases I've had.

    Certainly for some hardware I would assume it could have a perf advantage, as it explicitly skips calculating the derivatives. But some hardware may be smart enough to already not do that for textures without mipmaps.
     
  9. BOXOPHOBIC

    The compiled code is definitely different; all I could find is this:

    sample_l ignores address derivatives, so filtering behavior is purely isotropic. Because derivatives are ignored, anisotropic filtering behaves as isotropic filtering.
    https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/sample-l--sm4---asm-
     
  10. BOXOPHOBIC

    11. Sampling per vertex vs per pixel
    Is sampling per vertex and then passing the value to the pixel shader more optimized? Or does sampling cost the same regardless? And what happens to the cache when sampling per vertex?
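    For what it's worth, a vertex shader has no screen-space derivatives, so any sampling there has to use an explicit LOD; something like this sketch (placeholder names):

    Code (CSharp):
    // inside the vertex function: the mip level must be given explicitly (here mip 0)
    float height = tex2Dlod(_HeightMap, float4(v.uv, 0, 0)).r;
    v.vertex.y += height * _Displacement;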
     
  11. bgolus

    Oh yeah, compiled code will be different. But that doesn't guarantee a measurable performance difference.

    Though I should note I misread your question, and I suspect you may have miswritten it. tex2Dbias() still uses derivatives and is equivalent to sample_b. A tex2Dbias() with a bias of 0 isn't actually any different than tex2D(), and I would expect some compilers to optimize it to just that. sample_l is equivalent to tex2Dlod(). Also, counter-intuitively, it is plausible for tex2Dbias and tex2Dlod to be slower than tex2D in some use cases / hardware, because both pass more information from the shader to the texture units; so if you're memory bandwidth limited rather than sampler time limited, the "faster" option of tex2Dlod() could be slower.

    Again, this will depend on the hardware. This was a common optimization in the past, and still potentially one for mobile. But it will absolutely be thrashing the cache, as the texture sample positions are almost guaranteed not to be contiguous, unless you've taken care with the order of the vertices and the UV positions you're sampling from (and mesh optimization doesn't change too much / the GPU's execution order of the vertices works in your favor).

    Personally I don't think of per-vertex vs. per-pixel texture sampling from a performance point of view, but rather what the end goal is, since the end visual results will be very different. But on modern desktop GPUs it's almost always cheaper to recalculate a lot of data in the fragment shader than to pass that data down from the vertex shader. In my experience mobile still seems to see performance benefits from passing data from the vertex to the fragment. It also depends on the density of the mesh: if there are more vertices than pixels, doing stuff per-pixel will always be cheaper, but at that point maybe think about using mesh LODs.
     
  12. BOXOPHOBIC

    More vertices than pixels in Unity? In a standard Full HD screen scenario? No, Unity can't handle that :D

    Thanks for the insights! I used tex2Dbias thinking it works the same as tex2Dlod.

    The fact is I'm trying to understand some of the low-level stuff, but in most cases it boils down to hardware and compilers. In my case, I have shaders with a dozen features always on, 3 main textures, 3 detail textures, 4 global texture arrays, 1 emissive texture, 2 3D noise textures, plus all the internal Unity textures. And if instead of all these I just use a constant color for the albedo, with no other features, I end up with the same frame rate on a consumer GPU.

    It is more a quest to understand how things work and where things can be improved.
     
  13. bgolus

    By what measurement?

    Unity's fps display is showing the CPU framerate. If your framerate doesn't change between drastically different shaders, it's probably because you're CPU limited and not GPU limited for what you're rendering. You'd need to use some kind of GPU profiling to see the difference.
     
  14. BOXOPHOBIC

    I usually use straight-up fps measurements (the same as I do with "instructions" using the Unity shader compiler), and since it is HDRP it is for sure CPU bound, because my scenes are quite simple. But since I develop for the store, my scenarios don't count anyway, plus my customers don't report performance issues due to shaders, so I guess it is fine :)
     
  15. BOXOPHOBIC

    12. Is the GPU automatically culling triangles that have "zero" size?

    Basically scaling the triangles to zero size in the shader, or some distance-based size fading.

    I can see a huge difference in RenderDoc, but I assume it is because there are fewer pixels to shade.
    I also see performance differences on mobile when distance size fade is used.

    In the recent article from the Unity blog, they say this:
    https://blog.unity.com/games/experience-the-new-unity-terrain-demo-scenes-for-hdrp-and-urp
    One thing worth noting, however, is that the LOD Group component is not compatible with Terrain details, though you can still use Prefabs for details and tweak the cull distance in the Shader Graph shader or via the Detail Distance setting on the Terrain.


    The culling distance in SG is basically just a simple distance size fade.
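    In shader terms, the fade boils down to something like this in the vertex shader, as a sketch with URP-style names (_CullDistance and _FadeRange are placeholder properties):

    Code (CSharp):
    // scale the vertex toward the pivot as it approaches the cull distance;
    // past it the triangle becomes zero size
    float3 worldPos = TransformObjectToWorld(v.positionOS.xyz);
    float fade = saturate((_CullDistance - distance(worldPos, _WorldSpaceCameraPos)) / _FadeRange);
    v.positionOS.xyz *= fade;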
     
  16. bgolus

    Yes.

    But that's not something RenderDoc is capable of showing you.

    RenderDoc can show you how long things are taking, but not necessarily why. Even using something like Nsight or other bespoke low-level profiling tools, which give much more granular information about what the GPU is doing, won't really show you this.

    The problem is that at the level any of these tools let you see, there's not really a difference between a triangle that's so thin or small that no fragments are rendered and a triangle that is infinitely small. The rasterization hardware is a black box that takes the vertex positions, and the only output is how many fragments run afterwards. How that hardware chooses which fragments, and what optimizations it has to speed up those calculations beyond the basics of vanilla rasterization, probably falls under the realm of industry secrets.

    RenderDoc can tell you how many vertices a mesh has, how many triangles, and how many fragments of a mesh end up in the final image. But it can't tell you how many of those were actually executed with any accuracy. Even the fragment execution count is a guess, as it can't fully differentiate between fragments that were skipped via early depth stencil rejection and fragments simply not included in the final image due to late depth stencil rejection; it can only make an educated guess based on the render state and high level hardware capabilities. Even the vertex execution count is a guess, assuming it's the number of vertices passed to the GPU, which isn't strictly accurate as some GPUs will process some vertices multiple times!


    But the TLDR of this, when I've asked "people who know" (i.e. people who have themselves worked on the hardware directly, or who can ask those people and get the answer), is: yes, infinitely small triangles are absolutely skipped on all commonly used GPUs, and skipped in a way that's faster than triangles that are just too thin or small to be visible at the current resolution. Similarly, vertices with a NaN position will cause triangles that use that vertex to be skipped on all commonly used GPUs.
     
  17. BOXOPHOBIC

    This is good to know! I always assumed scaling the vertices to 0 from the shader is the same as rendering the object. I even added a note telling people to use a culling system.

    Speaking of RenderDoc, I have a shader that uses a branch to skip sampling a few textures. When checking the timings on 2 objects, one sampling the textures and one not, there is a big difference in timings, but RenderDoc always shows the textures in the Pixel Shader. So I assume it just shows all the bound textures, all the time.

    Thanks again for the detailed explanations!
     


  18. bgolus

    This might be getting pedantic, but scaling the vertices to zero from the shader is the same as rendering the object. You're still paying the full cost of rendering it on the CPU, and the full cost of calculating the vertices (ignoring the possibility of branching to a fast path on "hidden" vertices). The only thing you're really saving is the cost of shading them, and potentially a minor reduction in initial rasterization. This can be a non-trivial saving depending on at what size on screen you start to cull them, but don't discount the rest of the costs you're still paying.

    Because the pixel shader still has those textures bound. Branches don't change that fact. Can't change that fact. They can only change whether or not the textures get sampled.
     
  19. jbooth

    So on Unity shaders and branching:

    As bgolus pointed out, most of these shaders were written for very old hardware, and URP is essentially a port of these shaders in many ways. And they still have to support low-end hardware like the Quest, which is particularly sensitive to GPU cost. So Unity has to be very conservative and keep their shaders fairly simple.

    That said, another consideration is the new SRP batcher. The SRP batcher can handle multiple materials running the same shader variant, but each new shader variant causes a new batch. So depending on your feature set, it might be more efficient to use branches than variants for small feature changes. This creates a bit of a dilemma for me with Better Lit Shader, which currently favors variants for everything. For instance, if the user uses brightness/contrast on some materials but not others, that currently creates new variants, which breaks batching. So is paying ~7 cycles for a branch a better option there? It kind of depends on how the shader is used across your scene. If you have 20 such options as branches, that might completely kill performance on low-end hardware. But making hundreds of batches instead of one might also hurt performance in another way.
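    To illustrate the trade-off with a single toggle (a sketch, not the actual Better Lit Shader code):

    Code (CSharp):
    // variant approach: zero per-pixel cost, but every keyword combination is a
    // new shader variant and therefore a new SRP batch
    #pragma shader_feature_local _CONTRAST_ON
    #if defined(_CONTRAST_ON)
        color.rgb = (color.rgb - 0.5) * _Contrast + 0.5;
    #endif

    // branch approach: one variant (keeps batching), a few cycles per pixel
    UNITY_BRANCH
    if (_ContrastEnabled > 0.5)
        color.rgb = (color.rgb - 0.5) * _Contrast + 0.5;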

    Another thing I run into is constant buffer size, though I haven't done any measurements here to know if it's really a problem or not (I suspect it's not that big of a deal). Better Lit Shader has hundreds of features, and you can't #if #endif anything in the CBuffer because it will break batching, so every material has to upload a massive constant buffer. I could see that making the SetPass calls a bit more expensive, but again I don't know by how much.
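    Concretely, the constraint is that the UnityPerMaterial layout has to be identical across variants, so something like this sketch is off the table:

    Code (CSharp):
    CBUFFER_START(UnityPerMaterial)
        half  _Contrast;
        half  _Brightness;
        // wrapping members in #if defined(_SOME_FEATURE) ... #endif here would
        // change the cbuffer layout between variants and break the SRP batcher,
        // so every member stays, used or not
        half4 _ExtraParams;
    CBUFFER_END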

    Another thing I need to time soon is just how bad extra interpolators are on the Quest. I'm currently working on a Quest project, and Better Shaders (along with all of Unity's shaders and Shader Graph shaders) assumes you need certain things like texcoord0 or the tangent. However, many of my shaders for this project use a MatCap-style lighting system and do everything in world space, so they have no need for either of these.
     
  20. BOXOPHOBIC

    I noticed branching with textures can be tricky. For testing I've been using PIX lately, as it can show whether a texture is used or not. I use your derivatives trick from Medium. All good and nice in the pixel shader, but it always samples the textures in the vertex shader (either PIX is just showing it, or the GPU decided it's not worth branching even if [branch] is used; is that possible?). If I use a shared sampler it seems to work, but that can cause other issues when the sampler is not used. I think you have a workaround for that too. So after doing a lot of tests in PIX and RenderDoc, I never found any benefits or a good workflow for using branches, so so far I don't use them. I probably don't have a single feature that doesn't depend on some textures.

    As for variants, I keep everything to a minimum in my store shaders so the SRP batcher can do its work. On a high-end GPU, having all the features enabled is fine in my tests. If users want simpler shaders, they can also disable features in Amplify. I always use a big shader function with tons of options you can toggle.

    Not sure about the CBuffer; I have a ton of properties and I'm not sure how those impact performance. I use a few interpolators, and they seem to improve performance a bit on mobile.

    PS: I'm still waiting for the day I can use Better Shaders in ASE :p
     
  21. jbooth

    Branching being an optimization really depends on a lot of factors. In MicroSplat, if you create a shader with distance resampling, triplanar, and stochastic (all of which essentially multiply the number of samples needed by 2 or 3), then branching is a significant savings. On a 2019 MBP, it's about 3x as fast. But in this case you're often reducing the sample count from around 150 per pixel to 20-50 per pixel, and given that you're memory bound, the branches can still cost you significant time (in cases where the culled area isn't large enough, for instance). The culling is also hierarchical, in that a branch on the weight of a texture culls all of the stochastic, triplanar, and distance resampling functions as well, which makes a huge difference, because a single branch may cull a TON of work. (You can toggle the culling between several levels of how aggressively to branch, and see the difference.)

    So it's not as simple as "Hey, here's a texture sample, let's branch around it to save performance".
     
  22. BOXOPHOBIC

    13. When using inline samplers, do they have any performance benefit compared to using the texture's own sampler?

    What happens if 10 textures are used, each with its own sampler, versus using 10 textures with the same inline sampler? Is the latter more optimized?

    Code (CSharp):
    tex2D( _Tex, uv_Tex );
    SAMPLE_TEXTURE2D( _Tex, sampler_linear_repeat_aniso2, uv_Tex );
     
  23. bgolus

    In an apples-to-apples comparison of a texture's sampler vs an identically configured inline sampler used to sample a single texture, there's no performance difference, because there isn't a difference as far as the GPU is concerned.

    In the case of reusing one sampler to sample multiple textures, it'll depend heavily on the GPU hardware. It can be a benefit for some GPUs, worse for others, and have no effect one way or the other on the rest. And it'll depend on what else you're doing in the shader. You kind of just have to benchmark it to find out. Some GPUs don't actually have multiple hardware texture units per fragment, so it doesn't matter. For others it's a benefit to use exactly the number of texture units the hardware has and no more. But how many a particular GPU has sometimes isn't publicly revealed.
     
  24. BOXOPHOBIC

    I see, thanks for the answer. Even if they are the same, I can see 2 benefits of using inline samplers that are worth mentioning:

    a. You can use 128 textures or whatever the limit is in a shader, instead of being capped by the per-texture sampler limit:
    Code (CSharp):
    maximum ps_5_0 sampler register index (16) exceeded
    b. URP/HDRP don't throw an error when some shared samplers are not used in some passes:
    Code (CSharp):
    Fragment program 'frag': Unrecognized sampler 'sampler_albedo' - does not match any texture
    @jbooth If you don't mind: I know you bypass the above issue by doing a dummy sample of the texture. Why don't you use inline samplers?
    Code (CSharp):
    o.Albedo *= saturate(1 + SAMPLE_TEXTURE2D_LOD(_MainTex, sampler_MainTex, float2(0,0), 12)).x;
     
  25. jbooth

    Mainly because then the user has no way to set various options, like clamping, trilinear, etc. This is not great, since every texture has controls for those things; in most of my shaders they all come from the albedo texture, so most of those controls are ignored, but at least you can still set them. The alternative would be to either not provide access to these, or do them via shader_features (or, in the case of MicroSplat, the shader generator), which would be a horrid use of variants or generation options. I do use inline samplers where user control of the texture options is not needed, though.

    IMO, samplers are pretty broken on GPUs from a configurable shader point of view, and the DX9 style "sampler association with a texture" is even worse.
     
  26. bgolus

    This isn't a benefit specific to inline samplers so much as reusing samplers.

    This is because Unity (correctly) culls any texture references not used by a pass, but that also means there's no texture asset to get a sampler state from. This happens in the built-in renderer too and isn't specific to the SRPs.
     
  27. jbooth

    Nitpicky, but this is actually up to the compiler. The OSX compiler does not cull these, which has been the bane of my existence, since I develop on OSX and don't see these issues until someone on Windows reports them.
     
  28. BOXOPHOBIC

    I was thinking this might be the case; I also use per-texture samplers so users can configure them as they like.
     
  29. BOXOPHOBIC

    Yes, true. I was mentioning it in the context of the second point, the unregistered sampler issue, which is a bit easier to manage with inline samplers.

    Interestingly, I don't remember seeing this issue with surface shaders, or with URP 7 / HDRP 7 if I remember correctly, but it makes sense. I know I was reusing samplers from the albedo for a long time in my assets, until at one point Unity started throwing the error. But I can't remember for sure with the ever-changing landscape "unity" provides :D
     
  30. Just_Lukas

    Hey boys, this is a very helpful thread, so I would like to step in with my own questions:

    14. Is it always good practice to add uniforms to constant buffers?
    I know this is required for SRP batcher compatibility with the UnityPerMaterial cbuffer, but I've noticed things like unity_MatrixVP aren't inside any CBUFFER at all. So are these non-cbuffer uniforms added to some default "global" cbuffer by the compiler, or does it depend on the graphics API?

    15. Why don't Textures & SamplerStates need to be inside CBUFFERs?
    My theory is that textures are resource types and are thus handled differently by GPUs, but I couldn't find more info to confirm this.
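    Roughly what I mean, sketched with URP-style declaration macros:

    Code (CSharp):
    // scalar/vector material data lives in the per-material constant buffer
    CBUFFER_START(UnityPerMaterial)
        float4 _BaseMap_ST;
        half4  _BaseColor;
    CBUFFER_END

    // textures and sampler states are separate GPU resources with their own
    // bind points, so they are declared outside of any cbuffer
    TEXTURE2D(_BaseMap);
    SAMPLER(sampler_BaseMap);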
     
  31. aleksandrk (Unity Technologies)
    It depends on the graphics API. Some of them don't even have a concept of separate uniforms, so all ends up in a cbuffer anyway. Others do, and it may be more efficient to have separate uniforms than cbuffers there. It all depends on the use case.

    You're right, other resources cannot be part of cbuffers :)
     
  32. jbooth

    One thing I've always wondered about and been unable to find an answer on is the cost of CBuffer size. Better Lit Shader, for instance, has a pretty massive cbuffer for its properties, since doing an #if #endif around CBuffer elements breaks SRP batching. In most cases 95% of that cbuffer is unused, since the shader has a massive number of options. I've also noticed that people have stopped packing things together at the property level (4 floats in a Vector instead of 4 properties), so I'm wondering what the impact of all this is on modern GPUs/APIs.
     
  33. aleksandrk (Unity Technologies)
    The compiler culls the unused values from the cbuffer, so you only pay for the memory (the cbuffer layout doesn't change between variants). Since cbuffers don't exceed 64KiB anyway, it's not an issue. It may be non-optimal on mobile devices where memory bandwidth is scarce - this will depend on whether the whole cbuffer is uploaded for each draw call or just the modified parts.
     
  34. vlery

    Hey, I think 'dependent texture reads' had a somewhat different definition: they were recommended against because the GPU would pre-fetch texture data before the fragment shader runs, which as far as I know was a PowerVR design. But I haven't found much discussion about Metal GPUs, considering Apple hired many people from PowerVR.
     
  35. bgolus

    A Dependent Texture Read is any time a UV is modified in the fragment shader prior to being used to sample a texture. Using the results of another texture sample to modify those UVs is just one of the more common cases. I.e.: the sample is dependent on other instructions occurring before it can start.

    What you're describing is an Independent Texture Read, one whose UVs don't get modified in the fragment shader and thus can be pre-fetched. This is something modern GPUs just don't really do anymore; it ceased to be a concern for PowerVR GPUs around Rogue (2012 on). PowerVR wasn't the only GPU that had this optimization / performance issue: Nvidia GPUs from the decade prior also had issues with Dependent Texture Reads, and if you look at their optimization guides from circa the mid 2000s you'll find recommendations to avoid them.

    As a basic rule of thumb, any GPU with support for OpenGL ES 3.1 or DirectX 10 doesn't take a nosedive in performance when using dependent texture reads, and no longer explicitly pre-fetches texture data for independent texture samples. The only GPU I don't think ever had this issue is Mali.

    That's not to say dependent texture reads are "free" on more recent hardware. They're still more expensive than independent reads, just no longer to the point of being a major concern. AMD optimization guides from 2012 still include warnings about them, though they don't explicitly tell you to avoid them.
     
  36. bgolus

    Also a basic example of a Dependent Texture Read that can actually be an optimization vs an Independent Texture Read for modern hardware:
    Code (csharp):
    float2 uv_MainTex = TRANSFORM_TEX(IN.uv, _MainTex);
    float2 uv_EmissiveTex = TRANSFORM_TEX(IN.uv, _EmissiveTex);

    float4 col = tex2D(_MainTex, uv_MainTex);
    float4 em = tex2D(_EmissiveTex, uv_EmissiveTex);
    Unity's built-in renderer shaders default to doing the TRANSFORM_TEX for UV tiling and offset in the vertex shader and passing that data to the fragment shader to be used. That results in independent texture reads, which you would expect to be the cheaper option.

    However Unity's Shader Graph and SRP shaders generally do the equivalent UV transformations in the fragment shader instead, making these dependent texture reads ... but on the majority of hardware this is faster! Why? Because the cost of transferring two extra float values isn't free, but two extra MAD (multiply-add) instructions in the fragment shader nearly are.
     
  37. vlery

    Sorry for the false statement; I am going to fix that in my original reply to avoid misunderstanding. And thanks for the detailed explanation. I have seen some tests reach similar conclusions on modern GPUs, but I really appreciate you providing the overall view!
     
  38. jbooth

    Personally I don't consider that a dependent texture read, because while you are modifying the UV coordinates in the fragment shader, you are not doing it based on another texture read. I would consider a DTR to be:

    Code (CSharp):
    float noise = tex2D(_Noise, uv).r;
    float4 col = tex2D(_MainTex, uv_MainTex + noise);
    Because this requires the result of the first texture fetch to finish before the second one can be started.

    I have noticed that in a reasonably complex shader, one level of this tends to get pipelined out, but multiple levels of this start to really stall the GPU. It's all about whether there's enough work to keep the GPU busy while it waits for that first texture read.
     
  39. jbooth

    For instance, the hellscape of DTRs in MicroSplat looks somewhat like this:

    - Sample up to 8 uncompressed textures for the splat maps
    - Sort those 32 weights and take the top 4
    - Sample per-texture UVs from a small LUT if needed
    - Sample the 4 best texture sets (albedo, normal, etc.)

    When I switch to a single-texture format (2 indexes and a weight) instead of 8 of these splat maps, the shader gets significantly faster, and a tiny bit faster again when per-texture UV scales are disabled. That's entirely because the GPU must wait for the completion of each stage to start the next one.
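    As an illustration only (not the actual MicroSplat code), the dependency chain looks something like this, where each step can't start until the previous sample returns:

    Code (CSharp):
    half4 w = tex2D(_Control0, uv);                                  // 1) splat weights
    int best = 0;                                                    // 2) pick the strongest channel
    if (w.y > w[best]) best = 1;
    if (w.z > w[best]) best = 2;
    if (w.w > w[best]) best = 3;
    float2 s = tex2D(_PerTexLUT, float2((best + 0.5) / 4.0, 0)).xy;  // 3) per-texture UV scale
    half4 albedo = tex2D(_Albedo0, uv * s);                          // 4) final texture set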
     
  40. aleksandrk (Unity Technologies)
    Some older GPUs do :)
     
  41. bgolus

    The original definition of dependent texture reads is any fragment shader modifications to the UVs. But for the last decade the term has kind of morphed to refer only to modifying UVs based on another texture read, because that’s the only case that really matters today.

    We kind of need a new term for it, or call it DTR 2.0, or something. Because if you just say “dependent texture reads are bad”, and someone looks it up, they might come across the old Nvidia or PowerVR optimization guides that say you need to calculate the UVs in the vertex shader and not touch them in the fragment shader, which is simply not true anymore.
     
  42. jbooth

    Yeah, ironically, to me the current term describes the modern version perfectly, since you're dependent on a texture read to compute something, but the fact that people used it to mean "modified UV in the fragment shader" was wrong... haha
     
  43. bgolus

    It was dependent on the fragment shader.

    Fragment Shader Dependent Texture Read vs Texture Read Dependent Texture Read
     