
Question Help on improving shader performance

Discussion in 'Shader Graph' started by killerfrogy, Jul 23, 2021.

  1. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    I am after some advice on improving the performance of a shader I have written. The shader is used on all of the tiles in an infinitely generating 2d game I am currently working on. Here's a photo of what the shader looks like in game:
    upload_2021-7-23_16-29-37.png
    The shader does several things: it adds an outline/border to the tiles, adds a pseudo-2D ambient occlusion, and shades each tile based on its light value and its neighbors' light values. Because each tile's material is passed data specific to that tile, such as its light value and its neighbors' light values, every tile has its own instance of the shader, meaning no SRP batching can be achieved.

    Here's a screen grab of the shader:
    upload_2021-7-23_16-34-25.png

    This is my first time asking for help improving the performance of a shader, so I am likely lacking a lot of information in this post that may help you help me. Please let me know what other pieces of info I can add to this post to give you some more insight.

    I am open to making drastic changes, such as remaking it in Amplify, if that would help. I was also wondering if there is such a thing as batching different sections of the shader.
     


  2. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,780
    You have a lot of stuff happening there.

    First, you need to organise things. Group shader sections. Create sub graphs and reuse them wherever part of the graph repeats.
    Reduce the main shader to fewer nodes.

    Align nodes properly and route the connecting lines at 90 degrees. That makes it easier to follow what is going on. Diagonal lines running in random directions are just a mess.
    Double-click on a line to create a connector node, which helps organise them.
    You can also create groups of nodes.

    Try to use a texture as a mask where possible, instead of recreating such masks procedurally.
    It appears you build various elements in the shader where a texture would be more than sufficient. Then just apply the relevant filter.

    Once you have reduced and optimised the shader, it will be easier to see what else may need optimising.
     
  3. LandonTownsend

    LandonTownsend

    Unity Technologies

    Joined:
    Aug 21, 2019
    Posts:
    35
    Be sure not to confuse "my Shader Graph uses a lot of nodes" with "my Shader Graph is expensive", and remember that reducing the number of nodes, or making the graph look less complicated, does not always make it perform better. Some nodes work out to anywhere from dozens to hundreds of math operations. I especially want to point out that texture samples can be more expensive than dozens of simple math operations, so replacing simple calculations with texture samples is not always optimal. On the other hand, if you are using procedural noise anywhere, such as Voronoi or gradient noise, those perform hundreds of math operations per pixel, so replacing them with texture samples can significantly improve performance. This is why, if you need help getting your graph to perform better, knowing which nodes you're using can be more important than knowing how many nodes you're using. Instead of taking a zoomed-out screenshot of your graph, it might be better to provide several zoomed-in screenshots so others can see which nodes are actually in use.
     
  4. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,780
    Those are valid points indeed.

    Yet I want to point at the very top part of the graph, which creates a simple corner effect.
    Mind, this is just for a 2D game. It seems the shaded corner doesn't need to be high-res, nor procedural.
    However, the effect is created with about 30 nodes.
    Surely loading a corner mask texture is much more performant in this case.
     
  5. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    Some questions:
    • why break batching?
    • why do it all in a fragment shader every frame?
    • why not alter some source art for outlines?
    I'm seeing fat zero reasons for any of this being in-shader to begin with. If done right you would address the entire perf problem automatically. This means you would want to go with a tilemap and vertex data, and drive most of it on the CPU, because it would only update on changes such as digging or when a chunk of tiles is about to be rendered.
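    A minimal Python sketch of the "only update on changes" idea, assuming the per-chunk data can be baked by some function and cached until a tile in that chunk changes. The class name `ChunkCache` and the `bake` callback are hypothetical and not part of any Unity API:

```python
from typing import Callable, Dict, Hashable, Set


class ChunkCache:
    """Caches baked per-chunk data and re-bakes only chunks marked dirty."""

    def __init__(self, bake: Callable[[Hashable], object]) -> None:
        self._bake = bake                      # hypothetical expensive bake step
        self._cache: Dict[Hashable, object] = {}
        self._dirty: Set[Hashable] = set()
        self.bake_count = 0                    # how many times bake actually ran

    def mark_dirty(self, chunk: Hashable) -> None:
        """Call when something in the chunk changes, e.g. a tile is dug out."""
        self._dirty.add(chunk)

    def get(self, chunk: Hashable) -> object:
        """Return cached data, baking only if missing or dirty."""
        if chunk not in self._cache or chunk in self._dirty:
            self._cache[chunk] = self._bake(chunk)
            self._dirty.discard(chunk)
            self.bake_count += 1
        return self._cache[chunk]


cache = ChunkCache(lambda chunk: chunk[0] + chunk[1])  # stand-in bake function
cache.get((1, 2))        # first request bakes
cache.get((1, 2))        # second request is served from the cache
cache.mark_dirty((1, 2))
cache.get((1, 2))        # re-baked once after the change
print(cache.bake_count)
```

    In a real project the bake step would write vertex colours or tilemap data rather than return a number; the point is only that unchanged chunks cost nothing per frame.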

    If it's all going to be on GPU in a shader, then you are probably just not going to be able to optimise it easily without sharing the entire shader. And likely it will be a bandwidth and ALU problem - looks to me possibly both but I can't see properly what you are doing.
     
  6. april_4_short

    april_4_short

    Joined:
    Jul 19, 2021
    Posts:
    489
    This question and the screenshot might be the best case against Shader Graph I've seen, so far.

    Is there a way to generate a code file from that Shader Graph, so folks can read it and assist?

    A way to share Shader Graphs that provides the OP some copyright protection if he's got unique things going on he might want to protect?

    Shaders can be copyright protected, can a Shader Graph be protected?
     
  7. Qriva

    Qriva

    Joined:
    Jun 30, 2019
    Posts:
    1,314
    I am confused now. I know texture sampling is not a cheap operation, but some people say it is better to replace sampling with a node such as gradient noise or some other procedural approach, while others say it's better to sample a noise texture instead of generating it procedurally.
    What is ultimately true?
     
  8. hippocoder

    hippocoder

    Digital Ape

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    To answer this is pretty simple (going to keep it dumbed down):

    If your particular project uses a lot of bandwidth then it's a bandwidth problem.
    If your particular project uses a lot of maths, then it is an ALU problem.

    How much of a problem those are will 100% depend on how strong the GPU is for a particular device. On a mobile, that's likely a problem. On a 3090 it will likely not be a problem.

    There is no "ultimately true" -- that makes no sense, ever. Simply understand that it takes resources and resources are finite.

    An open world game is going to have a LOT of bandwidth abuse simply because there's so much of it. A 2D game probably won't tax bandwidth or math at all. But what if the resolution was much higher? In this case it would.

    The reason you're hearing things like math is better for performance is just because GPUs have been getting ALU performance increases for years but memory bandwidth has stagnated a lot. This is just usually a desktop GPU discussion. Mobiles have awful math performance compared to desktop.

    This in no way means ALU is great and free. In fact it's really crap on mobiles, so there you go. On a mobile, though, the project is unlikely to be of a scope that would eat through bandwidth that much, so using LUTs and textures to store data is a great strategy.

    Check out https://blog.unity.com/community/shadowgun-optimizing-for-mobile-sample-level - in this case there is a BRDF stored in a texture, but now with physically based shaders this is usually done in math. On a lot of hardware, the BRDF texture approach is still faster. If that same mobile used many more texture samples, or full screen post (likely math is simple but bandwidth is not), then the BRDF texture lookup might even be slow on an old mobile and math would win out again.

    As above though, it depends as a whole how much you are hitting that GPU. You may put some in math, you might put some in texture, and balancing those two is called optimisation.
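    To make the math-versus-texture trade-off concrete, here is a small Python sketch of baking a function into a lookup table once and then sampling it with linear interpolation, which is roughly what a filtered 1D texture fetch does. The function `expensive` is a made-up stand-in for ALU-heavy shader math, and the table size of 256 is an assumption:

```python
import math


def expensive(x: float) -> float:
    """Hypothetical stand-in for an ALU-heavy per-pixel function."""
    return math.pow(max(x, 0.0), 2.2) * (0.5 + 0.5 * math.cos(6.0 * x))


def bake_lut(size: int = 256) -> list:
    """Evaluate the function once at `size` evenly spaced points in [0, 1]."""
    return [expensive(i / (size - 1)) for i in range(size)]


def sample_lut(lut: list, x: float) -> float:
    """Linearly interpolated lookup, similar to a filtered 1D texture sample."""
    x = min(max(x, 0.0), 1.0)
    pos = x * (len(lut) - 1)
    i0 = int(pos)
    i1 = min(i0 + 1, len(lut) - 1)
    t = pos - i0
    return lut[i0] * (1.0 - t) + lut[i1] * t


lut = bake_lut()
# Worst-case difference between the lookup and the direct evaluation.
max_err = max(
    abs(sample_lut(lut, i / 1000) - expensive(i / 1000)) for i in range(1001)
)
print(f"max LUT error: {max_err:.6f}")
```

    The lookup trades math for a memory read, so whether it wins depends on which of the two the target GPU is short on.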

    Typically though here's why a lot of those asset store assets are a real problem in a real game. They run great in the demo scenes but I've never seen a product demo scene on asset store that hits bandwidth at all, so everyone assumes it's OK but in a real game scenario that's shunting a forest's worth of data around... not so much. In fact the technique may well be useless and need to be done from scratch with the available bandwidth / math in mind.

    Gonna shut up and be pushed into a small care home cupboard while the staff go home.
     
  9. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    To try to distill some of the suggestions made above:

    You don't need to do everything in one massive shader. Using multiple materials, or rendering parts (like the faux AO) to a low resolution render texture that you then sample from can help this significantly. None of us can really see what's going on in that shader since it's too zoomed out to give any more specific feedback, but realistically there's probably nothing you can do to significantly improve performance when using a single monolithic shader like this. Most optimization for shaders comes from doing less of something, or just not doing something at all, vs. putting a lot of effort into trying to make the thing you're doing faster. And one of the easiest ways to do less of something is to do it at a lower resolution, as a separate render pass to a lower resolution render texture target.
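    As a back-of-the-envelope illustration of why a lower resolution pass does less work, this Python snippet counts the pixels shaded at different render scales. The 1920x1080 screen size is an assumption, not something taken from the thread:

```python
def fragment_count(width: int, height: int, scale: float = 1.0) -> int:
    """Pixels shaded when a pass renders at `scale` times the screen size."""
    return int(width * scale) * int(height * scale)


full = fragment_count(1920, 1080)           # full resolution pass
half = fragment_count(1920, 1080, 0.5)      # half resolution in each axis
quarter = fragment_count(1920, 1080, 0.25)  # quarter resolution in each axis

print(full, half, quarter)
print(full / half, full / quarter)
```

    Halving the resolution in each axis shades a quarter of the pixels, so an effect that tolerates blur, such as soft ambient occlusion, can get several times cheaper.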

    That won't fix all of your problems as not everything can be solved in that way. Other things that have already been mentioned are things like doing stuff on the CPU. If it's something that is expensive, but the CPU can do once and never have to do again (or do infrequently) then it can be a huge win still. Stuff like the outlines you're doing could be much more efficiently handled by having bespoke "edge" parts that are placed by the CPU. @hippocoder mentioned tilemaps which can help with this kind of thing.
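    One way the CPU could decide which "edge" parts to place is a neighbour bitmask per tile. The Python sketch below is only an illustration under assumed names: the bit layout and the `edge_mask` function are hypothetical, and the solid tiles are represented as a plain set of grid coordinates:

```python
from typing import Set, Tuple

# Bit flags for the four edge-sharing neighbours (hypothetical layout).
UP, RIGHT, DOWN, LEFT = 1, 2, 4, 8


def edge_mask(solid: Set[Tuple[int, int]], x: int, y: int) -> int:
    """Return a 4-bit mask with one bit set per exposed side of tile (x, y)."""
    mask = 0
    if (x, y + 1) not in solid:
        mask |= UP
    if (x + 1, y) not in solid:
        mask |= RIGHT
    if (x, y - 1) not in solid:
        mask |= DOWN
    if (x - 1, y) not in solid:
        mask |= LEFT
    return mask


# An L-shaped group of three solid tiles.
solid_tiles = {(0, 0), (1, 0), (0, 1)}
for tile in sorted(solid_tiles):
    print(tile, format(edge_mask(solid_tiles, *tile), "04b"))
```

    The mask only needs recomputing for a tile and its neighbours when one of them changes, and it can index directly into a set of 16 pre-drawn edge sprites.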

    Remaking this in Amplify is not really a "drastic change". It might feel like a drastic change, but they're fundamentally the same kind of system, and shader code generated by Amplify doing the same steps as Shader Graph is going to be just as fast or slow as the code Shader Graph generates. Really bespoke handwritten shader code won't be significantly faster either ... apart from maybe being faster to compile; Shader Graph generates extremely verbose code even if it compiles down to a similar number of instructions that actually run on the GPU.


    Really I have one question ... which is:

    Is this shader actually slow? And if so on what hardware?


    You've clearly not seen a lot of production node graphs ... I've personally made far more heinous graphs using Unreal Engine 3 a decade ago. This one is fairly banal by comparison.

    Shader Graph is a shader generator. You can convert any shader graph into a traditional vertex fragment shader written in hlsl, because that's how Shader Graph (and Amplify, or even the old Surface Shaders) make shaders that run on the GPU. It's incredibly verbose and ugly code, but it is technically readable code. But looking at the code vs looking at the node graph isn't any more helpful for the topic of optimization.

    Node graphs are still code. They might not seem like it, but they are, and just as legally protected. That said sharing any code or node graphs you want to protect on a public forum ... yeah ... I wouldn't do that. That's going to get copied by someone if it's useful. Which is why I always ask anyone who posts code from paid assets to delete it immediately (unless it's the original author).
     
  10. Qriva

    Qriva

    Joined:
    Jun 30, 2019
    Posts:
    1,314
    * proceeds to essay * :D

    On a more serious note, thanks for the explanation. I had already heard that mobile is way worse in certain areas, so I guess it's all about who the target is and how much we abuse math and textures.
     
  11. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    Thanks everyone for your replies. I don't really know where to start in replying to you all. My fault for being away from this thread for a while and letting it build up with replies. I think I'm going to take the time to reply to each of you with separate comments. Well, here goes :D
     
  12. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    Thank you for taking the time to try and help me with my problem. The shaded corner you refer to isn't always a shaded corner. It could be a left side, top side or corner. I will experiment with replacing this with texture lookups to see the performance change.

    In regards to tidying the shader up you are absolutely right, I do need to take the time to clean it up and make it more readable. I do this sometimes where I forget to slow down and take the time to do things the right way, rather than racing ahead.
     
  13. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    Thank you for taking the time to try and help me. Here are some zoomed in screenshots with descriptions:

    This section of the shader works out the 'shading' of the tile. It looks at the tile's own light value and its neighbors' light values to determine how to smoothly light itself. It takes a 3x3 matrix containing the light values of itself and its neighbors. On the left you can see a bunch of polar coordinate nodes; these are used to calculate the distance from any given pixel to a neighbor. For example, the polar coordinates node for the neighbor directly above has a center of x=0.5 and y=1.5. Then a series of split nodes is used to take the R component of each of these.

    upload_2021-7-29_19-2-39.png

    Here is a further zoomed-in section of these polar nodes being combined into row vectors and then processed with a variable called 'LightAreaOfInfluence', which controls how much of the tile's lighting is affected by its neighbors.
    upload_2021-7-29_19-8-19.png

    Here is where the clamp outputs go:
    In this section each row vector has 2 dot products performed: one with a constant vector (1,1,1) and one with the light values of the row.
    These 6 dot products are then turned into 2 Vector3s, the top one being the combination of the light value dot products and the bottom one being the combination of the const dot products. A dot product is then taken of each of these new vectors with its respective vector (light value vector and const vector).
    upload_2021-7-29_19-10-59.png

    The next step is to divide the light value dot product by the const dot product, essentially calculating the average light value for each pixel.
    upload_2021-7-29_19-17-55.png
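
    For readers who cannot see the screenshots, the steps described above amount to a distance-weighted average over the 3x3 neighbourhood. The Python sketch below is a rough reconstruction, not the actual graph: the post does not say exactly how 'LightAreaOfInfluence' is applied, so the falloff `clamp(1 - d / area, 0, 1)` is an assumption, as are the function and parameter names:

```python
import math

# Centres of the 3x3 neighbourhood in the tile's own UV space. The tile
# spans [0, 1], so its centre is (0.5, 0.5) and the neighbour directly
# above is centred at (0.5, 1.5), as in the post.
CENTRES = (-0.5, 0.5, 1.5)


def clamp01(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def smooth_light(u, v, light, area_of_influence=1.5):
    """Average the 3x3 light values, weighted by distance from pixel (u, v).

    light[row][col] holds the light value of the tile centred at
    (CENTRES[col], CENTRES[row]); light[1][1] is the tile itself.
    """
    weighted = 0.0  # dot(weights, light values)
    total = 0.0     # dot(weights, ones)
    for row, cy in enumerate(CENTRES):
        for col, cx in enumerate(CENTRES):
            d = math.hypot(u - cx, v - cy)
            w = clamp01(1.0 - d / area_of_influence)  # assumed falloff
            weighted += w * light[row][col]
            total += w
    if total == 0.0:  # no tile in range: fall back to the tile's own value
        return light[1][1]
    return weighted / total


grid = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]  # bright row above
print(smooth_light(0.5, 0.9, grid), smooth_light(0.5, 0.1, grid))
```

    The division by the summed weights is what the const (1,1,1) dot products provide in the graph: it normalises the result so that a uniformly lit neighbourhood returns exactly that light value.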

    I'll post about the other sections of the shader later.
     
  14. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    1. Batching is broken because each tile's material is passed different properties unique to that tile, i.e. its neighbor values.

    2. I wanted to do it in a shader as that would move the computation to the GPU and allow for better parallelization.

    3. You are right, I do need to look into moving some of the discrete operations, such as outlines, to the source art.

    My thinking going into this was that the CPU would be slower at executing the per-pixel lighting calculations for each tile, which has a 128x128 texture tied to it. Take for instance the mathematical operations shown above for the smooth lighting. At any given time there can be up to ~900 tiles on screen. Calculating those beforehand and storing the result would use a lot of memory. This is actually a PC game; I can get ~160fps with drops to 70fps on my GTX 1060. However, there is a lot more code to be added to the game and I am trying to optimize as I go. Once the day/night cycle is added the lighting will need to recalculate every frame anyway, so I think keeping it in a shader makes sense. I am not trying to disagree with any point you have made, just offering up the reasons I chose to use a shader. Part of this choice is also because I could not think of a reasonable way to move the smooth lighting calculations to the CPU without massive performance hits. I am very experienced with the Unity Job System and Burst, but I still do not think they would match the shader's performance for this task. I am already doing the per-tile lighting calculations in a few different IJobParallelFor jobs. The performance of these jobs is right on the edge of acceptable right now.
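
    For a sense of scale, the figures quoted above (up to ~900 tiles, 128x128 pixels each, 160 and 70 fps) work out as follows in Python. The bytes-per-pixel formats are assumptions, since the post does not say how a precomputed result would be stored:

```python
TILES_ON_SCREEN = 900   # upper bound quoted in the post
TILE_SIZE = 128         # pixels per side


def mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


def frame_ms(fps: float) -> float:
    """Frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps


pixels = TILES_ON_SCREEN * TILE_SIZE * TILE_SIZE
one_channel_mib = mib(pixels * 1)  # assumed: one 8-bit light value per pixel
rgba_mib = mib(pixels * 4)         # assumed: 8-bit RGBA per pixel

print(pixels)                      # per-pixel lighting results to store
print(one_channel_mib, rgba_mib)
print(frame_ms(160), frame_ms(70))
```

    Under these assumptions a fully precomputed result is tens of megabytes for the visible tiles alone, and the drop from 160 to 70 fps corresponds to roughly 8 ms of extra frame time.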
     
  15. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    Thank you for taking the time to try and help me with my problem. Your opening statement around using multiple materials/shaders seems really promising to me. I have thought lots about doing that, however I am unsure on how to combine them back into the result you see above. I am fairly new when it comes to the nitty gritty of shaders. For instance I understand what render passes are, however I do not know how I might split the shader up and have different parts of it compute on different passes.

    I actually do not think the shader is too much of an issue as it is now, but it will continue to grow as I add more features, and I think things may slow down when I add the day/night cycle. Let me post some performance indicators from the profiler and the SRP Batcher profiler in my next comment. Just need to restart my PC :) I considered moving to Amplify drastic compared with simpler changes such as swapping nodes, and because I would need to buy it :D

    I think when looking at this shader, the main portion to be optimized is the smooth lighting calculations. A lot of the rest of the shader, i.e. the outlines and faux AO, could be removed and replaced with textures instead. However, replacing these with textures would lose the customizability of the look of the tiles, i.e. the darkness and distance of the AO, the strength of the internal highlights and shadows, etc.
     
  16. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,352
    The SRP Batcher does not work like the built in mesh batching tools. It can batch separate materials with different material properties. These may still show as separate “draws” in the frame debugger, but they are much, much faster. Any shader created with Shader Graph is SRP Batching compatible (assuming no custom function nodes that import in a file with additional uniforms).
     
  17. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    Thanks for the info! That is interesting, I'll post the SRP profiler results in just a sec :) And the standard profiler results
     
  18. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    So when underground and fully zoomed out (resulting in the greatest number of tiles being on screen), this is what the SRP Batcher Profiler says:
    upload_2021-7-29_20-31-47.png

    Here's what the overall scene looks like (ignore sh**y UI, it's all VERY wip):
    upload_2021-7-29_20-33-0.png
    Here's what a standard frame looks like (it sits at this ms pretty consistently with the monitor set to a 60hz refresh rate and vblank every 1):
    upload_2021-7-29_20-37-41.png
    However, every so often the frame ms goes through the roof and the profiler looks like this:
    upload_2021-7-29_20-39-41.png
    According to the hierarchy in the main thread PostLateUpdate.FinishFrameRendering is the longest executing. In the render thread it is UniversalRenderPipeline.RenderSingleCamera: Main Camera. Here's pics of both main thread and render thread:
    upload_2021-7-29_20-42-43.png
    I don't understand why these occasional spikes occur, they seem completely random and seem to happen with no new input, ie no lighting being recalculated, no new chunks being rendered etc. It's always the same culprits as well according to the hierarchy.
     
  19. killerfrogy

    killerfrogy

    Joined:
    Sep 17, 2013
    Posts:
    12
    At the time the spike occurred the SRP Batcher Profiler showed ~31ms, however as stated, there didn't seem to be any new chunks loaded or anything.
     
  20. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,780
    If you get random spikes, look at the garbage collector.
    See the red spots in your profiler.

    Regarding shader corners and sides as textures, you have a few options.
    You can create an individual texture for each variation of the tile side/corner.
    Or even better, use a texture atlas of tile variations, and use a position offset into the atlas to get the relevant type of tile to render.
    Another option is to use one type of corner and rotate it as needed by 90, 180 or -90 degrees. Similarly for long sides, but only 90 and -90 degrees.
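
    A small Python sketch of both ideas, assuming a grid atlas with equally sized cells indexed row by row from the bottom-left. The function names and the atlas layout are hypothetical:

```python
def atlas_uv(index: int, cols: int, rows: int):
    """Return (u_offset, v_offset, u_scale, v_scale) for atlas cell `index`.

    Cells are assumed to be numbered row by row starting at the bottom-left.
    """
    col = index % cols
    row = index // cols
    return (col / cols, row / rows, 1.0 / cols, 1.0 / rows)


def rotate_uv(u: float, v: float, quarter_turns: int):
    """Rotate a UV in [0, 1]^2 about (0.5, 0.5) in 90 degree steps.

    Positive values rotate the coordinate counter-clockwise; -1 gives the
    same result as 3.
    """
    for _ in range(quarter_turns % 4):
        u, v = 1.0 - v, u
    return u, v


# Cell 5 of a 4x4 atlas, and a tile-local UV rotated by 90 degrees.
u_off, v_off, u_scale, v_scale = atlas_uv(5, 4, 4)
ru, rv = rotate_uv(1.0, 0.0, 1)
print((u_off + ru * u_scale, v_off + rv * v_scale))
```

    With this scheme each tile only needs a cell index and a rotation count, which is far less per-tile data than a full set of shader parameters.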
     
  21. MartinTilo

    MartinTilo

    Unity Technologies

    Joined:
    Aug 16, 2017
    Posts:
    2,461
    Bit late to the party but if you're still hitting this, my 2 cents:
    • From the timeline view it looks like you're spawning new enemies while the spike is occurring, so that may be related?
    • If you don't need the Memory Profiler Module, turn it off. You're paying for it with 6ms in the spike frames, possibly because it has a lot of textures (or other objects) to go through to gather up the stats. Removing the module gets rid of that overhead and at least gets you a clearer picture of what's going on.
     