
Transferring array data between passes

Discussion in 'Shaders' started by Boolet, Oct 27, 2020.

  1. Boolet

    I have a very expensive cloud shader built into a single shader pass. It essentially raymarches from every pixel that the object covers during the fragment shader.

    I want to reduce the cost by raymarching only on every other pixel, or every third, or whatever, and then interpolating over the remaining pixels.

    The process I have in mind is to run the original raymarch in the fragment shader of the first pass, clipping the unwanted pixels, then take that output into a second pass and fill in the missing pixels from the first pass's results.

    Here is the test shader I've written so far.

    Code (CSharp):
    Shader "Unlit/BufferTest"
    {
        Properties
        {
            _Granularity ("Granularity", float) = 5
        }
        SubShader
        {
            CGINCLUDE

            #include "UnityCG.cginc"

            uniform RWTexture2D<fixed4> _LowRezCloudTex : register(u2);
            float _Granularity;

            ENDCG

            Tags { "Queue" = "Transparent" "RenderType" = "Transparent" }

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                float4 vert (float4 vertex : POSITION) : SV_POSITION
                {
                    return UnityObjectToClipPos(vertex);
                }

                fixed4 frag (float4 vertex : SV_POSITION) : SV_Target
                {
                    // Write this pixel's value into the RW texture for the second pass to read.
                    int2 modCoords = vertex.xy / _ScreenParams.xy * _Granularity;
                    _LowRezCloudTex[modCoords] = fixed4(vertex.xy / _ScreenParams.xy, 0, 1);
                    return fixed4(0, 0, 0, 1);
                }
                ENDCG
            }

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                float4 vert (float4 vertex : POSITION) : SV_POSITION
                {
                    return UnityObjectToClipPos(vertex);
                }

                fixed4 frag (float4 vertex : SV_POSITION) : SV_Target
                {
                    // Read back the value written by the first pass.
                    int2 modCoords = vertex.xy / _ScreenParams.xy * _Granularity;
                    return _LowRezCloudTex[modCoords];
                }
                ENDCG
            }
        }
    }
    As you can see, I've attempted to write to a buffer in the first pass and then read from that buffer in the second pass. This works, but the uniform RWTexture2D<fixed4> appears to be only 16x16 entries in size, which isn't enough. Please help!

    Note: I'm using the old Built-In render pipeline in Unity 2018, if it matters.
     
  2. bgolus

    Not a bad idea, but you’re missing a big issue.

    Skipping every other pixel isn’t actually any faster, because GPUs don’t render one pixel at a time.

    The smallest number of pixels a GPU renders at a time is a 2x2 set, called a "pixel quad". If any one pixel in that quad is rendered, they're all rendered, even if three of them aren't ever going to be seen. So rendering every other pixel is the same cost as rendering every pixel. In fact it may be slightly more expensive, because now you're also paying for the extra code that decides which pixels to skip.

    So, you could change it to render a checkerboard of quads; there are certainly techniques that use this to help performance. But if you're doing it with discard; or clip(x); (which is just if (x < 0.0) discard;), you're still not going to save any performance, because GPUs also don't render just a single pixel quad at a time: they render in waves / warps of 32 or 64 pixels (depending on the GPU). The same shader runs on all of those pixels, and the cost of the most expensive pixel in the group is how long they all take to finish. Moreover, calling discard in the shader means it still has to run the shader before it knows it doesn't have to render that pixel, which is too late to improve performance.

    So the trick, if you're going to render a checkerboard of quads, is to mask the pixels using the stencil buffer or z depth (assuming you can use early depth rejection, which you may not be able to if you output depth from the fragment shader or use discard). Then the GPU knows beforehand which pixel quads it can avoid rendering as part of the wave, and you actually improve performance.


    But, honestly, it's all moot, because rendering to an RW texture is the wrong way to do this. You should instead render to a separate render texture directly, then in the "second pass" (actually a second shader) sample from that render texture. That way you don't have to do any of that complicated stuff. Just use a render texture that's lower resolution than the main target and take advantage of hardware bilinear sampling when rendering it back into the main render target. This is pretty common for effects rendering. Search for "off screen particles": there's an old Nvidia GPU Gems article on it, as well as some more recent presentations from Bungie for Destiny (though there are something like three different versions of the Destiny tech papers, and the later two talk about how they didn't actually use the version they showed off before because it didn't work).
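
    To make the general idea concrete, here's a rough sketch of the pattern (an image-effect style blit rather than your exact cloud setup; the class and material names are placeholders for your own):

    Code (csharp):
    using UnityEngine;

    // Rough illustration of the low-res render + bilinear upsample idea.
    // Attach to the camera. "cloudMaterial" and "compositeMaterial" are placeholders.
    public class LowResEffectSketch : MonoBehaviour
    {
        public Material cloudMaterial;      // the expensive raymarch shader
        public Material compositeMaterial;  // a cheap shader that blends the low res result over the frame

        void OnRenderImage(RenderTexture src, RenderTexture dst)
        {
            // Half the resolution in each dimension means a quarter of the pixels to shade.
            RenderTexture lowRes = RenderTexture.GetTemporary(src.width / 2, src.height / 2, 0);
            lowRes.filterMode = FilterMode.Bilinear;

            // Run the expensive shader at low resolution...
            Graphics.Blit(src, lowRes, cloudMaterial);

            // ...then composite back at full resolution; hardware bilinear filtering
            // fills in the in-between pixels for free.
            compositeMaterial.SetTexture("_LowResCloud", lowRes);
            Graphics.Blit(src, dst, compositeMaterial);

            RenderTexture.ReleaseTemporary(lowRes);
        }
    }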
     
  3. Boolet

    Thanks so much for the explanation, Bgolus!

    That's really interesting. I had no idea that GPUs worked like that. I only started learning how to write shaders last month, and I still have much to learn.

    I haven't looked into the possibility of rendering to a render texture and then sampling from it. So far I've been focusing only on processes that are part of Unity's built-in pipeline, and with my basic knowledge I couldn't see a way to make just this one shader render to a separate target.

    From my cursory research and limited knowledge, making a single object's shader output to a render texture requires one of the following:

    1. A dedicated additional camera in the scene that has a render target set to a render texture asset.
    2. Using a setup script to assign multiple render targets to a single camera.
    3. Calling Blit, which I'm not certain how to use, but which can fill out a render texture when called.

    Plus, for any of those routes, setting up a second shader beforehand that reads from the render texture during its pass.

    So if I'm understanding you correctly, if I use any of these routes and simply use a lower resolution render texture, the GPU will use fewer cycles because it's shading fewer pixels in total. That's very handy.

    However, if I want to implement that, how would I go about creating the render texture in a way that still lets me utilize Unity's depth texture? The way the clouds work is by sampling points beginning at the near side of the cloud volume and continuing until they hit the back side or the depth of that pixel (basically).

    If I run Blit in OnPreRender() I don't think I have access to the depth texture for that frame yet, and if I run it in OnPostRender() the result gets drawn on top of everything that doesn't write to the depth buffer.

    Let me share my latest test code, as well as the relevant bit from the cloud shader.

    Test Code:
    Code (CSharp):
    Shader "Unlit/BufferTest2"
    {
        Properties
        {
        }
        SubShader
        {
            CGINCLUDE

            #include "UnityCG.cginc"

            RWStructuredBuffer<fixed4> _LowRezCloudBuffer;
            int _BufferWidth;
            int _Subsample;

            // Flatten 2D low-res coordinates into a 1D index into the structured buffer.
            int bufferIndex(int2 coordinates){
                return coordinates.y * _BufferWidth + coordinates.x;
            }

            ENDCG

            Tags { "RenderType" = "Opaque" "Queue" = "Geometry" }
            LOD 100
            Cull Back ZWrite On ZTest LEqual

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                void vert (float4 vertex : POSITION, out float4 outpos : SV_POSITION)
                {
                    outpos = UnityObjectToClipPos(vertex);
                }

                fixed4 frag (UNITY_VPOS_TYPE screenPos : VPOS) : SV_Target
                {
                    // Intended to clip all but one pixel in each _Subsample x _Subsample block.
                    int2 clipCoords = ((screenPos.xy + _Subsample * 0.5) % _Subsample) - _Subsample;
                    clip(clipCoords);

                    // Store the surviving pixel's color in the buffer for the second pass.
                    int index = bufferIndex(screenPos.xy / _Subsample);
                    fixed4 color = fixed4(screenPos.xy / _ScreenParams.xy, 0, 1);
                    _LowRezCloudBuffer[index] = color;
                    return color;
                }
                ENDCG
            }

            Tags { "RenderType" = "Opaque" }
            LOD 100
            Cull Back ZWrite On ZTest LEqual

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                // Bilinearly interpolate between the four nearest stored samples.
                fixed4 multisample(float2 screenPos){
                    int baseIndex = bufferIndex(screenPos.xy / _Subsample);

                    fixed4 bottomLeft = _LowRezCloudBuffer[baseIndex];
                    fixed4 bottomRight = _LowRezCloudBuffer[baseIndex + 1];
                    fixed4 topLeft = _LowRezCloudBuffer[baseIndex + _BufferWidth];
                    fixed4 topRight = _LowRezCloudBuffer[baseIndex + _BufferWidth + 1];

                    fixed4 bottom = lerp(bottomLeft, bottomRight, (screenPos.x % _Subsample) / _Subsample);
                    fixed4 top = lerp(topLeft, topRight, (screenPos.x % _Subsample) / _Subsample);

                    return lerp(bottom, top, (screenPos.y % _Subsample) / _Subsample);
                }

                void vert (float4 vertex : POSITION, out float4 outpos : SV_POSITION)
                {
                    outpos = UnityObjectToClipPos(vertex);
                }

                fixed4 frag (UNITY_VPOS_TYPE screenPos : VPOS) : SV_Target
                {
                    return multisample(screenPos.xy);
                }
                ENDCG
            }
        }
    }
    Here's what the test shader looks like, storing one pixel out of every 3x3 block.


    And here it is without the data readback phase, again storing one pixel out of every 3x3 block.


    The cloud shader is very long, and most of the code is adapted straight from Sebastian Lague (all due credit!), but here is the fragment shader.

    Code (CSharp):
    fixed4 frag (v2f i, UNITY_VPOS_TYPE screenPos : VPOS) : SV_Target
    {
        /* Only sample the cloud infrequently. The intermediate points will be interpolated. */
        int2 clipCoords = (((screenPos.xy + _Subsample * 0.5) % _Subsample) - _Subsample);
        clip(clipCoords * (_Subsample > 1 ? 1 : 0));

        /* Reconstruct this pixel's view ray from its screen position. */
        float screenX = remap(screenPos.x / _ScreenParams.x, 0, 1, -1, 1);
        float screenY = remap(screenPos.y / _ScreenParams.y, 0, 1, -1, 1);
        float4 clipSpaceVector = float4(screenX, screenY, 1, 1);
        float4 worldSpaceVector = mul(unity_CameraInvProjection, clipSpaceVector);
        float distToFrag = length(worldSpaceVector.xyz);

        /* Precompute data needed for cloud marching. */
        float3 boundsMax = mul(unity_ObjectToWorld, FRONTBOUND).xyz;
        float3 boundsMin = mul(unity_ObjectToWorld, BACKBOUND).xyz;
        float depth = LinearEyeDepth(tex2Dproj(_CameraDepthTexture, UNITY_PROJ_COORD(i.screenPosition)).x) * distToFrag;

        float3 worldDirection = normalize(i.vectorToSurface);
        float2 boxDistances = rayBoxDist(boundsMin, boundsMax, _WorldSpaceCameraPos.xyz, 1 / worldDirection);
        float distInsideBox = boxDistances.y;
        float distToBox = max(boxDistances.x, 0);

        /* Skip pixels where the cloud volume is entirely hidden behind opaque geometry. */
        clip(depth - distToBox);

        float blueOffset = tex2D(_BlueNoise, float2(screenX, screenY));
        /* This is the super expensive method that is supposed to be saved by clipping. */
        float2 cloudput = cloudMarch(normalize(i.vectorToSurface), depth, boundsMax, boundsMin, blueOffset);
        float3 lightColoring = (_LightColor0.rgb * _LightColorInfluence) + 1 - _LightColorInfluence;

        fixed4 result = fixed4(cloudput.x * lightColoring * _Color.rgb, 1 - cloudput.y * _Color.a);
        int index = _Subsample > 1 ? bufferIndex(screenPos.xy / _Subsample) : bufferIndex(screenPos.xy);
        _LowRezCloudBuffer[index] = result;
        return 0;
    }
    So if it's true that this actually doesn't save on computing power, how would I go about setting it up to use a render texture properly?
     
  4. bgolus

    The answer is to use command buffers.

    You can create a command buffer that sets a render texture as the current render target, renders an existing renderer in the scene with a custom material override (which also lets you specify which pass of that material's shader to use), and then sets the render texture as a shader global for later use. You can then render your object normally, sampling from the render texture, or also render it with the command buffer by setting the render target back to the camera's original target and rendering again. As long as the command buffer runs during an event after the camera depth texture has been generated, like CameraEvent.BeforeForwardAlpha, you'll be able to access the depth texture the same way you are now.
     
  5. bgolus

    Code (csharp):
    // do this OnEnable
    int lowResCloudID = Shader.PropertyToID("_LowResCloud"); // low res cloud texture shader variable name
    RenderTargetIdentifier rtid = new RenderTargetIdentifier(lowResCloudID);

    CommandBuffer cb = new CommandBuffer();
    cb.name = "Give me a name";
    // -2 means 1/2 of screen resolution
    cb.GetTemporaryRT(lowResCloudID, -2, -2, 0, FilterMode.Linear, RenderTextureFormat.Default);
    cb.SetRenderTarget(rtid);
    cb.DrawRenderer(cloudObjectMeshRenderer, expensiveCloudMaterial, 0, 0 /* or whichever pass the expensive cloud pass is on */);
    cb.SetGlobalTexture(lowResCloudID, rtid);

    myCamera.AddCommandBuffer(CameraEvent.BeforeForwardAlpha, cb);
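
    And here's roughly how that might be wrapped in a component so the command buffer gets added and removed cleanly. The renderer, material, and camera fields are whatever you hook up in the inspector, and the ClearRenderTarget call is an extra I've added so the temporary RT starts out empty:

    Code (csharp):
    using UnityEngine;
    using UnityEngine.Rendering;

    public class LowResCloudCommandBuffer : MonoBehaviour
    {
        public Camera myCamera;
        public Renderer cloudObjectMeshRenderer;
        public Material expensiveCloudMaterial;

        CommandBuffer cb;

        void OnEnable()
        {
            int lowResCloudID = Shader.PropertyToID("_LowResCloud");
            RenderTargetIdentifier rtid = new RenderTargetIdentifier(lowResCloudID);

            cb = new CommandBuffer();
            cb.name = "Low res clouds";
            cb.GetTemporaryRT(lowResCloudID, -2, -2, 0, FilterMode.Linear, RenderTextureFormat.Default);
            cb.SetRenderTarget(rtid);
            cb.ClearRenderTarget(false, true, Color.clear); // extra: clear the temp RT before drawing into it
            cb.DrawRenderer(cloudObjectMeshRenderer, expensiveCloudMaterial, 0, 0);
            cb.SetGlobalTexture(lowResCloudID, rtid);

            myCamera.AddCommandBuffer(CameraEvent.BeforeForwardAlpha, cb);
        }

        void OnDisable()
        {
            if (cb != null)
            {
                myCamera.RemoveCommandBuffer(CameraEvent.BeforeForwardAlpha, cb);
                cb.Release();
                cb = null;
            }
        }
    }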
     
  6. Boolet

    Thanks a ton. I'll implement this, and have a lot of fun learning more about the command buffer system!
     