
Transferring array data between passes

Discussion in 'Shaders' started by Boolet, Oct 27, 2020.

  1. Boolet

    I have a very expensive cloud shader built into a single shader pass. It essentially raymarches from every pixel that the object covers during the fragment shader.

    I want to reduce the cost by raymarching only on every other pixel, or every third, or whatever, and then interpolating over the remaining pixels.

    The process I have in mind is to run the original raymarch in the fragment shader of the first pass, clipping the unwanted pixels, then take that output into a second pass and fill in the missing pixels from the first pass's results.

    Here is the test shader I've written so far.

    Code (CSharp):
    Shader "Unlit/BufferTest"
    {
        Properties
        {
            _Granularity ("Granularity", float) = 5
        }
        SubShader
        {
            CGINCLUDE

            #include "UnityCG.cginc"

            uniform RWTexture2D<fixed4> _LowRezCloudTex : register(u2);
            float _Granularity;

            ENDCG

            Tags { "Queue" = "Transparent" "RenderType" = "Transparent" }

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                float4 vert (float4 vertex : POSITION) : SV_POSITION
                {
                    return UnityObjectToClipPos(vertex);
                }

                fixed4 frag (float4 vertex : SV_POSITION) : SV_Target
                {
                    // Write this pixel's value into the RW texture for the second pass to read.
                    int2 modCoords = vertex.xy / _ScreenParams.xy * _Granularity;
                    _LowRezCloudTex[modCoords] = fixed4(vertex.xy / _ScreenParams.xy, 0, 1);
                    return fixed4(0, 0, 0, 1);
                }
                ENDCG
            }

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                float4 vert (float4 vertex : POSITION) : SV_POSITION
                {
                    return UnityObjectToClipPos(vertex);
                }

                fixed4 frag (float4 vertex : SV_POSITION) : SV_Target
                {
                    // Read back the value written by the first pass.
                    int2 modCoords = vertex.xy / _ScreenParams.xy * _Granularity;
                    return _LowRezCloudTex[modCoords];
                }
                ENDCG
            }
        }
    }
    As you can see, I've attempted to write to a buffer in the first pass and then read from that buffer in the second pass. This works, but the uniform RWTexture2D<fixed4> appears to be only 16x16 entries in size, which isn't enough. Please help!

    Note: I'm using the old Built-In render pipeline in Unity 2018, if it matters.
     
  2. bgolus

    Not a bad idea, but you’re missing a big issue.

    Skipping every other pixel isn’t actually any faster, because GPUs don’t render one pixel at a time.

    The smallest number of pixels a GPU renders at a time is a 2x2 set, called a "pixel quad". If any one pixel in that quad is rendered, they're all rendered, even if three of them aren't ever going to be seen. So rendering every other pixel is the same cost as rendering every pixel. In fact it may be slightly more expensive, because now you're also paying for the extra code that decides which pixels to skip.

    So, you could change it to render a checkerboard of quads; there are certainly techniques that use this to help performance. But if you're doing it with discard; or clip(x); (which is just if (x < 0.0) discard;), you're still not going to save any performance, because GPUs also don't render just a single pixel quad at a time: they render in waves / warps of 32 or 64 pixels (depending on the GPU). The same shader runs on all of those pixels, and the cost of the most expensive pixel in the group is how long they all take to finish. Moreover, calling discard in the shader means it still has to run the shader before it knows it doesn't have to render that pixel, which is too late to improve performance.

    So the trick, if you're going to render a checkerboard of quads, is to mask the pixels using the stencil buffer or z depth (assuming you can use early depth rejection, which you may not be able to if you output depth from the fragment shader or use discard). Then the GPU knows beforehand which pixel quads it can avoid rendering as part of the wave, and you actually improve performance.


    But, honestly, it's all moot, because rendering to an RW texture is the wrong way to do this. You should instead render to a separate render texture directly, then in the "second pass" (actually a second shader) sample from that render texture. That way you don't have to do any of that complicated stuff. Just use a render texture that's lower resolution than the main target and take advantage of hardware bilinear sampling when rendering it back into the main render target. This is pretty common for effects rendering. Search for "off screen particles": there's an old Nvidia GPU Gems article on it, as well as some more recent presentations from Bungie for Destiny (though there are something like three different versions of the Destiny tech papers, and the later two talk about how they didn't actually use the version they showed off before because it didn't work).
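
    To make the general idea concrete, here's a rough sketch of the pattern (an image-effect style blit rather than your exact cloud setup; the class and material names are placeholders for your own):

    Code (csharp):
    using UnityEngine;

    // Rough illustration of the low-res render + bilinear upsample idea.
    // Attach to the camera. "cloudMaterial" and "compositeMaterial" are placeholders.
    public class LowResEffectSketch : MonoBehaviour
    {
        public Material cloudMaterial;      // the expensive raymarch shader
        public Material compositeMaterial;  // a cheap shader that blends the low res result over the frame

        void OnRenderImage(RenderTexture src, RenderTexture dst)
        {
            // Half the resolution in each dimension means a quarter of the pixels to shade.
            RenderTexture lowRes = RenderTexture.GetTemporary(src.width / 2, src.height / 2, 0);
            lowRes.filterMode = FilterMode.Bilinear;

            // Run the expensive shader at low resolution...
            Graphics.Blit(src, lowRes, cloudMaterial);

            // ...then composite back at full resolution; hardware bilinear filtering
            // fills in the in-between pixels for free.
            compositeMaterial.SetTexture("_LowResCloud", lowRes);
            Graphics.Blit(src, dst, compositeMaterial);

            RenderTexture.ReleaseTemporary(lowRes);
        }
    }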
     
  3. Boolet

    Thanks so much for the explanation, Bgolus!

    That's really interesting. I had no idea that GPUs worked like that. I only started learning how to write shaders last month, and I still have much to learn.

    I haven't looked into the possibility of rendering to a render texture and then sampling from it. So far I've been focusing only on processes that are part of Unity's built-in pipeline, and with my basic knowledge I couldn't see a way to make just this one shader render to a separate target.

    From my cursory research and limited knowledge, making a single object's shader output to a render texture requires one of the following:

    1. A dedicated additional camera in the scene that has a render target set to a render texture asset.
    2. Using a setup script to assign multiple render targets to a single camera.
    3. Calling Blit, which I'm not certain how to use, but which can fill out a render texture when called.

    Plus, for any of those routes, setting up a second shader beforehand that reads from the render texture during its pass.

    So if I'm understanding you correctly, if I use any of these routes and simply use a lower resolution render texture, the GPU will use fewer cycles because it's shading fewer pixels in total. That's very handy.

    However, if I want to implement that, how would I go about creating the render texture in a way that still lets me utilize Unity's depth texture? The way the clouds work is by sampling points beginning at the near side of the cloud volume and continuing until they hit the back side or the depth of that pixel (basically).

    If I run Blit in OnPreRender() I don't think I have access to the depth texture for that frame yet, and if I run it in OnPostRender() the result gets drawn on top of everything that doesn't write to the depth buffer.

    Let me share my latest test code, as well as the relevant bit from the cloud shader.

    Test Code:
    Code (CSharp):
    Shader "Unlit/BufferTest2"
    {
        Properties
        {
        }
        SubShader
        {
            CGINCLUDE

            #include "UnityCG.cginc"

            RWStructuredBuffer<fixed4> _LowRezCloudBuffer;
            int _BufferWidth;
            int _Subsample;

            // Flatten 2D low-res coordinates into a 1D index into the structured buffer.
            int bufferIndex(int2 coordinates){
                return coordinates.y * _BufferWidth + coordinates.x;
            }

            ENDCG

            Tags { "RenderType" = "Opaque" "Queue" = "Geometry" }
            LOD 100
            Cull Back ZWrite On ZTest LEqual

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                void vert (float4 vertex : POSITION, out float4 outpos : SV_POSITION)
                {
                    outpos = UnityObjectToClipPos(vertex);
                }

                fixed4 frag (UNITY_VPOS_TYPE screenPos : VPOS) : SV_Target
                {
                    // Intended to clip all but one pixel in each _Subsample x _Subsample block.
                    int2 clipCoords = ((screenPos.xy + _Subsample * 0.5) % _Subsample) - _Subsample;
                    clip(clipCoords);

                    // Store the surviving pixel's color in the buffer for the second pass.
                    int index = bufferIndex(screenPos.xy / _Subsample);
                    fixed4 color = fixed4(screenPos.xy / _ScreenParams.xy, 0, 1);
                    _LowRezCloudBuffer[index] = color;
                    return color;
                }
                ENDCG
            }

            Tags { "RenderType" = "Opaque" }
            LOD 100
            Cull Back ZWrite On ZTest LEqual

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag

                // Bilinearly interpolate between the four nearest stored samples.
                fixed4 multisample(float2 screenPos){
                    int baseIndex = bufferIndex(screenPos.xy / _Subsample);

                    fixed4 bottomLeft = _LowRezCloudBuffer[baseIndex];
                    fixed4 bottomRight = _LowRezCloudBuffer[baseIndex + 1];
                    fixed4 topLeft = _LowRezCloudBuffer[baseIndex + _BufferWidth];
                    fixed4 topRight = _LowRezCloudBuffer[baseIndex + _BufferWidth + 1];

                    fixed4 bottom = lerp(bottomLeft, bottomRight, (screenPos.x % _Subsample) / _Subsample);
                    fixed4 top = lerp(topLeft, topRight, (screenPos.x % _Subsample) / _Subsample);

                    return lerp(bottom, top, (screenPos.y % _Subsample) / _Subsample);
                }

                void vert (float4 vertex : POSITION, out float4 outpos : SV_POSITION)
                {
                    outpos = UnityObjectToClipPos(vertex);
                }

                fixed4 frag (UNITY_VPOS_TYPE screenPos : VPOS) : SV_Target
                {
                    return multisample(screenPos.xy);
                }
                ENDCG
            }
        }
    }
    Here's what the test shader looks like, storing one pixel out of every 3x3 block.


    And here it is without the data readback phase, again storing one pixel out of every 3x3 block.


    The cloud shader is very long, and most of the code is adapted straight from Sebastian Lague (all due credit!), but here is the fragment shader.

    Code (CSharp):
    fixed4 frag (v2f i, UNITY_VPOS_TYPE screenPos : VPOS) : SV_Target
    {
        /* Only sample the cloud infrequently. The intermediate points will be interpolated. */
        int2 clipCoords = (((screenPos.xy + _Subsample * 0.5) % _Subsample) - _Subsample);
        clip(clipCoords * (_Subsample > 1 ? 1 : 0));

        /* Reconstruct this pixel's view ray from its screen position. */
        float screenX = remap(screenPos.x / _ScreenParams.x, 0, 1, -1, 1);
        float screenY = remap(screenPos.y / _ScreenParams.y, 0, 1, -1, 1);
        float4 clipSpaceVector = float4(screenX, screenY, 1, 1);
        float4 worldSpaceVector = mul(unity_CameraInvProjection, clipSpaceVector);
        float distToFrag = length(worldSpaceVector.xyz);

        /* Precompute data needed for cloud marching. */
        float3 boundsMax = mul(unity_ObjectToWorld, FRONTBOUND).xyz;
        float3 boundsMin = mul(unity_ObjectToWorld, BACKBOUND).xyz;
        float depth = LinearEyeDepth(tex2Dproj(_CameraDepthTexture, UNITY_PROJ_COORD(i.screenPosition)).x) * distToFrag;

        float3 worldDirection = normalize(i.vectorToSurface);
        float2 boxDistances = rayBoxDist(boundsMin, boundsMax, _WorldSpaceCameraPos.xyz, 1 / worldDirection);
        float distInsideBox = boxDistances.y;
        float distToBox = max(boxDistances.x, 0);

        /* Skip pixels where the cloud volume is entirely hidden behind opaque geometry. */
        clip(depth - distToBox);

        float blueOffset = tex2D(_BlueNoise, float2(screenX, screenY));
        /* This is the super expensive method that is supposed to be saved by clipping. */
        float2 cloudput = cloudMarch(normalize(i.vectorToSurface), depth, boundsMax, boundsMin, blueOffset);
        float3 lightColoring = (_LightColor0.rgb * _LightColorInfluence) + 1 - _LightColorInfluence;

        fixed4 result = fixed4(cloudput.x * lightColoring * _Color.rgb, 1 - cloudput.y * _Color.a);
        int index = _Subsample > 1 ? bufferIndex(screenPos.xy / _Subsample) : bufferIndex(screenPos.xy);
        _LowRezCloudBuffer[index] = result;
        return 0;
    }
    So if it's true that this actually doesn't save on computing power, how would I go about setting it up to use a render texture properly?
     
  4. bgolus

    The answer is to use command buffers.

    You can create a command buffer that sets a render texture as the current render target, renders an existing renderer in the scene with a custom material override (which also lets you specify which pass of that material's shader to use), and then sets the render texture as a shader global for later use. You can then render your object normally, sampling from the render texture, or also render it with the command buffer by setting the render target back to the camera's original target and rendering again. As long as the command buffer runs during an event after the camera depth texture has been generated, like CameraEvent.BeforeForwardAlpha, you'll be able to access the depth texture the same way you are now.
     
  5. bgolus

    Code (csharp):
    // do this OnEnable
    int lowResCloudID = Shader.PropertyToID("_LowResCloud"); // low res cloud texture shader variable name
    RenderTargetIdentifier rtid = new RenderTargetIdentifier(lowResCloudID);

    CommandBuffer cb = new CommandBuffer();
    cb.name = "Give me a name";
    // -2 means 1/2 of screen resolution
    cb.GetTemporaryRT(lowResCloudID, -2, -2, 0, FilterMode.Linear, RenderTextureFormat.Default);
    cb.SetRenderTarget(rtid);
    cb.DrawRenderer(cloudObjectMeshRenderer, expensiveCloudMaterial, 0, 0 /* or whichever pass the expensive cloud pass is on */);
    cb.SetGlobalTexture(lowResCloudID, rtid);

    myCamera.AddCommandBuffer(CameraEvent.BeforeForwardAlpha, cb);
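
    And here's roughly how that might be wrapped in a component so the command buffer gets added and removed cleanly. The renderer, material, and camera fields are whatever you hook up in the inspector, and the ClearRenderTarget call is an extra I've added so the temporary RT starts out empty:

    Code (csharp):
    using UnityEngine;
    using UnityEngine.Rendering;

    public class LowResCloudCommandBuffer : MonoBehaviour
    {
        public Camera myCamera;
        public Renderer cloudObjectMeshRenderer;
        public Material expensiveCloudMaterial;

        CommandBuffer cb;

        void OnEnable()
        {
            int lowResCloudID = Shader.PropertyToID("_LowResCloud");
            RenderTargetIdentifier rtid = new RenderTargetIdentifier(lowResCloudID);

            cb = new CommandBuffer();
            cb.name = "Low res clouds";
            cb.GetTemporaryRT(lowResCloudID, -2, -2, 0, FilterMode.Linear, RenderTextureFormat.Default);
            cb.SetRenderTarget(rtid);
            cb.ClearRenderTarget(false, true, Color.clear); // extra: clear the temp RT before drawing into it
            cb.DrawRenderer(cloudObjectMeshRenderer, expensiveCloudMaterial, 0, 0);
            cb.SetGlobalTexture(lowResCloudID, rtid);

            myCamera.AddCommandBuffer(CameraEvent.BeforeForwardAlpha, cb);
        }

        void OnDisable()
        {
            if (cb != null)
            {
                myCamera.RemoveCommandBuffer(CameraEvent.BeforeForwardAlpha, cb);
                cb.Release();
                cb = null;
            }
        }
    }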
     
  6. Boolet

    Thanks a ton. I'll implement this, and have a lot of fun learning more about the command buffer system!
     