Passing RenderTexture between Kernels on a Compute Shader

Discussion in 'Shaders' started by Cherubim79, Nov 9, 2017.

  1. Cherubim79

    Joined: May 3, 2014 | Posts: 56

    My code is set up like so:

    ComputeScript.cs
    Code (csharp):
        public ComputeShader compute;
        public RenderTexture result;
        public RenderTexture resultDistances;
        public RenderTexture fontTexture;
        public Texture2D inputFontTexture;
        public Texture2D inputImage;

        // Use this for initialization
        void Start () {
            inputImage = Resources.Load<Texture2D>("inputimage");
            int width = inputImage.width;
            int height = inputImage.height;
            int newWidth = 720;
            float scaleFactor = ((float)newWidth / (float)width);
            int newHeight = (int)((float)height * scaleFactor);

            Texture2D newTexture = new Texture2D(width, height, TextureFormat.RGB24, false);
            Color[] pixels = inputImage.GetPixels(0, 0, width, height);
            newTexture.SetPixels(0, 0, width, height, pixels);
            newTexture.Apply();

            TextureScaler.scale(newTexture, newWidth, newHeight, FilterMode.Point);

            inputFontTexture = Resources.Load<Texture2D>("vgafont");
            int fontWidth = inputFontTexture.width;
            int fontHeight = inputFontTexture.height;

            int kernel = compute.FindKernel("CSMain");
            result = new RenderTexture(fontWidth, fontHeight, 24);
            result.enableRandomWrite = true;
            result.Create();

            compute.SetTexture(kernel, "Result", result);
            compute.SetTexture(kernel, "ImageInput", newTexture);
            compute.SetTexture(kernel, "FontInput", inputFontTexture);
            compute.SetInt("InputCol", 20);
            compute.SetInt("InputRow", 10);
            compute.Dispatch(kernel, fontWidth / 8, fontHeight / 8, 1);

            int kernelProcess = compute.FindKernel("CSBestDistance");

            compute.SetTexture(kernelProcess, "Result", result);

            int bufferSize = 16 * 16 * 256;
            ComputeBuffer buffer = new ComputeBuffer(bufferSize, sizeof(float));
            compute.SetBuffer(kernelProcess, "ResultDistances", buffer);
            compute.Dispatch(kernelProcess, bufferSize / 16, 1, 1);
            float[] buffer2 = new float[bufferSize];
            buffer.GetData(buffer2);
            //Debug.Log(buffer2.Min());
            for (int z = 0; z < bufferSize; z++)
                Debug.Log(buffer2[z]);
            buffer.Dispose();
        }

    Compute Shader:
    Code (csharp):
    #pragma kernel CSMain
    #pragma kernel CSBestDistance

    RWTexture2D<float4> Result;
    Texture2D<float4> ImageInput;
    Texture2D<float4> FontInput;
    int InputCol;
    int InputRow;
    RWStructuredBuffer<float4> ResultDistances;

    [numthreads(8,8,1)]
    void CSMain (uint3 id : SV_DispatchThreadID)
    {
        int xoffset = InputCol * 9;
        int yoffset = InputRow * 16;
        int fontx = id.x % 9;
        int fonty = id.y % 16;
        float2 inputoffset = float2(xoffset + fontx, yoffset + fonty);

        float4 fontinput = FontInput[id.xy];
        float4 imageinput = ImageInput[inputoffset];
        float distanceInput = distance(fontinput, imageinput);
        Result[id.xy] = float4(distanceInput, distanceInput, distanceInput, 1);
    }

    [numthreads(16,1,1)]
    void CSBestDistance (uint3 id: SV_DispatchThreadID) {
        int x = id.x % 2304;
        int y = (id.x - x) / 2304;
        ResultDistances[id.x] = Result[float2(x,y)].x;
    }

    vgafont.png is a 2304x4096 PNG image for image matching. I want every 9x16 block of the first image to be compared to every 9x16 block of the vgafont image (it's a permutation of the 9x16 MS-DOS CP437 font).

    When I run the first Dispatch I get back my RenderTexture fine, it shows the distance as grayscale in the Editor, but passing it to the second Dispatch gives basically all white. How do I chain these operations effectively? Or would it be faster to use for loops in the processing of a single kernel (will that dynamically instantiate new threads on the GPU?).
     
  2. Zolden

    Joined: May 9, 2014 | Posts: 141

    SetTexture() has to be called for each kernel that needs to access the texture, so that part is correct.

    I think you might have exceeded the limit on the number of thread groups. You have (4096, 1, 1) thread groups on Dispatch(), and [16, 1, 1] threads per group inside the shader. Try (256, 1, 1) thread groups and [256, 1, 1] threads.

    The id.x range will remain the same in the kernel, since 256 * 256 = 4096 * 16 = 65,536.
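
The arithmetic behind that suggestion can be sketched in plain C#. This is only an illustration: the class and method names are hypothetical, and the helper simply computes how many thread groups are needed to cover a buffer for a given [numthreads] size.

```csharp
using System;

public static class DispatchMath
{
    // Number of thread groups needed to cover totalThreads items when each
    // group runs threadsPerGroup threads (ceiling division).
    public static int GroupCount(int totalThreads, int threadsPerGroup)
    {
        return (totalThreads + threadsPerGroup - 1) / threadsPerGroup;
    }

    public static void Main()
    {
        int bufferSize = 16 * 16 * 256; // 65,536 elements, as in the script above

        // Original setup: [numthreads(16,1,1)] needs 4096 groups.
        Console.WriteLine(GroupCount(bufferSize, 16));

        // Suggested setup: [numthreads(256,1,1)] needs 256 groups.
        Console.WriteLine(GroupCount(bufferSize, 256));
    }
}
```

Either way the kernel sees id.x values from 0 to 65,535; only the split between groups and threads per group changes.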
     
  3. Cherubim79

    Joined: May 3, 2014 | Posts: 56

    I changed up my kernel entirely, because the end goal is image comparison. For every 9x16 block of my input I want to match it to every 9x16 block of my font image, and find the best score. I've been playing around with buffers just to see what kind of data I'm getting back, rather than messing with RenderTextures because if I'm calling my program on Start() I'm getting back some weird results sometimes. It's caused me a couple of reboots on this machine.

    So basically I wanted to try with 1 thread and see what I get back from a buffer: an array telling me which x/y 9x16 block of my font image best matches the current x/y 9x16 block of my source image.

    (Sorry for the comments, you can see I've played around with about a million things today by this point.)

    ComputeScript.cs
    Code (csharp):
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using UnityEngine;

    public class ComputeScript : MonoBehaviour {

        public ComputeShader compute;
        public RenderTexture result;
        public RenderTexture resultCalculated;
        public RenderTexture fontTexture;
        public Texture2D inputFontTexture;
        public Texture2D inputImage;
        public Vector3[] outputVectors;

        public UnityEngine.UI.RawImage uiimage;

        // Use this for initialization
        void Start () {
            GetCalculation();
        }

        // Update is called once per frame
        void Update () {

        }

        void GetCalculation() {

            inputImage = Resources.Load<Texture2D>("inputimage");
            int width = inputImage.width;
            int height = inputImage.height;
            int newWidth = 720;
            float scaleFactor = ((float)newWidth / (float)width);
            int newHeight = (int)((float)height * scaleFactor);

            Texture2D newTexture = new Texture2D(width, height, TextureFormat.RGB24, false);
            Color[] pixels = inputImage.GetPixels(0, 0, width, height);
            newTexture.SetPixels(0, 0, width, height, pixels);
            newTexture.Apply();

            TextureScaler.scale(newTexture, newWidth, newHeight, FilterMode.Point);

            inputFontTexture = Resources.Load<Texture2D>("vgafont");
            int fontWidth = inputFontTexture.width;
            int fontHeight = inputFontTexture.height;

            //result = new RenderTexture(fontWidth, fontHeight, 0, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Default);
            //result.enableRandomWrite = true;
            //result.Create();
            resultCalculated = new RenderTexture(newWidth, newHeight, 0, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Default);
            resultCalculated.enableRandomWrite = true;
            resultCalculated.Create();

            int totalRows = newHeight / 16;
            int totalColumns = newWidth / 9;

            int kernelCalculate = compute.FindKernel("CSCalculate");
            compute.SetTexture(kernelCalculate, "ImageInput", newTexture);
            compute.SetTexture(kernelCalculate, "FontInput", inputFontTexture);
            //compute.SetTexture(kernelCalculate, "Result", result);
            //compute.SetTexture(kernelCalculate, "ResultCalculated", resultCalculated);
            compute.SetInt("TotalRows", totalRows);
            compute.SetInt("TotalColumns", totalColumns);

            ComputeBuffer buffer = new ComputeBuffer(1, sizeof(float) * 3 * totalRows * totalColumns);
            compute.SetBuffer(kernelCalculate, "ResultDistances", buffer);

            compute.Dispatch(kernelCalculate, 1, 1, 1);

            float[] bufferData = new float[3 * totalRows * totalColumns];
            Vector3[] colors = new Vector3[totalRows * totalColumns];
            buffer.GetData(bufferData);
            for (int z = 0; z < bufferData.Length / 3; z++) {
                int offset = z * 3;
                Vector3 vec = new Vector3(bufferData[offset+0], bufferData[offset+1], bufferData[offset+2]);
                colors[z] = vec;
                Debug.Log(vec.ToString());
            }
            buffer.Release();
            this.outputVectors = colors;
            //newTexture = new Texture2D(totalRows, totalColumns, TextureFormat.ARGB32, false);
            //newTexture.SetPixels(colors);
            //newTexture.Apply();
            //uiimage.texture = newTexture;

            //RenderTexture.active = result;
            //newTexture.ReadPixels(new Rect(0,0,newWidth,newHeight), 0, 0);
            //newTexture.Apply();
            //uiimage.texture = newTexture;


            //int kernel = compute.FindKernel("CSMain");

            //compute.SetTexture(kernel, "Result", result);
            //compute.SetTexture(kernel, "ImageInput", newTexture);
            //compute.SetTexture(kernel, "FontInput", inputFontTexture);
            //compute.SetInt("InputCol", 20);
            //compute.SetInt("InputRow", 10);
            //compute.Dispatch(kernel, fontWidth / 8, fontHeight / 8, 1);

            //int kernelProcess = compute.FindKernel("CSBestDistance");

            //compute.SetTexture(kernelProcess, "Result", result);

            //int bufferSize = 16 * 16 * 256;
            //ComputeBuffer buffer = new ComputeBuffer(bufferSize, sizeof(float));
            //compute.SetBuffer(kernelProcess, "ResultDistances", buffer);
            //compute.Dispatch(kernelProcess, bufferSize / 16, 1, 1);
            //float[] buffer2 = new float[bufferSize];
            //buffer.GetData(buffer2);
            ////Debug.Log(buffer2.Min());
            //for (int z = 0; z < bufferSize; z++)
            //    Debug.Log(buffer2[z]);
            //buffer.Dispose();
        }
    }

    Shader:
    Code (csharp):
    #pragma kernel CSCalculate

    //RWTexture2D<float3> Result;
    //RWTexture2D<float4> ResultCalculated;

    Texture2D<float3> ImageInput;
    Texture2D<float3> FontInput;

    int TotalRows;
    int TotalColumns;

    //RWTexture2D<float4> ResultDistances;
    RWStructuredBuffer<float3> ResultDistances;

    [numthreads(1,1,1)]
    void CSCalculate (uint3 id: SV_DispatchThreadID) {

        int InputCol = 0;
        int InputRow = 0;
        for (InputCol = 0;InputCol < TotalColumns;InputCol++) {
            for (InputRow = 0;InputRow < TotalRows;InputRow++) {

                int xoffset = InputCol * 9;
                int yoffset = InputRow * 16;
                int bestascii = 0;
                int bestfg = 0;
                int bestbg = 0;
                float bestdistance = 999999.0;

                int fontascii = 0;
                int fontbg = 0;
                int fontfg = 0;
                for (fontascii=0;fontascii<256;fontascii++) {
                    for (fontbg=0;fontbg<16;fontbg++) {
                        for (fontfg=0;fontfg<16;fontfg++) {

                            float sum = 0;

                            int k=0;
                            int j=0;
                            for (k=0;k<16;k++) {
                                for (j=0;j<9;j++) {
                                    //float2 imgoffset = float2(xoffset + j, yoffset + k);
                                    int2 imgoffset = int2(xoffset + j, yoffset + k);
                                    //int imgoffset = ((yoffset + k) * TotalRows);
                                    //imgoffset += (xoffset + j);
                                    float3 imageinput = ImageInput[imgoffset];
                                    int2 fontoffset = int2(fontascii * 9 + j, (fontbg * 256) + (fontfg * 16) + k);
                                    //float2 fontoffset = float2(fontascii * 9 + j, (fontbg * 256) + (fontfg * 16) + k);
                                    //int fontoffset = (fontbg * 256) + (fontfg * 16) + k;
                                    //fontoffset = fontoffset * 2304;
                                    fontoffset += (fontascii*9+j);
                                    float3 fontinput = FontInput[fontoffset];
                                    float r = pow((fontinput.r - imageinput.r),2);
                                    float g = pow((fontinput.g - imageinput.g),2);
                                    float b = pow((fontinput.b - imageinput.b),2);
                                    float dist = sqrt(r+g+b);
                                    //dist = imageinput.x;

                                    //dist = 1;

                                    //float dist = distance(imageinput, fontinput);
                                    sum += dist;
                                    //Result[fontoffset] = float4(dist,dist,dist,1);
                                }
                            }

                            if (sum < bestdistance) {
                                bestdistance = sum;
                                bestfg = fontfg;
                                bestbg = fontbg;
                                bestascii = fontascii;
                            }
                        }
                    }
                }

                int resultOffset = InputRow * TotalColumns + InputCol;
                //
                //ResultDistances[resultOffset] = float3(bestascii, bestfg, bestbg);
                ResultDistances[resultOffset] = float3(bestdistance, bestdistance, bestdistance);
                //ResultDistances[resultOffset] = float3(1,1,1);
            }
        }
    }

    With the code above, if I set ResultDistances[resultOffset] = float3(1,1,1); I get back a proper array of Vector3s all set to 1. But when I use the method above to write bestdistance, only the first element of my array has a number and the rest come back as Vector3s of 0.

    Are there any weird rules I should know about for's, if's, and accessing Texture2D<float3> data? I have one thread going in and I want to access all of the data in my input and font texture during processing. What makes this trickier is that the input could be variable (usually 720xY) and the font is 2304x4096 that I want to compare. Is that not possible with Compute Shaders?
     
  4. Dreamback

    Joined: Jul 29, 2016 | Posts: 220

    For your first try, I'm pretty sure the problem was that you didn't assign the Result texture to the second kernel until after calling the first kernel's Dispatch - the data was written while the texture was only bound to one kernel, so the other kernel didn't get it. I've shared data between two kernels and a shader on a material, and when I did so I assigned the buffer to all three before doing any Dispatches.

    For the second one, compute.Dispatch(kernelCalculate, 1, 1, 1); tells it to run the number of assigned threads once, and the [numthreads(1,1,1)] means it will only have 1 thread assigned, meaning it won't ever advance id; it'll only run the whole thing once. You aren't using id, but just mentioning, that's just weird :)

    The real problem in the second one is, as far as I can tell, in C# you are defining ResultDistances as an array of 1 single value, with that value having a size of (sizeof(float) * 3 * totalRows * totalColumns). The first value in the constructor is the number of pieces of data, the second is the size of each piece of data. When you tell a shader to write data into buffer[1], it advances the buffer address by the size to write the data there, and buffer[2] writes the data into buffer address plus (size * 2).

    But in the shader you are telling it that ResultDistances is an array of float3's. So when you do ResultDistances[resultOffset], your index will be out of range if resultOffset is anything but 0 since the array only has one value in it, just one really big value.
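
The count/stride split described above can be sketched in plain C#. This is a hedged illustration, not the poster's final code: the class name and the 720x720 input size are assumptions, and the Unity constructor call appears only in a comment.

```csharp
using System;

public static class BufferLayout
{
    // Bytes in one float3 element: three 32-bit floats.
    public const int Float3Stride = sizeof(float) * 3;

    // One float3 result per 9x16 block of the source image.
    public static int ElementCount(int totalRows, int totalColumns)
    {
        return totalRows * totalColumns;
    }

    public static void Main()
    {
        int totalRows = 720 / 16;   // 45, assuming a 720-pixel-high input
        int totalColumns = 720 / 9; // 80
        int count = ElementCount(totalRows, totalColumns);

        // In the Unity script this would become
        //   new ComputeBuffer(count, BufferLayout.Float3Stride)
        // instead of
        //   new ComputeBuffer(1, sizeof(float) * 3 * totalRows * totalColumns).
        Console.WriteLine("count=" + count + ", stride=" + Float3Stride
            + ", bytes=" + (count * Float3Stride));
    }
}
```

The total byte size is the same in both layouts; what changes is that the shader's ResultDistances[resultOffset] now indexes one of `count` elements rather than running past a single oversized one.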
     
    Last edited: Nov 10, 2017
  5. Cherubim79

    Joined: May 3, 2014 | Posts: 56

    You were exactly right on the Dispatch problem, I needed to set the buffers prior to the call so that they could pass RWStructuredBuffers back and forth. Thanks for that one Dreamback.

    The second problem has been massively difficult; I've spent the whole weekend trying to figure it out. I'm on a MacBook Pro using Metal, and it seems to have a number of limitations on my machine. You can't have more than 256 threads per thread group, and you can't have a struct with a size greater than 2048 bytes. This creates lots of problems because I'm doing an intensive amount of calculation. For every 9x16-pixel block in my source image, I'm checking against 9x * 16y * 16fg * 16bg * 256ascii pixels at a time (9,437,184 operations), finding the 9x16 font block with the least Euclidean distance to the input. My source image is a 720x720 texture composed of 80x45 9x16 blocks, so that becomes a total of 33,973,862,400 operations for one image.
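
Those operation counts can be reproduced with a few lines of plain C#. The class and method names are hypothetical; the numbers come straight from the post.

```csharp
using System;

public static class WorkEstimate
{
    // Pixel comparisons needed to test one 9x16 source block against every
    // glyph/foreground/background combination in the font sheet.
    public static long ComparisonsPerBlock()
    {
        return 9L * 16 * 16 * 16 * 256;
    }

    // Total comparisons for an image of the given size, split into 9x16 blocks.
    public static long TotalComparisons(int width, int height)
    {
        long blocks = (long)(width / 9) * (height / 16);
        return blocks * ComparisonsPerBlock();
    }

    public static void Main()
    {
        Console.WriteLine(ComparisonsPerBlock());      // 9437184
        Console.WriteLine(TotalComparisons(720, 720)); // 33973862400
    }
}
```

Note that the total does not fit in a 32-bit int, which is why the sketch uses long throughout.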

    Quite computationally expensive as you can see. I don't expect this to run in a second. :p It takes about 10 minutes to run one image on the version of my app I created in UWP/C# running on the CPU on an i5, but it's nicely threaded and I can call dispatching code to send any UI updates on the UI thread from within a running Task, unlike Unity unfortunately unless you know how to get around that one. I thought perhaps I'd try out GPU Compute Shaders in Unity when I was sold on those videos showing a million particles and thought it might be a panacea of computational development. It's great, no doubt, if you can figure out all of the limitations and what you're wanting to do hopefully fits within those boundaries or you can make it fit somehow.

    The problem you've noticed in my array which I'm seeing now (although I've already been through a few rewrites since then at this point) is the frustration I've had of trying to collapse a several dimension array down to one array and then reference it in its collapsed state. That can be very error prone. So I've tried to figure out ways to chunk my operations down to less sizable amounts. Interestingly enough the sum of distances goes fine but the "if" statement around checking which sum is the lowest, which is the beef of the program, causes inaccurate results. It may just be silently crashing on my GPU without any debug information whatsoever, I'm not sure and don't know how to test for that. I think the GPU also has a timeout for operations that I can't seem to figure out how to change. I want to run some heavy math on the GPU once in a while and go, it could have lots of applications outside of just gaming.

    Even still, let's chunk it down to one 9x16 source block. On your comparison you still need to compare to the same 9*16*16*16*256 pixels for one pass. Yes, a program for creating text-mode ANSI Art can bring a modern GPU down to its knees apparently, you've seen it right here, welcome to the Twilight Zone. (No doubt, it's brought me into the Twilight Zone over the weekend trying to look through scant documentation that doesn't cover all the things I need to know.) Any bit of mathematical computational gain I might have in GPU calculation is offset by coding and calculation time of trying to mess with one dimensional offsets that may or may not cause me problems either because of my own goof ups in figuring out the correct offsets (which I've seemed to correct at this point with some testing and still got back inaccurate results as explained above), or GPU silent errors or timeouts and what not that I can't debug.

    I thought maybe if I could change out for some RWStructuredBuffer<mystruct> where mystruct gives the data type that would allow me to access multidimensional arrays from within my kernel and everything would be great, but that's not the case. 2048 struct size limitation on my machine, and a float takes up 4 bytes at 32-bit, with a float4 at 16 bytes, that makes even a 9x16 block 2304 bytes which is over the limit and causes an error.
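
The size check described above is simple enough to sketch in plain C#. The 2048-byte figure is the limit reported in this post, not a value taken from any API, and the class name is hypothetical.

```csharp
using System;

public static class StructSizeCheck
{
    public const int Float4Bytes = sizeof(float) * 4; // 16 bytes
    public const int StructLimitBytes = 2048;         // limit reported in the post

    // Bytes needed to hold one float4 per pixel of a block.
    public static int BlockBytes(int blockWidth, int blockHeight)
    {
        return blockWidth * blockHeight * Float4Bytes;
    }

    public static void Main()
    {
        int bytes = BlockBytes(9, 16);
        Console.WriteLine(bytes);                    // 2304
        Console.WriteLine(bytes > StructLimitBytes); // True
    }
}
```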

    The big thing's the "if" statements though. I've been reading that you need to avoid "ifs" in your shader code if at all possible. The ifs are the very most important part here though. I could run those on the CPU sure with LINQ statements I suppose, but then I'm sure I wouldn't gain all that much benefit off of trying to run anything against the GPU. The CPU to GPU movements and back are bottlenecks themselves.

    It seems like a long way to go, and many cigarettes smoked, just to hit dead ends and no longer feel confident about Compute Shaders for anything but much, much simpler projects. On the funny side, I managed to Black Screen of Death my MacBook Pro a couple of times by accidentally writing to unprotected GPU memory.
     
  6. Zolden

    Joined: May 9, 2014 | Posts: 141

    Is there a place where all these Metal limitations are listed? Or did you have to run lots of tests at random until it worked?

    Also, it's strange that the Unity compiler doesn't report when shader code exceeds Metal limits. When I tried to build for Vulkan, the compiler actually reported what was wrong.
     
  7. Cherubim79

    Joined: May 3, 2014 | Posts: 56

    It looks like that's listed here, it may be HLSL specific and not Metal specific I suppose. I'm still trying to gather all this information, and I had to figure that out by trial and error but then looked it up.

    https://msdn.microsoft.com/en-us/library/windows/desktop/ff819065(v=vs.85).aspx

    Comically, I managed to figure out the problem right after posting the above, after one more rewrite of the code, the problem was in how I was accessing the array.

    A little tip for anyone starting out with Compute Shaders: if you're having trouble collapsing a 5D or so array down to 1D, use a spreadsheet (Numbers, MS Excel, Calc, etc.) to figure out the permutations; then it's easier to write those array offsets down. I then suggest putting a helper like this one in your compute shader code, which just makes things a tad easier:

    uint GetDistanceOffset(uint col, uint fg, uint bg, uint ascii) {
        return (col * 16 * 16 * 8) + (fg * 16 * 8) + (bg * 8) + ascii;
    }

    I could optimize that further with bit shifting I think. I'll have to test and see.
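
As a rough check on the bit-shifting idea, here is a C# port of the helper alongside a shifted version. The multipliers are powers of two (16 * 16 * 8 = 2^11, 16 * 8 = 2^7, 8 = 2^3), so the two forms compute the same value. The class and method names are hypothetical, and the loop ranges in Main (80 columns, ascii below 8) are assumptions chosen only for the comparison.

```csharp
using System;

public static class DistanceOffset
{
    // Direct port of the multiplication-based helper from the post.
    public static uint Multiply(uint col, uint fg, uint bg, uint ascii)
    {
        return (col * 16 * 16 * 8) + (fg * 16 * 8) + (bg * 8) + ascii;
    }

    // Same arithmetic with shifts in place of the constant multiplications.
    public static uint Shift(uint col, uint fg, uint bg, uint ascii)
    {
        return (col << 11) + (fg << 7) + (bg << 3) + ascii;
    }

    public static void Main()
    {
        // Exhaustively compare both versions over a small range of inputs.
        bool same = true;
        for (uint col = 0; col < 80; col++)
            for (uint fg = 0; fg < 16; fg++)
                for (uint bg = 0; bg < 16; bg++)
                    for (uint ascii = 0; ascii < 8; ascii++)
                        same &= Multiply(col, fg, bg, ascii) == Shift(col, fg, bg, ascii);

        Console.WriteLine(same); // True
    }
}
```

Whether the shifted form is actually faster on the GPU would need measuring, since a shader compiler may well apply the same transformation to constant power-of-two multiplications on its own.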

    Many times I'll also need access to an element of an RWTexture2D<float4> by something other than id.xy. Let's say I need to run many operations on many pixels for one thread of input. You can say something like:

    mytexture[int2(offsetx, offsety)] = float4(1,1,1,1);

    Works perfectly, id.x and id.y just let you figure out which thread is running.



    Now that I've got my Image2Ansi converter working fabulously and it runs in seconds rather than minutes, I need to move on to the pre-dithering phase. That one's a little tricky, I'm using a Jarvis-Judice-Ninke dithering algorithm on the original CPU version to get from 24-bit color dithered down to 16 EGA colors. It has to do a lot of array bounds checking.
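
For readers unfamiliar with it, Jarvis-Judice-Ninke error diffusion and its bounds checking can be sketched on the CPU as below. This is a simplified illustration under stated assumptions, not the converter's code: it dithers a single grayscale channel to black and white rather than 24-bit color to 16 EGA colors, and all names are hypothetical.

```csharp
using System;

public static class JarvisDither
{
    // Jarvis-Judice-Ninke weights for the current row and the two rows below,
    // indexed [dy, dx + 2]; the weights sum to 48.
    static readonly int[,] Weights =
    {
        { 0, 0, 0, 7, 5 },
        { 3, 5, 7, 5, 3 },
        { 1, 3, 5, 3, 1 },
    };

    // Dithers a grayscale image (values in 0..1, row-major) to pure 0 or 1.
    public static float[] Dither(float[] source, int width, int height)
    {
        float[] work = (float[])source.Clone();
        float[] output = new float[source.Length];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                float oldValue = work[y * width + x];
                float newValue = oldValue < 0.5f ? 0f : 1f;
                output[y * width + x] = newValue;
                float error = oldValue - newValue;

                // Spread the quantization error over not-yet-visited neighbours.
                for (int dy = 0; dy < 3; dy++)
                {
                    for (int dx = -2; dx <= 2; dx++)
                    {
                        int weight = Weights[dy, dx + 2];
                        int nx = x + dx;
                        int ny = y + dy;
                        // Bounds check: skip neighbours outside the image.
                        if (weight == 0 || nx < 0 || nx >= width || ny >= height)
                            continue;
                        work[ny * width + nx] += error * weight / 48f;
                    }
                }
            }
        }
        return output;
    }

    public static void Main()
    {
        // A 4x4 mid-gray image dithers to a mix of black and white pixels.
        float[] gray = new float[16];
        for (int i = 0; i < gray.Length; i++) gray[i] = 0.5f;

        float[] result = Dither(gray, 4, 4);
        for (int y = 0; y < 4; y++)
        {
            for (int x = 0; x < 4; x++) Console.Write(result[y * 4 + x] + " ");
            Console.WriteLine();
        }
    }
}
```

Because each pixel depends on error pushed from pixels processed before it, the algorithm is sequential in this form, which is part of what makes it tricky to move to a compute shader.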
     