Search Unity

  1. Unity 2019.2 is now released.
    Dismiss Notice

Parallel prefix sum ComputeShader

Discussion in 'Shaders' started by cecarlsen, Feb 20, 2018.

  1. cecarlsen

    cecarlsen

    Joined:
    Jun 30, 2006
    Posts:
    527
    I am attempting to implement "Fast Fixed Radius Nearest Neighbours", a method presented by Nvidia in 2013.
    http://on-demand.gputechconf.com/gt...17-fast-fixed-radius-nearest-neighbor-gpu.pdf

    The example pointed to is Fluid v3, which is written in CUDA. It contains an implementation (copyrighted by Nvidia) of a parallel prefix sum algorithm.
    https://github.com/rchoetzlein/fluids3/blob/master/fluids/prefix_sum.cu

    Prefix sum is also called "prefix scan". It takes an array like { 1, 2, 1, 2 } and outputs the accumulative sum from left to right either exclusive { 0, 1, 3, 4 } or inclusive { 1, 3, 4, 6 }.
    https://en.wikipedia.org/wiki/Prefix_sum
    http://www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf

    It seems like a very basic algorithm for parallel computing, so I am surprised that I can't find a standard implementation for DirectCompute anywhere. I wonder if anyone in the Unity community has already overcome this obstacle.

    EDIT:

    I found a HLSL implementation by Takahiro Harada from 2011. Note that this implementation is limited to 524.288 elements (groupshared size at 2048).
    http://www.heterogeneouscompute.org/?page_id=7
    https://github.com/erwincoumans/exp...ves/AdlPrimitives/Scan/PrefixScanKernels.hlsl

    And this one by MS, limited to 16384 elements (which could probably be increased by upping the groupshared memory from 128).
    https://github.com/walbourn/directx-sdk-samples/blob/master/AdaptiveTessellationCS40/ScanCS.hlsl

    I've been trying to port both, but no luck so far.
     
    Last edited: Feb 20, 2018
  2. customphase

    customphase

    Joined:
    Aug 19, 2012
    Posts:
    159
    Also stumbled into that, ended up just bruteforcing it with loop in one thread. It worked fine for me, but i only had 32k buckets, probably wont scale well with more than that
     
    cecarlsen likes this.
  3. scrawk

    scrawk

    Joined:
    Nov 22, 2012
    Posts:
    765
    What part is causing you issues exactly?

    The two hlsl shaders look like they should work in unity with minimal changes.
     
  4. cecarlsen

    cecarlsen

    Joined:
    Jun 30, 2006
    Posts:
    527
    I just got ScanCS by MS working. Takahiro's version kept crashing Unity, and I still don't understand why. The most difficult part was to translate how the "shader resources views" (StructuredBuffer) and "unordered access views" (RWStructuredBuffer) were set from this host file:
    https://github.com/walbourn/directx-sdk-samples/blob/master/AdaptiveTessellationCS40/ScanCS.cpp

    That ended up looking like this:

    Code (CSharp):
    1. // Compute how many thread groups we will need.
    2. int threadGroupCount = count / threadsPerGroup;
    3.  
    4. // ScanInBucket.
    5. int scanInBucketKernel = exclusive ? _scanInBucketExclusiveKernel : _scanInBucketInclusiveKernel;
    6. _computeShader.SetBuffer( scanInBucketKernel, _inputPropId, countBuffer.computeBuffer );
    7. _computeShader.SetBuffer( scanInBucketKernel, _resultPropId, resultBuffer.computeBuffer );
    8. _computeShader.Dispatch( scanInBucketKernel, threadGroupCount, 1, 1 );
    9.  
    10. // ScanBucketResult.
    11. _computeShader.SetBuffer( _scanBucketResultKernel, _inputPropId, resultBuffer.computeBuffer );
    12. _computeShader.SetBuffer( _scanBucketResultKernel, _resultPropId, _auxBuffer.computeBuffer );
    13. _computeShader.Dispatch( _scanBucketResultKernel, 1, 1, 1 );
    14.  
    15. // ScanAddBucketResult.
    16. _computeShader.SetBuffer( _scanAddBucketResultKernel, _inputPropId, _auxBuffer.computeBuffer );
    17. _computeShader.SetBuffer( _scanAddBucketResultKernel, _resultPropId, resultBuffer.computeBuffer );
    18. _computeShader.Dispatch( _scanAddBucketResultKernel, threadGroupCount, 1, 1 );

    To make it an "exclusive scan", I added a reading offset in the first kernel:

    Code (CSharp):
    1. [numthreads( THREADS_PER_GROUP, 1, 1 )]
    2. void ScanInBucketExclusive( uint DTid : SV_DispatchThreadID, uint GI: SV_GroupIndex ) // CSScanInBucket
    3. {
    4.     uint x = DTid == 0 ? 0 : _Input[ DTid-1 ];
    5.     CSScan( DTid, GI, x );
    6. }
    BTW @scrawk, thank you for the shader tutorials you once uploaded, they where a great resources when I was starting to get into all this.
     
    Last edited: Feb 22, 2018
    marcell_h likes this.
  5. cecarlsen

    cecarlsen

    Joined:
    Jun 30, 2006
    Posts:
    527
    I was asked to share the full code, so here goes. Luckily, it was MIT licensed in the first place.

    Code (CSharp):
    1. /*
    2.     The MIT License (MIT)
    3.  
    4.     Copyright (c) 2004-2019 Microsoft Corp
    5.     Modified by Carl Emil Carlsen 2018.
    6.  
    7.     Permission is hereby granted, free of charge, to any person obtaining a copy of this
    8.     software and associated documentation files (the "Software"), to deal in the Software
    9.     without restriction, including without limitation the rights to use, copy, modify,
    10.     merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
    11.     permit persons to whom the Software is furnished to do so, subject to the following
    12.     conditions:
    13.  
    14.     The above copyright notice and this permission notice shall be included in all copies
    15.     or substantial portions of the Software.
    16.  
    17.     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
    18.     INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
    19.     PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
    20.     HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
    21.     CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
    22.     OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
    23.      
    24.     From directx-sdk-samples by Chuck Walbourn:
    25.     https://github.com/walbourn/directx-sdk-samples/blob/master/AdaptiveTessellationCS40/ScanCS.hlsl
    26. */
    27.  
    28. #pragma kernel ScanInBucketInclusive
    29. #pragma kernel ScanInBucketExclusive
    30. #pragma kernel ScanBucketResult
    31. #pragma kernel ScanAddBucketResult
    32.  
    33. #define THREADS_PER_GROUP 512 // Ensure that this equals the 'threadsPerGroup' const in the host script.
    34.  
    35. StructuredBuffer<uint> _Input;
    36. RWStructuredBuffer<uint> _Result;
    37.  
    38. groupshared uint2 bucket[THREADS_PER_GROUP];
    39.  
    40. void CSScan( uint3 DTid, uint GI, uint x )
    41. {
    42.     // since CS40 can only support one shared memory for one shader, we use .xy and .zw as ping-ponging buffers
    43.     // if scan a single element type like int, search and replace all .xy to .x and .zw to .y below
    44.     bucket[GI].x = x;
    45.     bucket[GI].y = 0;
    46.  
    47.     // Up sweep  
    48.     [unroll]
    49.     for( uint stride = 2; stride <= THREADS_PER_GROUP; stride <<= 1 )
    50.     {
    51.         GroupMemoryBarrierWithGroupSync();
    52.         if ( (GI & (stride - 1)) == (stride - 1) ) bucket[GI].x += bucket[GI - stride/2].x;
    53.     }
    54.  
    55.     if( GI == (THREADS_PER_GROUP - 1) ) bucket[GI].x = 0;
    56.  
    57.     // Down sweep
    58.     bool n = true;
    59.     [unroll]
    60.     for( stride = THREADS_PER_GROUP / 2; stride >= 1; stride >>= 1 )
    61.     {
    62.         GroupMemoryBarrierWithGroupSync();
    63.  
    64.         uint a = stride - 1;
    65.         uint b = stride | a;
    66.  
    67.         if( n )        // ping-pong between passes
    68.         {
    69.             if( ( GI & b) == b )
    70.             {
    71.                 bucket[GI].y = bucket[GI-stride].x + bucket[GI].x;
    72.             } else
    73.             if( (GI & a) == a )
    74.             {
    75.                 bucket[GI].y = bucket[GI+stride].x;
    76.             } else      
    77.             {
    78.                 bucket[GI].y = bucket[GI].x;
    79.             }
    80.         } else {
    81.             if( ( GI & b) == b )
    82.             {
    83.                 bucket[GI].x = bucket[GI-stride].y + bucket[GI].y;
    84.             } else
    85.             if( (GI & a) == a )
    86.             {
    87.                 bucket[GI].x = bucket[GI+stride].y;
    88.             } else      
    89.             {
    90.                 bucket[GI].x = bucket[GI].y;
    91.             }
    92.         }
    93.      
    94.         n = !n;
    95.     }  
    96.    
    97.     _Result[DTid.x] = bucket[GI].y + x;
    98. }
    99.  
    100.  
    101. // Scan in each bucket.
    102. [numthreads( THREADS_PER_GROUP, 1, 1 )]
    103. void ScanInBucketInclusive( uint DTid : SV_DispatchThreadID, uint GI: SV_GroupIndex ) // CSScanInBucket
    104. {
    105.     uint x = _Input[DTid];
    106.     CSScan( DTid, GI, x );
    107. }
    108.  
    109. // Scan in each bucket.
    110. [numthreads( THREADS_PER_GROUP, 1, 1 )]
    111. void ScanInBucketExclusive( uint DTid : SV_DispatchThreadID, uint GI: SV_GroupIndex ) // CSScanInBucket
    112. {
    113.     uint x = DTid == 0 ? 0 : _Input[ DTid-1 ];
    114.     CSScan( DTid, GI, x );
    115. }
    116.  
    117.  
    118. // Record and scan the sum of each bucket.
    119. [numthreads( THREADS_PER_GROUP, 1, 1 )]
    120. void ScanBucketResult( uint DTid : SV_DispatchThreadID, uint GI: SV_GroupIndex )
    121. {
    122.     uint x = _Input[DTid*THREADS_PER_GROUP - 1];
    123.     CSScan( DTid, GI, x );
    124. }
    125.  
    126.  
    127. // Add the bucket scanned result to each bucket to get the final result.
    128. [numthreads( THREADS_PER_GROUP, 1, 1 )]
    129. void ScanAddBucketResult( uint Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID )
    130. {
    131.     _Result[DTid.x] = _Result[DTid.x] + _Input[Gid];
    132. }
     
  6. Mytino

    Mytino

    Joined:
    Jan 28, 2019
    Posts:
    5
    Edit 2: Never mind! Somehow I thought the csharp code you posted earlier in the thread was different ways to use the shader and not a sequence. It works now! And it's super fast :) Thanks again!

    Thank you so much for posting this! I've been using it for my SPH simulation based on the same talk as you. However, I've been wondering. Is this code limited to a compute buffer with THREADS_PER_GROUP elements? GPUs won't allow you to set that higher than 1024, and my grid search can have more than 1024 elements (in the count buffer with one count per grid cell). A 10x10x10 grid is already 1000 elements, and I assume it's usually much bigger. Do I have to split the buffer up? Or is there something I'm missing? It's working perfectly when the element count is less than or equal to THREADS_PER_GROUP.

    Edit: Here's my code for dispatch. Do I have to use any of the other kernels as well? (Btw, I changed the naming of _Input to InputBuffer and _Result to ResultBuffer)
    Code (CSharp):
    1. kernelIndex = scanCs.FindKernel("ScanInBucketExclusive");
    2. scanCs.SetBuffer(kernelIndex, "InputBuffer", countBuffer);
    3. scanCs.SetBuffer(kernelIndex, "ResultBuffer", minIndexInCellBuffer);
    4. scanCs.Dispatch(kernelIndex, gridThreadGroups.x, gridThreadGroups.y, gridThreadGroups.z);
     
    Last edited: Oct 10, 2019
    cecarlsen likes this.