Burst loop vectorizing support for noise

Discussion in 'Burst' started by jaydenm, Dec 24, 2020.

  1. jaydenm (Joined: Nov 16, 2018; Posts: 11)
    Hi there,

    I am trying to improve the performance of my terrain generation by using vectorized loops with Burst where possible. I've found the function Unity.Burst.CompilerServices.Loop.ExpectVectorized(), which lets me assert at compile time that a loop is being vectorized as expected, and that is working great.

    Using that assertion, I've narrowed down the non-vectorizable code to my calls to noise.snoise(float2) from the Unity.Mathematics package.
    Here is an example of what I am doing to generate the noise values:

    Code (CSharp):
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void GenerateNoise(float* valuesPtr, int numValues)
    {
        for (var i = 0; i < numValues; i++)
        {
            // Compile-time assertion that Burst vectorizes this loop.
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();

            valuesPtr[i] = noise.snoise(new float2(i, 1));
        }
    }
    Doing something simple in that loop, like valuesPtr[i] = i;, definitely passes the assertion and runs as expected.

    I've pulled the source for the snoise function and started to inline it, which was mostly working. Before I go to all of the effort to finish that, though, I wanted to ask for people's thoughts:
    1. Should the noise.snoise calls be vectorizable by the Burst compiler? Am I missing anything here, or doing anything wrong that is stopping it from working?
    2. Is it worth heading down the path of inlining the noise code or finding alternatives that support vectorization? Or is this a dead-end street that isn't possible for some reason?
    Thanks
     
  2. DreamingImLatios (Joined: Jun 3, 2017; Posts: 3,983)
    The implementation uses types float2, float3, and float4. Those types typically break Burst auto-vectorization and sort of enter a "manual vectorization" mode. It is possible to write a vectorized version of this code, but it is not trivial.
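
    To illustrate what "manual vectorization" with these types can look like, here is a minimal sketch (not code from this thread). Each float4 carries the same quantity for four independent samples, so one expression advances four values at once. The names ManualLanes and PermuteAll are hypothetical, and the Mod289/Permute bodies are assumed to mirror the helpers that noise.snoise relies on. It needs the Unity.Mathematics package and unsafe code enabled.

    ```csharp
    using Unity.Mathematics;

    public static class ManualLanes
    {
        // Scalar helpers, used only for the leftover elements at the end.
        static float Mod289(float x) => x - math.floor(x * (1.0f / 289.0f)) * 289.0f;
        static float Permute(float x) => Mod289((34.0f * x + 1.0f) * x);

        // Same math on float4: lane k belongs to sample i + k.
        static float4 Mod289(float4 x) => x - math.floor(x * (1.0f / 289.0f)) * 289.0f;
        static float4 Permute(float4 x) => Mod289((34.0f * x + 1.0f) * x);

        // Hypothetical example: applies Permute in place to every element.
        public static unsafe void PermuteAll(float* values, int count)
        {
            int i = 0;

            // Main loop: four samples per iteration, one per float4 lane.
            for (; i + 4 <= count; i += 4)
            {
                var v = new float4(values[i], values[i + 1], values[i + 2], values[i + 3]);
                v = Permute(v);
                values[i] = v.x;
                values[i + 1] = v.y;
                values[i + 2] = v.z;
                values[i + 3] = v.w;
            }

            // Tail loop: fewer than four samples remain.
            for (; i < count; i++)
                values[i] = Permute(values[i]);
        }
    }
    ```

    The point of the layout is that no lane ever reads another lane's data, which is different from how snoise uses float2/float3/float4 to hold several components of a single sample.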
     
  3. Lieene-Guo (Joined: Aug 20, 2013; Posts: 547)
    In fact, the function is not simple at all.
    noise.snoise(new float2(i, 1))
    is either inlined (expanded) or invoked as a call.
    A vectorizable loop cannot have any jump/branch in its execution, and snoise contains branches, so in both cases it is non-vectorizable.
    EDIT: Having reviewed the source, snoise does not contain a branch, but it is just too complicated.
    Code (CSharp):
    public static float snoise(float2 v)
    {
        float4 C = float4(0.211324865405187f,  // (3.0-math.sqrt(3.0))/6.0
                          0.366025403784439f,  // 0.5*(math.sqrt(3.0)-1.0)
                         -0.577350269189626f,  // -1.0 + 2.0 * C.x
                          0.024390243902439f); // 1.0 / 41.0
        // First corner
        float2 i = floor(v + dot(v, C.yy));
        float2 x0 = v - i + dot(i, C.xx);

        // Other corners
        float2 i1;
        //i1.x = math.step( x0.y, x0.x ); // x0.x > x0.y ? 1.0 : 0.0
        //i1.y = 1.0 - i1.x;
        i1 = (x0.x > x0.y) ? float2(1.0f, 0.0f) : float2(0.0f, 1.0f);
        // x0 = x0 - 0.0 + 0.0 * C.xx ;
        // x1 = x0 - i1 + 1.0 * C.xx ;
        // x2 = x0 - 1.0 + 2.0 * C.xx ;
        float4 x12 = x0.xyxy + C.xxzz;
        x12.xy -= i1;

        // Permutations
        i = mod289(i); // Avoid truncation effects in permutation
        float3 p = permute(permute(i.y + float3(0.0f, i1.y, 1.0f)) + i.x + float3(0.0f, i1.x, 1.0f));

        float3 m = max(0.5f - float3(dot(x0, x0), dot(x12.xy, x12.xy), dot(x12.zw, x12.zw)), 0.0f);
        m = m * m;
        m = m * m;

        // Gradients: 41 points uniformly over a line, mapped onto a diamond.
        // The ring size 17*17 = 289 is close to a multiple of 41 (41*7 = 287)

        float3 x = 2.0f * frac(p * C.www) - 1.0f;
        float3 h = abs(x) - 0.5f;
        float3 ox = floor(x + 0.5f);
        float3 a0 = x - ox;

        // Normalise gradients implicitly by scaling m
        // Approximation of: m *= inversemath.sqrt( a0*a0 + h*h );
        m *= 1.79284291400159f - 0.85373472095314f * (a0 * a0 + h * h);

        // Compute final noise value at P

        float  gx = a0.x * x0.x + h.x * x0.y;
        float2 gyz = a0.yz * x12.xz + h.yz * x12.yw;
        float3 g = float3(gx, gyz);

        return 130.0f * dot(m, g);
    }
     
    Last edited: Dec 24, 2020
  4. jaydenm (Joined: Nov 16, 2018; Posts: 11)
    Thanks for the quick replies. Do you know of any alternative noise functions, libraries, or algorithms that support vectorization?
    At the scale I'm trying to generate for, this noise function has a noticeable performance impact.
     
  5. Lieene-Guo (Joined: Aug 20, 2013; Posts: 547)
    This is pretty much the standard noise from GLSL.
    If you want it to run faster, consider generating a noise buffer up front and accessing it by index. It's the same technique as a noise map.
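
    A minimal sketch of that noise-buffer idea, assuming a square table sampled on a regular grid. The names NoiseTable, Build, and Sample and the frequency parameter are hypothetical. It needs the Unity.Mathematics and Unity.Collections APIs.

    ```csharp
    using Unity.Collections;
    using Unity.Mathematics;

    public static class NoiseTable
    {
        // Hypothetical helper: fills a size x size table once, up front.
        // The caller owns the returned array and must Dispose() it.
        public static NativeArray<float> Build(int size, float frequency, Allocator allocator)
        {
            var table = new NativeArray<float>(size * size, allocator,
                                               NativeArrayOptions.UninitializedMemory);
            for (int y = 0; y < size; y++)
            {
                for (int x = 0; x < size; x++)
                {
                    table[y * size + x] = noise.snoise(new float2(x, y) * frequency);
                }
            }
            return table;
        }

        // Nearest-sample lookup with wrap-around.
        // Assumes size is a power of two so that masking wraps the index.
        public static float Sample(NativeArray<float> table, int size, int x, int y)
        {
            int mask = size - 1;
            return table[(y & mask) * size + (x & mask)];
        }
    }
    ```

    The lookup itself is a plain indexed read, which is the kind of loop body the first post reports as vectorizing without trouble. The trade-offs are memory use and a fixed resolution, and because this table is not built to tile, wrapped lookups will show a seam at the table edges.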
     
  6. DreamingImLatios (Joined: Jun 3, 2017; Posts: 3,983)
    It is difficult to suggest something because we don't know how the noise is being used and which properties of this noise are strongly desired. If it were me and a 3-4x improvement was worth a few hours of effort, I would do the following:
    1) Convert the float2, float3, and float4 values into multiple floats, such as float2 v becoming float v_x and float v_y.
    2) Reimplement the function calls for dot products and the permute function.
    3) Change the conditional operator to use math.select.
    4) Check if autovectorization works. If not, continue.
    5) Change v_x and v_y to be float4 instead of float.
    6) Change local variables to also be float4 wherever the compiler complains about assigning a float4 to a float.
    7) Modify your call site appropriately.
    8) Benchmark.
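
    Steps 1 to 3 above could be sketched roughly as follows. This is an illustration derived from the snoise source quoted earlier in the thread, not a tested drop-in replacement. The names ScalarNoise, SNoise, and Corner are hypothetical, and Mod289/Permute are assumed to mirror the library's helpers. The result should match noise.snoise up to floating-point rounding, but that has not been verified here, and whether Burst then vectorizes the loop still has to be checked (step 4).

    ```csharp
    using Unity.Mathematics;

    public static class ScalarNoise
    {
        const float Cx = 0.211324865405187f;  // (3 - sqrt(3)) / 6
        const float Cy = 0.366025403784439f;  // 0.5 * (sqrt(3) - 1)
        const float Cz = -0.577350269189626f; // -1 + 2 * Cx
        const float Cw = 0.024390243902439f;  // 1 / 41

        // Step 2: scalar versions of the helper functions.
        static float Mod289(float x) => x - math.floor(x * (1.0f / 289.0f)) * 289.0f;
        static float Permute(float x) => Mod289((34.0f * x + 1.0f) * x);

        // Contribution of one simplex corner; (dx, dy) is the offset to that corner.
        static float Corner(float p, float dx, float dy)
        {
            float m = math.max(0.5f - (dx * dx + dy * dy), 0.0f);
            m *= m;
            m *= m;
            float x = 2.0f * math.frac(p * Cw) - 1.0f;
            float h = math.abs(x) - 0.5f;
            float a0 = x - math.floor(x + 0.5f);
            m *= 1.79284291400159f - 0.85373472095314f * (a0 * a0 + h * h);
            return m * (a0 * dx + h * dy);
        }

        // Step 1: float2 v becomes two floats, vx and vy.
        public static float SNoise(float vx, float vy)
        {
            // First corner: dot(v, C.yy) and dot(i, C.xx) written out per component.
            float s = vx * Cy + vy * Cy;
            float ix = math.floor(vx + s);
            float iy = math.floor(vy + s);
            float t = ix * Cx + iy * Cx;
            float x0x = vx - ix + t;
            float x0y = vy - iy + t;

            // Step 3: math.select replaces the conditional operator.
            float i1x = math.select(0.0f, 1.0f, x0x > x0y);
            float i1y = 1.0f - i1x;

            // Permutations, one scalar per corner instead of one float3.
            ix = Mod289(ix);
            iy = Mod289(iy);
            float p0 = Permute(Permute(iy) + ix);
            float p1 = Permute(Permute(iy + i1y) + ix + i1x);
            float p2 = Permute(Permute(iy + 1.0f) + ix + 1.0f);

            float n0 = Corner(p0, x0x, x0y);
            float n1 = Corner(p1, x0x + Cx - i1x, x0y + Cx - i1y);
            float n2 = Corner(p2, x0x + Cz, x0y + Cz);
            return 130.0f * (n0 + n1 + n2);
        }

        // Same loop as in the first post, calling the scalar version.
        public static unsafe void GenerateNoise(float* valuesPtr, int numValues)
        {
            for (var i = 0; i < numValues; i++)
                valuesPtr[i] = SNoise(i, 1.0f);
        }
    }
    ```

    For steps 5 and 6, the same bodies would take float4 for vx and vy and for every local, so that each lane holds one sample.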
     
  7. jaydenm (Joined: Nov 16, 2018; Posts: 11)
    Thanks for the detailed step by step, that will definitely help going down that path.

    Is there a good place to see why the compiler chose not to vectorize the code? The ExpectVectorized assertion is helpful, but it doesn't indicate which specific code triggered it, or why, which leaves a process of elimination to find the culprit.
     
  8. DreamingImLatios (Joined: Jun 3, 2017; Posts: 3,983)
    It's under the "LLVM IR Optimisation Diagnostics" tab in the Burst Inspector. The messages can be pretty cryptic at times.
     
  9. jaydenm (Joined: Nov 16, 2018; Posts: 11)
    Thank you so much! I can't believe I haven't found this yet. Now I'm wondering if I missed a page of the documentation for it...
     
  10. jaydenm (Joined: Nov 16, 2018; Posts: 11)
    What is the best practice for benchmarking these Burst calls directly?
    At the moment I'm measuring it loosely via calls from an ECS system's Entities.ForEach, but that has somewhat variable performance depending on what else is going on in the game/Unity.