Question How to best organize and refactor intrinsics code for readability and re-use while maintaining perf

Discussion in 'Burst' started by Per-Morten, Apr 7, 2022.

  1. Per-Morten

    I've been writing some performance-sensitive routines with AVX intrinsics lately, and I'm struggling to make the code more readable at a "high level" and easier to reuse while still ensuring that Burst generates the code I want. For instance, the InitialJob below is part of a routine that creates a clip space AABB from an AABB in world space. For this part Burst generates pretty good code, unrolling the loop, etc.

    However, I need to do the exact same transformation to clip space elsewhere (same with the matrix multiply and perspective division), so it would be nice to abstract it out to helper functions somehow.

    I tried something like the RefactoredJob, but it's pretty difficult to refactor like that without confusing the compiler. I have to aggressively inline essentially all my functions, I usually have to remember to declare my arguments as 'in' arguments, and unless I explicitly specify the layout of Float3V256, the loop in CreateClipSpaceAABB isn't unrolled properly. Even when I do all that, there are situations where the compiler trips up and I have no clue why. To me, this makes these refactorings quite fragile and a bit scary: someone in the future might do something as simple as changing the explicit struct layout of Float3V256 to a sequential layout (which is seemingly the same?) and suddenly generate worse code in some performance-critical path. There's also the need to guard all the AVX instructions with 'if (IsXXXSupported)' (omitted here for brevity), which is a bit annoying when there are helper functions and abstractions that are only relevant for the AVX path.
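
    For reference, the 'if (IsXXXSupported)' guard mentioned above usually looks roughly like the sketch below. ScaleJob, Values and Scale are made-up names for illustration and are not part of the routine in this post; the sketch assumes the Burst, Collections and Jobs packages are installed. The scalar loop at the end doubles as the fallback path and as the tail handler.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using static Unity.Burst.Intrinsics.X86;

// Hypothetical job, only meant to show the shape of a guarded AVX path.
[BurstCompile]
public unsafe struct ScaleJob : IJob
{
    public NativeArray<float> Values;
    public float Scale;

    public void Execute()
    {
        var ptr = (float*)Values.GetUnsafePtr();
        int i = 0;

        // AVX intrinsics may only be used inside this guard.
        if (Avx.IsAvxSupported)
        {
            var scale = Avx.mm256_set1_ps(Scale);
            for (; i + 8 <= Values.Length; i += 8)
            {
                var v = Avx.mm256_loadu_ps(ptr + i);
                Avx.mm256_storeu_ps(ptr + i, Avx.mm256_mul_ps(v, scale));
            }
        }

        // Scalar path: handles everything without AVX, and the leftover tail with AVX.
        for (; i < Values.Length; i++)
        {
            ptr[i] *= Scale;
        }
    }
}
```

    Any helper that uses AVX intrinsics internally has to be called from inside such a guard as well, which is the part that gets annoying once helpers are shared.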

    So, I'm wondering, are there any best practices or anything on how to tackle these problems? So far we've kinda decided to just say 'screw it' and write the AVX code "abstraction free" since it isn't a tool we use unless we really need to and the code will rarely change. Still, it would be nice to hear if anyone has any experience with this and how they've dealt with it.

    Code (CSharp):
    using System.Runtime.InteropServices;
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;
    using static Unity.Burst.Intrinsics.X86.Avx;
    using static Unity.Burst.Intrinsics.X86.Fma;

    // Eight floats laid out contiguously, so one unaligned 256-bit load/store covers the struct.
    [StructLayout(LayoutKind.Sequential)]
    public struct Float8
    {
        public float v0;
        public float v1;
        public float v2;
        public float v3;
        public float v4;
        public float v5;
        public float v6;
        public float v7;
    }

    [StructLayout(LayoutKind.Sequential)]
    public struct Float8x3
    {
        public Float8 x;
        public Float8 y;
        public Float8 z;
    }

    // Eight AABBs stored in SoA form.
    [StructLayout(LayoutKind.Sequential)]
    public struct AABB8
    {
        public Float8x3 min;
        public Float8x3 max;
    }

    [BurstCompile]
    public struct InitialJob
        : IJobParallelFor
    {
        [ReadOnly]
        public float4x4 World_to_Clip;

        [ReadOnly]
        public NativeArray<AABB8> AABB_in_World;

        [WriteOnly]
        public NativeArray<AABB8> AABB_in_Clip;

        public unsafe void Execute(int index)
        {
            // Broadcast every matrix element to all 8 lanes.
            var world_to_clip_c0_x = mm256_set1_ps(World_to_Clip.c0.x);
            var world_to_clip_c0_y = mm256_set1_ps(World_to_Clip.c0.y);
            var world_to_clip_c0_z = mm256_set1_ps(World_to_Clip.c0.z);
            var world_to_clip_c0_w = mm256_set1_ps(World_to_Clip.c0.w);

            var world_to_clip_c1_x = mm256_set1_ps(World_to_Clip.c1.x);
            var world_to_clip_c1_y = mm256_set1_ps(World_to_Clip.c1.y);
            var world_to_clip_c1_z = mm256_set1_ps(World_to_Clip.c1.z);
            var world_to_clip_c1_w = mm256_set1_ps(World_to_Clip.c1.w);

            var world_to_clip_c2_x = mm256_set1_ps(World_to_Clip.c2.x);
            var world_to_clip_c2_y = mm256_set1_ps(World_to_Clip.c2.y);
            var world_to_clip_c2_z = mm256_set1_ps(World_to_Clip.c2.z);
            var world_to_clip_c2_w = mm256_set1_ps(World_to_Clip.c2.w);

            var world_to_clip_c3_x = mm256_set1_ps(World_to_Clip.c3.x);
            var world_to_clip_c3_y = mm256_set1_ps(World_to_Clip.c3.y);
            var world_to_clip_c3_z = mm256_set1_ps(World_to_Clip.c3.z);
            var world_to_clip_c3_w = mm256_set1_ps(World_to_Clip.c3.w);

            var aabb_in_world = AABB_in_World[index];
            var aabb_min_in_world_x = mm256_loadu_ps(&aabb_in_world.min.x);
            var aabb_min_in_world_y = mm256_loadu_ps(&aabb_in_world.min.y);
            var aabb_min_in_world_z = mm256_loadu_ps(&aabb_in_world.min.z);

            var aabb_max_in_world_x = mm256_loadu_ps(&aabb_in_world.max.x);
            var aabb_max_in_world_y = mm256_loadu_ps(&aabb_in_world.max.y);
            var aabb_max_in_world_z = mm256_loadu_ps(&aabb_in_world.max.z);

            var aabb_min_in_screen_x = mm256_set1_ps(float.MaxValue);
            var aabb_min_in_screen_y = mm256_set1_ps(float.MaxValue);
            var aabb_min_in_screen_z = mm256_set1_ps(float.MaxValue);

            var aabb_max_in_screen_x = mm256_set1_ps(float.MinValue);
            var aabb_max_in_screen_y = mm256_set1_ps(float.MinValue);
            var aabb_max_in_screen_z = mm256_set1_ps(float.MinValue);

            var max_mask = mm256_set1_epi32(-1);
            var min_mask = mm256_setzero_si256();

            for (int i = 0; i < 8; i++)
            {
                // Create all combinations of 8 vertices making up the bounding box
                var x_mask = (i & (1 << 2)) == 0 ? min_mask : max_mask;
                var y_mask = (i & (1 << 1)) == 0 ? min_mask : max_mask;
                var z_mask = (i & (1 << 0)) == 0 ? min_mask : max_mask;
                var v_x = mm256_blendv_ps(aabb_min_in_world_x, aabb_max_in_world_x, x_mask);
                var v_y = mm256_blendv_ps(aabb_min_in_world_y, aabb_max_in_world_y, y_mask);
                var v_z = mm256_blendv_ps(aabb_min_in_world_z, aabb_max_in_world_z, z_mask);

                // Matrix multiply. v_w is not part of the calculation since it would just be 1 anyway, so we can just use world_to_clip_c3_[xyzw] rather than multiplying it with v_w.
                var v_in_clip_x = mm256_fmadd_ps(world_to_clip_c0_x, v_x, mm256_fmadd_ps(world_to_clip_c1_x, v_y, mm256_fmadd_ps(world_to_clip_c2_x, v_z, world_to_clip_c3_x)));
                var v_in_clip_y = mm256_fmadd_ps(world_to_clip_c0_y, v_x, mm256_fmadd_ps(world_to_clip_c1_y, v_y, mm256_fmadd_ps(world_to_clip_c2_y, v_z, world_to_clip_c3_y)));
                var v_in_clip_z = mm256_fmadd_ps(world_to_clip_c0_z, v_x, mm256_fmadd_ps(world_to_clip_c1_z, v_y, mm256_fmadd_ps(world_to_clip_c2_z, v_z, world_to_clip_c3_z)));
                var v_in_clip_w = mm256_fmadd_ps(world_to_clip_c0_w, v_x, mm256_fmadd_ps(world_to_clip_c1_w, v_y, mm256_fmadd_ps(world_to_clip_c2_w, v_z, world_to_clip_c3_w)));

                // Perspective division using the approximate reciprocal of w.
                var w_rcp = mm256_rcp_ps(v_in_clip_w);

                var v_in_clip_normalized_x = mm256_mul_ps(v_in_clip_x, w_rcp);
                var v_in_clip_normalized_y = mm256_mul_ps(v_in_clip_y, w_rcp);
                var v_in_clip_normalized_z = mm256_mul_ps(v_in_clip_z, w_rcp);

                aabb_min_in_screen_x = mm256_min_ps(aabb_min_in_screen_x, v_in_clip_normalized_x);
                aabb_min_in_screen_y = mm256_min_ps(aabb_min_in_screen_y, v_in_clip_normalized_y);
                aabb_min_in_screen_z = mm256_min_ps(aabb_min_in_screen_z, v_in_clip_normalized_z);

                aabb_max_in_screen_x = mm256_max_ps(aabb_max_in_screen_x, v_in_clip_normalized_x);
                aabb_max_in_screen_y = mm256_max_ps(aabb_max_in_screen_y, v_in_clip_normalized_y);
                aabb_max_in_screen_z = mm256_max_ps(aabb_max_in_screen_z, v_in_clip_normalized_z);
            }

            var res = (AABB8)default;
            mm256_storeu_ps(&res.min.x, aabb_min_in_screen_x);
            mm256_storeu_ps(&res.min.y, aabb_min_in_screen_y);
            mm256_storeu_ps(&res.min.z, aabb_min_in_screen_z);

            mm256_storeu_ps(&res.max.x, aabb_max_in_screen_x);
            mm256_storeu_ps(&res.max.y, aabb_max_in_screen_y);
            mm256_storeu_ps(&res.max.z, aabb_max_in_screen_z);
            AABB_in_Clip[index] = res;
        }
    }


    Code (CSharp):
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;
    using Unity.Burst;
    using Unity.Burst.Intrinsics;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;
    using static Unity.Burst.Intrinsics.X86.Avx;
    using static Unity.Burst.Intrinsics.X86.Fma;

    // Float8, Float8x3 and AABB8 are identical to the definitions in the first listing.

    // A float4x4 with every element broadcast to all 8 lanes of a v256.
    [StructLayout(LayoutKind.Explicit)]
    public struct Float4x4V256
    {
        [FieldOffset(0)] public v256 c0_x;
        [FieldOffset(32)] public v256 c0_y;
        [FieldOffset(64)] public v256 c0_z;
        [FieldOffset(96)] public v256 c0_w;

        [FieldOffset(128)] public v256 c1_x;
        [FieldOffset(160)] public v256 c1_y;
        [FieldOffset(192)] public v256 c1_z;
        [FieldOffset(224)] public v256 c1_w;

        [FieldOffset(256)] public v256 c2_x;
        [FieldOffset(288)] public v256 c2_y;
        [FieldOffset(320)] public v256 c2_z;
        [FieldOffset(352)] public v256 c2_w;

        [FieldOffset(384)] public v256 c3_x;
        [FieldOffset(416)] public v256 c3_y;
        [FieldOffset(448)] public v256 c3_z;
        [FieldOffset(480)] public v256 c3_w;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public Float4x4V256(in float4x4 mat)
        {
            c0_x = mm256_set1_ps(mat.c0.x);
            c0_y = mm256_set1_ps(mat.c0.y);
            c0_z = mm256_set1_ps(mat.c0.z);
            c0_w = mm256_set1_ps(mat.c0.w);

            c1_x = mm256_set1_ps(mat.c1.x);
            c1_y = mm256_set1_ps(mat.c1.y);
            c1_z = mm256_set1_ps(mat.c1.z);
            c1_w = mm256_set1_ps(mat.c1.w);

            c2_x = mm256_set1_ps(mat.c2.x);
            c2_y = mm256_set1_ps(mat.c2.y);
            c2_z = mm256_set1_ps(mat.c2.z);
            c2_w = mm256_set1_ps(mat.c2.w);

            c3_x = mm256_set1_ps(mat.c3.x);
            c3_y = mm256_set1_ps(mat.c3.y);
            c3_z = mm256_set1_ps(mat.c3.z);
            c3_w = mm256_set1_ps(mat.c3.w);
        }
    }

    [StructLayout(LayoutKind.Explicit)]
    public struct Float3V256
    {
        [FieldOffset(0)] public v256 x;
        [FieldOffset(32)] public v256 y;
        [FieldOffset(64)] public v256 z;
    }

    [StructLayout(LayoutKind.Explicit)]
    public struct Float4V256
    {
        [FieldOffset(0)] public v256 x;
        [FieldOffset(32)] public v256 y;
        [FieldOffset(64)] public v256 z;
        [FieldOffset(96)] public v256 w;
    }

    [StructLayout(LayoutKind.Explicit)]
    public struct AABB8V256
    {
        [FieldOffset(0)] public Float3V256 min;
        [FieldOffset(96)] public Float3V256 max;
    }

    // Note: in a real file the static helpers below have to live inside a static class.

    // Transforms 8 points at once. vec.w is implicitly 1, so the c3 column is added directly.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Float4V256 TransformPoint(in Float4x4V256 matrix, in Float3V256 vec)
    {
        return new Float4V256
        {
            x = mm256_fmadd_ps(matrix.c0_x, vec.x, mm256_fmadd_ps(matrix.c1_x, vec.y, mm256_fmadd_ps(matrix.c2_x, vec.z, matrix.c3_x))),
            y = mm256_fmadd_ps(matrix.c0_y, vec.x, mm256_fmadd_ps(matrix.c1_y, vec.y, mm256_fmadd_ps(matrix.c2_y, vec.z, matrix.c3_y))),
            z = mm256_fmadd_ps(matrix.c0_z, vec.x, mm256_fmadd_ps(matrix.c1_z, vec.y, mm256_fmadd_ps(matrix.c2_z, vec.z, matrix.c3_z))),
            w = mm256_fmadd_ps(matrix.c0_w, vec.x, mm256_fmadd_ps(matrix.c1_w, vec.y, mm256_fmadd_ps(matrix.c2_w, vec.z, matrix.c3_w))),
        };
    }

    // Divides xyz by w using the approximate reciprocal.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Float3V256 PerspectiveDivision(in Float4V256 vec)
    {
        var w_rcp = mm256_rcp_ps(vec.w);
        return new Float3V256
        {
            x = mm256_mul_ps(vec.x, w_rcp),
            y = mm256_mul_ps(vec.y, w_rcp),
            z = mm256_mul_ps(vec.z, w_rcp),
        };
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static AABB8V256 CreateClipSpaceAABB(Float4x4V256 mat, AABB8V256 aabb_in_world)
    {
        var aabb_in_screen = new AABB8V256
        {
            min = new Float3V256
            {
                x = mm256_set1_ps(float.MaxValue),
                y = mm256_set1_ps(float.MaxValue),
                z = mm256_set1_ps(float.MaxValue),
            },
            max = new Float3V256
            {
                x = mm256_set1_ps(float.MinValue),
                y = mm256_set1_ps(float.MinValue),
                z = mm256_set1_ps(float.MinValue),
            }
        };

        var max_mask = mm256_set1_epi32(-1);
        var min_mask = mm256_setzero_si256();

        for (int i = 0; i < 8; i++)
        {
            // Create all combinations of 8 vertices making up the bounding box
            var x_mask = (i & (1 << 2)) == 0 ? min_mask : max_mask;
            var y_mask = (i & (1 << 1)) == 0 ? min_mask : max_mask;
            var z_mask = (i & (1 << 0)) == 0 ? min_mask : max_mask;
            var v = new Float3V256
            {
                x = mm256_blendv_ps(aabb_in_world.min.x, aabb_in_world.max.x, x_mask),
                y = mm256_blendv_ps(aabb_in_world.min.y, aabb_in_world.max.y, y_mask),
                z = mm256_blendv_ps(aabb_in_world.min.z, aabb_in_world.max.z, z_mask),
            };

            var v_in_clip = TransformPoint(mat, v);
            var v_in_clip_normalized = PerspectiveDivision(v_in_clip);

            aabb_in_screen.min.x = mm256_min_ps(aabb_in_screen.min.x, v_in_clip_normalized.x);
            aabb_in_screen.min.y = mm256_min_ps(aabb_in_screen.min.y, v_in_clip_normalized.y);
            aabb_in_screen.min.z = mm256_min_ps(aabb_in_screen.min.z, v_in_clip_normalized.z);

            // The max bound has to grow with mm256_max_ps, matching InitialJob.
            aabb_in_screen.max.x = mm256_max_ps(aabb_in_screen.max.x, v_in_clip_normalized.x);
            aabb_in_screen.max.y = mm256_max_ps(aabb_in_screen.max.y, v_in_clip_normalized.y);
            aabb_in_screen.max.z = mm256_max_ps(aabb_in_screen.max.z, v_in_clip_normalized.z);
        }

        return aabb_in_screen;
    }

    [BurstCompile]
    public struct RefactoredJob
        : IJobParallelFor
    {
        [ReadOnly]
        public float4x4 World_to_Clip;

        [ReadOnly]
        public NativeArray<AABB8> AABB_in_World;

        [WriteOnly]
        public NativeArray<AABB8> AABB_in_Clip;

        public unsafe void Execute(int index)
        {
            var tmp = AABB_in_World[index];
            var aabb_in_world = new AABB8V256
            {
                min = new Float3V256
                {
                    x = mm256_loadu_ps(&tmp.min.x),
                    y = mm256_loadu_ps(&tmp.min.y),
                    z = mm256_loadu_ps(&tmp.min.z),
                },
                max = new Float3V256
                {
                    x = mm256_loadu_ps(&tmp.max.x),
                    y = mm256_loadu_ps(&tmp.max.y),
                    z = mm256_loadu_ps(&tmp.max.z),
                }
            };

            var world_to_clip = new Float4x4V256(World_to_Clip);
            var aabb_in_screen = CreateClipSpaceAABB(world_to_clip, aabb_in_world);

            var res = (AABB8)default;
            mm256_storeu_ps(&res.min.x, aabb_in_screen.min.x);
            mm256_storeu_ps(&res.min.y, aabb_in_screen.min.y);
            mm256_storeu_ps(&res.min.z, aabb_in_screen.min.z);

            mm256_storeu_ps(&res.max.x, aabb_in_screen.max.x);
            mm256_storeu_ps(&res.max.y, aabb_in_screen.max.y);
            mm256_storeu_ps(&res.max.z, aabb_in_screen.max.z);
            AABB_in_Clip[index] = res;
        }
    }

     
  2. Mortuus17

    I've written libraries whose entire purpose is to abstract away intrinsics and provide fallback code paths, but when it comes to system code that is time sensitive, there is almost no way you're gonna be able to dismantle and abstract away the entire thing. It is usually highly specialized code, and abstractions often, but not always, reduce performance. So being able to code in assembly directly and call that code from Burst, or at the very least to use intrinsics, is a valuable skill to have. And tbh the code looks too specialized for some other guy to refactor it without knowing the underlying math; reverse engineering takes too long for my taste.

    I just want to mention that using pointers or the 'in' parameter modifier (which just says that the argument is a pointer) is not only redundant with small functions (inlining), but can also confuse the compiler when there are two 'in' parameters. They are simply void* and thus the compiler has to assume the possibility of aliasing. Every compiler engineer will tell you to pass by value where possible, and a 64 byte struct (float4x4) is imo right at the edge of still being reasonable.
     
  3. nijnstein

    I saw some of your library MaxMath, and abstracting intrinsics away for every problem at hand is as hard as you say. I'm busy with a script engine that takes a script like the above and returns bytecode that mostly gets interpreted as SIMD code through IL2CPP/Burst. Because it leverages the fact that the script is run n times, it executes each step in the script as a vector operation, which is easily optimized once per step. The cost of interpreting is negligible if the dataset is large enough. It turns each complicated problem into a list of easily vectorized steps, avoiding the need to spend a lot of time on each performance-critical part. It is hard to set up, though, and far from finished.
     
  4. vectorized-runner

    Hi, can you elaborate more on this issue? I thought passing big structs by 'in' would be more performant and Burst knows that components passed by in are not aliasing? Especially matrices are really big, I pass them as ref or in most of the time.

    I'm asking because the math library also doesn't pass by ref or in.

    Edit: I might be wrong about Burst knowing 'in' not inlining, but I've read it prevented defensive copies
     
  5. Mortuus17

    That's quite nice :) Do you know how much RAM the compiler takes up at runtime?
    Although there is still the problem that a lot of code that could be vectorized (sometimes quite easily) by a human cannot be vectorized by a compiler. Branches especially ruin vectorization. But once your engine stands, it will only ever get better over time (because the compiler you communicate with will get better). I'd probably not use it for this reason, as well as because I actually like the task of low-level optimization by hand.

    I don't know of Burst knowing anything about 'in'. I remember xoof talking about the need for using it in Unity.Mathematics for code that's not Burst compiled, though. In the end and in simple terms, Burst doesn't end up compiling your code; it transpiles it to LLVM IR, which only sees a void* (again, if no magic happens with 'in' during transpilation, and I don't see how you'd do that other than passing by value instead. The compiler then sees dead reads and writes and inlines small functions instead).
    It can be more performant in some edge cases, but even for experienced users there are too many possibilities to mess it up, actually causing defensive copies where they otherwise wouldn't happen. Since we're talking structs here, I'd rather use unsafe code and use pointers myself. But usually DOD inherently forbids large structs anyways, so it shouldn't come up often. As for matrices in Burst code (meaning code where performance matters...), the worst you could get would be 8 SIMD registers having to be pushed/popped (double4x4). Not too bad.

    Here's a little post on 'in' I like a lot:
    https://devblogs.microsoft.com/premier-developer/the-in-modifier-and-the-readonly-structs-in-c/
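
    To make the defensive-copy point concrete, here is a small plain C# sketch (no Burst involved). Counter, InDemo and both method names are made up for illustration. Because Counter is not a readonly struct, every call to a non-readonly member through an 'in' parameter operates on a hidden copy, so the increments never accumulate. Passing by value makes a single, visible copy instead.

```csharp
using System;

// Hypothetical mutable struct; Increment is a non-readonly member.
public struct Counter
{
    public int Value;
    public int Increment() => ++Value;
}

public static class InDemo
{
    // 'in' on a non-readonly struct: the compiler copies c before each
    // Increment call, so both calls start from the caller's original value.
    public static int IncrementTwiceIn(in Counter c)
    {
        c.Increment();
        return c.Increment();
    }

    // By value: one explicit copy, and both calls mutate that same copy.
    public static int IncrementTwiceByValue(Counter c)
    {
        c.Increment();
        return c.Increment();
    }

    public static void Main()
    {
        var c = new Counter { Value = 10 };
        Console.WriteLine(IncrementTwiceIn(in c));    // prints 11
        Console.WriteLine(IncrementTwiceByValue(c));  // prints 12
        Console.WriteLine(c.Value);                   // prints 10
    }
}
```

    Marking the struct (or the member) readonly removes the hidden copies, which is the main recommendation of the linked post.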
     
  6. nijnstein

    Very little more than the node tree it generates. It's a very simple compiler, as it only does simple transformations on the script: it removes nesting and determines data types, but that's about it. Its interpreter for a single data stream does little SIMD (although it's very fast), but when fed multiple streams of data it's trivial to see how to rewrite

    Code (CSharp):
    a = b * c * 100 + extern(10, 2);
    into vectorized parts: it just executes each step with arrays of floats, so it becomes

    Code (CSharp):
    a[] = b[] * c[] * 100 + extern(10, 2)
    The interpreter then has a hot loop for every type of operation, which is easily compiled into whatever vector instructions are supported on the platform. It only works that way with multiple records of input data, and it can do very interesting optimizations, considering it only has to call the external function extern once for the whole set, as its parameters are constant.
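
    A rough sketch of that idea in plain C#, not taken from the actual engine: each operation gets its own tight loop over whole arrays, and the constant extern call is hoisted out and evaluated once. All names here are hypothetical, and Extern is just a stand-in that adds its arguments, since the real function isn't shown.

```csharp
using System;

public static class StepwiseInterpreterSketch
{
    // Hypothetical stand-in for the script's extern(10, 2).
    static float Extern(float x, float y) => x + y;

    // One simple hot loop per operation type; loops like these are easy
    // targets for auto-vectorization.
    static void Mul(float[] lhs, float[] rhs, float[] dst)
    {
        for (int i = 0; i < dst.Length; i++) dst[i] = lhs[i] * rhs[i];
    }

    static void MulScalar(float[] src, float scalar, float[] dst)
    {
        for (int i = 0; i < dst.Length; i++) dst[i] = src[i] * scalar;
    }

    static void AddScalar(float[] src, float scalar, float[] dst)
    {
        for (int i = 0; i < dst.Length; i++) dst[i] = src[i] + scalar;
    }

    // Evaluates a[] = b[] * c[] * 100 + extern(10, 2) one step at a time.
    public static float[] Evaluate(float[] b, float[] c)
    {
        if (b.Length != c.Length)
            throw new ArgumentException("b and c must have the same length");

        var a = new float[b.Length];
        float k = Extern(10f, 2f); // constant arguments: called once for the whole set

        Mul(b, c, a);
        MulScalar(a, 100f, a);
        AddScalar(a, k, a);
        return a;
    }

    public static void Main()
    {
        var a = Evaluate(new float[] { 1f, 2f }, new float[] { 3f, 4f });
        Console.WriteLine(string.Join(", ", a)); // prints "312, 812"
    }
}
```

    The trade-off is extra memory traffic, since every step walks the arrays again, in exchange for loops that are trivial to vectorize.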

    Due to how it works with Burst/C# it does become a lot of code, so large parts are created by a codegen that rewrites parts of blast depending on configuration.

    Currently it's limited in its use, but that is changing quickly.
     
  7. DreamingImLatios

    I do know that using "ref" on data types can sometimes be an optimization. I wrote a custom version of LocalToParentSystem that uses ref on the parent's LocalToWorld matrix in the recursion function. My version has additional arguments and does additional checks to help improve performance and correctness for when not all transforms change. However, even when all transforms change (worst case scenario), I was surprised to learn that my version performed 4% faster. So really it just comes down to profiling.
     
  8. Per-Morten

    Thanks for taking the time to give feedback. Seems like the "best" choice is to let it stay the way it is. Also, regarding in/ref parameters, I'm pretty sure I've seen situations where I didn't get the result I wanted unless I passed by ref/in, even when the function was inlined. I'm not tossing them around everywhere :)
     
  9. R2-RT
