Question IJobParallelForTransform and Burst?

Discussion in 'Burst' started by Nyanpas, Aug 17, 2020.

  1. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    Greetings,
    I am attempting to set up a TransformJob and am curious to see if/how it will support Burst. However, I currently have one issue, and it is this line:

    Code (csharp):
    transform.localRotation = math.mul(StartRotation,
        math.mul(quaternion.Euler(0f, _eulers.y, 0f), quaternion.Euler(_eulers.x, 0f, _eulers.z)));
    The Burst Inspector's LLVM IR Optimisation Diagnostics window states the following:
    When I looked at the quaternion.cs code in the GitHub repository, I couldn't find exactly where the select is, except in LookRotation and the safe option(s).

    So, what am I doing wrong? Please help. ;w;
     
  2. RyancHaynes

    Joined:
    Dec 8, 2018
    Posts:
    11
  3. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    I have done a lot more testing now, and it seems to come down to IJobParallelFor not supporting Burst? I even tried it with an empty body, and it still could not vectorize. I keep getting "loop control flow is not understood by vectorizer", even in an IJobParallelFor now. The error code is "unknown:0:0", by the by.
     
  4. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    Here is an example of a simple IJobParallelFor that is not understood by the vectorizer:
    Code (CSharp):
    [BurstCompile]
    public struct ForJob : IJobParallelFor
    {
        [ReadOnly]
        public NativeArray<float3> InPosition;

        [WriteOnly]
        public NativeArray<float3> OutPosition;

        [ReadOnly]
        public float DeltaTime;

        public void Execute(int index)
        {
            OutPosition[index] = OutPosition[index] + InPosition[index] * DeltaTime;
        }
    }
    The Burst Inspector's LLVM IR Optimisation Diagnostics window responds with:
    Remark: unknown:0:0: loop not vectorized: loop control flow not understood by vectorizer


    What am I missing?
     
  5. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    Ok, just tried the "successful vectorization" example from the SlideShare:
    Code here:
    Code (CSharp):
    [BurstCompile]
    public struct VectorizeDemo : IJob
    {
        public NativeArray<int> Inputs;
        public NativeArray<int> Outputs;

        public void Execute()
        {
            for (int i = 0; i < Inputs.Length; ++i)
            {
                if (Inputs[i] >= 0)
                {
                    Outputs[i] = Inputs[i];
                }
                else
                {
                    Outputs[i] = 0;
                }
            }
        }
    }
    And it still gets:
    Remark: unknown:0:0: loop not vectorized: loop control flow not understood by vectorizer


    What is going on?
     
  6. DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,984
    For the example, uncheck Safety Checks in the Burst inspector.

    For everything else, I have never been able to get the autovectorizer to kick in when using types defined by Mathematics.

    Keep in mind Burst still makes a huge difference even for scalar code, so unless this is a real bottleneck, I wouldn't fret too much about it.
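
    As a per-job alternative to the inspector checkbox, safety checks can also be switched off on the attribute itself. The following is a minimal sketch, assuming a Burst version whose BurstCompile attribute exposes the DisableSafetyChecks option; the struct name VectorizeDemoNoChecks is hypothetical and not from this thread.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical copy of VectorizeDemo with safety checks disabled for this job only.
// Assumption: the installed Burst version supports DisableSafetyChecks.
[BurstCompile(DisableSafetyChecks = true)]
public struct VectorizeDemoNoChecks : IJob
{
    [ReadOnly] public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;

    public void Execute()
    {
        // With checks disabled, out-of-range indices are no longer caught,
        // so Outputs must be at least as long as Inputs.
        for (int i = 0; i < Inputs.Length; ++i)
        {
            Outputs[i] = Inputs[i] >= 0 ? Inputs[i] : 0;
        }
    }
}
```

    Whether the loop then vectorizes still has to be confirmed in the Burst Inspector; the attribute only removes the bounds and safety-handle checks from the generated code.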
     
  7. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    It was a bit confusing to see how the "Safety Checks"-option works because I couldn't directly see the relation...
    This is unfortunately the bottleneck of the project. It's the system that assigns rotations for the animation system which will be used for a few thousand NPCs' and the player's joints. This is the only significant system in the game that will run every frame, however I need the frametime bandwidth to also allow for AI state machine updates (runs 4 times per second) and culling/other graphics jobs (like mesh creation).

    I have looked into vertex animation for distant NPCs, but unfortunately it has limits, as it is not as dynamic. The two approaches would have to be switched between, and making a system for this could potentially be a little out of scope for now, but if there is no other option I might look into it next.
     
  8. sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    Couple of things:
    1. If you are using an already pre-vectorized type the compiler will generally not vectorize the loop. This is because the cost of undoing your vectorization (using a float3) and turning it into the 'proper' vectorized type (float4/float8) would outweigh any benefits.
    2. You can see with the Burst Inspector that the body of your ForJob is using the vector unit (vmulps -> vaddps).
    3. For the VectorizeDemo job - @DreamingImLatios is correct that the safety checks are what is causing the vectorization to be disabled there. We've got some longer-term plans to try and make LLVM understand this, but at present it's a really thorny issue in the compiler to work around.
    For 1. there is one additional workaround: if you specify arrayLength * 3 when you schedule the ForJob, you can do the following:

    Code (CSharp):
    [BurstCompile]
    public struct ForJob : IJobParallelFor
    {
        [ReadOnly]
        public NativeArray<float3> InPosition;

        [WriteOnly]
        public NativeArray<float3> OutPosition;

        [ReadOnly]
        public float DeltaTime;

        public void Execute(int index)
        {
            // when you schedule the job, remember to do arrayLength * 3!
            var actualIn = InPosition.Reinterpret<float>(UnsafeUtility.SizeOf<float3>());
            var actualOut = OutPosition.Reinterpret<float>(UnsafeUtility.SizeOf<float3>());

            actualOut[index] = actualOut[index] + actualIn[index] * DeltaTime;
        }
    }
    Which turns the loop into:

    Code (CSharp):
    .LBB0_11:
    .Ltmp11:

            .cv_inline_site_id 1 within 0 inlined_at 1 0 0
            === MathTest.cs(506, 1)            actualOut[index] = actualOut[index] + actualIn[index] * DeltaTime;
            vmulps        ymm2, ymm1, ymmword ptr [rsi + 4*rax - 96]
            vmulps        ymm3, ymm1, ymmword ptr [rsi + 4*rax - 64]
            vmulps        ymm4, ymm1, ymmword ptr [rsi + 4*rax - 32]
            vmulps        ymm5, ymm1, ymmword ptr [rsi + 4*rax]
            vaddps        ymm2, ymm2, ymmword ptr [rdi + 4*rax - 96]
            vaddps        ymm3, ymm3, ymmword ptr [rdi + 4*rax - 64]
            vaddps        ymm4, ymm4, ymmword ptr [rdi + 4*rax - 32]
            vaddps        ymm5, ymm5, ymmword ptr [rdi + 4*rax]
    .Ltmp12:
            vmovups        ymmword ptr [rdi + 4*rax - 96], ymm2
            vmovups        ymmword ptr [rdi + 4*rax - 64], ymm3
            vmovups        ymmword ptr [rdi + 4*rax - 32], ymm4
            vmovups        ymmword ptr [rdi + 4*rax], ymm5
            add        rax, 32
            cmp        rbx, rax
            jne        .LBB0_11
    That is with safety checks off. It's not the prettiest code, but it should get you an additional 25% perf on SSE because you are using all 4 vector elements of each mul/add pair. It's on my wish list that LLVM would improve its vectorization to be able to automatically do the above transformation for y'all, but at present this is the best we've got :)
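
    The scheduling side of this workaround could be sketched as follows. This is a variation rather than the exact code above: it reinterprets the float3 arrays once on the main thread and passes float views to the job, so the scheduled length is simply the float view's length (arrayLength * 3). FlatForJob, FlatForJobExample and the batch size of 64 are hypothetical choices, and the output array is not marked [WriteOnly] because it is also read.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical variant of ForJob that receives the arrays already viewed as floats.
[BurstCompile]
public struct FlatForJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> In;

    // Read and written in Execute, so it is not marked [WriteOnly].
    public NativeArray<float> Out;

    public float DeltaTime;

    public void Execute(int index)
    {
        Out[index] = Out[index] + In[index] * DeltaTime;
    }
}

public static class FlatForJobExample
{
    // Adds input[i] * deltaTime to output[i] in place, one float component at a time.
    public static void Run(NativeArray<float3> input, NativeArray<float3> output, float deltaTime)
    {
        // Alias the float3 buffers as float buffers; no data is copied.
        var flatIn = input.Reinterpret<float>(UnsafeUtility.SizeOf<float3>());
        var flatOut = output.Reinterpret<float>(UnsafeUtility.SizeOf<float3>());

        var job = new FlatForJob { In = flatIn, Out = flatOut, DeltaTime = deltaTime };

        // flatOut.Length equals output.Length * 3; 64 is an arbitrary batch size.
        job.Schedule(flatOut.Length, 64).Complete();
    }
}
```

    The reinterpreted arrays are views over the same memory, so only the original float3 arrays need to be disposed.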
     
  9. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    Thank you so much for this. :3c

    The reason for the float3 is they are 3D-positions or Euler-rotations for objects. However, would it be more "efficient" to just use float4 and leave the .w at 0, using just .xyz?
     
  10. sheredom

    Unity Technologies

    Joined:
    Jul 15, 2019
    Posts:
    300
    So if you wanted to get really optimal by default, then you'd want to split out the .x's, .y's and .z's into separate NativeArrays of data - that's the most optimal way to deal with data in a data-oriented fashion.

    That being said, for this example I think the float3's are fine, and using my... hack? above will get you a bit more performance out of the same code!
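
    For comparison, the split layout described above might look like the sketch below. It assumes the positions are stored as three separate float arrays per direction; SplitForJob and its field names are hypothetical, and whether the loop vectorizes should still be checked in the Burst Inspector with safety checks off.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical structure-of-arrays version of ForJob: one float array per component.
[BurstCompile]
public struct SplitForJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> InX;
    [ReadOnly] public NativeArray<float> InY;
    [ReadOnly] public NativeArray<float> InZ;

    // Read and written in Execute, so these are not marked [WriteOnly].
    public NativeArray<float> OutX;
    public NativeArray<float> OutY;
    public NativeArray<float> OutZ;

    public float DeltaTime;

    public void Execute(int index)
    {
        // Each component array is contiguous in memory and processed independently.
        OutX[index] = OutX[index] + InX[index] * DeltaTime;
        OutY[index] = OutY[index] + InY[index] * DeltaTime;
        OutZ[index] = OutZ[index] + InZ[index] * DeltaTime;
    }
}
```

    Such a job would be scheduled with the element count itself, for example job.Schedule(job.OutX.Length, 64), since each index now covers one x, one y and one z value.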
     
  11. Nyanpas

    Joined:
    Dec 29, 2016
    Posts:
    406
    Hmm, I have had a long think about data management for this now, and it seems that regardless of what I do there will unfortunately be a lot of floats floating around, but they could perhaps be grouped by something (but then we are back to why not just use float3-4s)? I am rather new to all of this and unfortunately kind of tied to MonoBehaviours for legacy reasons.