Search Unity

  1. Megacity Metro Demo now available. Download now.
    Dismiss Notice
  2. Unity support for visionOS is now available. Learn more in our blog post.
    Dismiss Notice

Single thread burst job is way faster than multithreaded burst job!

Discussion in 'Burst' started by calabi, Jan 25, 2021.

  1. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    So I have this job with an struct I'm using and getting weird behaviour or maybe it is normal. I'm calculating the below code on 1 million entities. Using Run() it completes in about 230ms, but using Schedule() it takes about 760ms, and using ScheduleParrellel it takes 315ms(and the profiler shows its using all 14 threads for that time. I know multithreaded jobs have an overhead for them being scheduled but this seems inordinately longer on multiple threads. Am I doing something wrong or just misunderstanding something?

    Code (CSharp):
    1. Entities.WithName("FindNeigbourCell").ForEach((DynamicBuffer<FlowfieldNeigborEntities> flowbuff, in FlowFieldData fluffydata, in CellData tempcelldata) =>
    2.         {
    3.             NativeList<int2> tempint2list = new NativeList<int2>(Allocator.Temp);
    4.  
    5.             tempGetNeigbourindices.Execute(tempcelldata.gridIndex, fluffydata.gridSize, ref tempint2list);
    6.  
    7.             for (int i = 0; i < tempint2list.Length; i++)
    8.             {
    9.                 flowbuff.Add(tempint2list[i]);
    10.             }
    11.  
    12.             tempint2list.Dispose();
    13.  
    14.         }).Run();
    15.  
    16. public struct GetNeighborIndices
    17. {
    18.     public int2 None;
    19.     public int2 North;
    20.     public int2 South;
    21.     public int2 East;
    22.     public int2 West;
    23.     public int2 NorthEast;
    24.     public int2 NorthWest;
    25.     public int2 SouthEast;
    26.     public int2 SouthWest;
    27.  
    28.     //public NativeArray<int2> thegroddirections;
    29.     public static readonly int2[] thegroddirections = new int2[4] {new int2(0, 1), new int2(0, -1), new int2(1, 0), new int2(-1, 0) };
    30.  
    31.     //public NativeList<int2> results;
    32.  
    33.     public NativeList<int2> Execute(int2 originIndex, int2 gridSize, ref NativeList<int2> results)
    34.     {
    35.         //thegroddirections = new NativeArray<int2>();
    36.         //NativeList<int2> = new NativeList<int2>()
    37.         //thegroddirections = new int2[4]();
    38.  
    39.         //None = new int2(0, 0);
    40.         North = new int2(0, 1);
    41.         South = new int2(0, -1);
    42.         East = new int2(1, 0);
    43.         West = new int2(-1, 0);
    44.         //NorthEast = new int2(1, 1);
    45.         //NorthWest = new int2(-1, 1);
    46.         //SouthEast = new int2(1, -1);
    47.         //SouthWest = new int2(-1, -1);
    48.  
    49.         //thegroddirections[0] = None;
    50.         //thegroddirections[0] = North;
    51.         //thegroddirections[1] = South;
    52.         //thegroddirections[2] = East;
    53.         //thegroddirections[3] = West;
    54.  
    55.         //int2[] thegroddirections = new int2[] { };
    56.         //thegroddirections[5] = NorthEast;
    57.         //thegroddirections[6] = NorthWest;
    58.         //thegroddirections[7] = SouthEast;
    59.         //thegroddirections[8] = SouthWest;    
    60.  
    61.         foreach (int2 curDirection in thegroddirections)
    62.         {
    63.             int2 neighborIndex = GetIndexAtRelativePosition(originIndex, curDirection, gridSize);
    64.  
    65.             if (neighborIndex.x >= 0)
    66.             {
    67.                 results.Add(neighborIndex);
    68.             }
    69.         }
    70.         return results;
    71.     }
    72.  
    73.     private int2 GetIndexAtRelativePosition(int2 originPos, int2 relativePos, int2 gridSize)
    74.     {
    75.  
    76.         int2 finalPos = originPos + relativePos;
    77.         if (finalPos.x < 0 || finalPos.x >= gridSize.x || finalPos.y < 0 || finalPos.y >= gridSize.y)
    78.         {
    79.             return new int2(-1, -1);
    80.         }
    81.         else
    82.         {
    83.             return finalPos;
    84.         }
    85.     }
    86. }
    87.  
     
  2. Razmot

    Razmot

    Joined:
    Apr 27, 2013
    Posts:
    346
    yes you are doing too little processing in there for multithread to be useful :)
    but if you use withBurst() and you replace your ifs by ternary operators its s gonna be way faster !
    eg
    Code (CSharp):
    1.  
    2.  
    3. int2 neighborIndex;
    4.             foreach (int2 curDirection in thegroddirections)            //might have to use for with burst
    5.             {
    6.                 neighborIndex = originPos + relativePos;
    7.                 neighborIndex = neighborIndex.x < 0 || neighborIndex.x >= gridSize.x || neighborIndex.y < 0 || neighborIndex.y >= gridSize.y ? neighborIndex : int2(-1, -1);
    8.                 results.Add(neighborIndex);//just discard the -1,-1 in the calling method
    9.             }
     
    calabi likes this.
  3. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    Thanks for the reply, I figured it was something I was doing. I'd heard if's were not optimal for burst but wasn't sure what to replace them with, I guess I'll have to look more into how to use Burst properly.
     
  4. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    charleshendry likes this.
  5. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,222
    If you aren't willing to learn assembly, then just make sure that assembly code is generated and move on until you have performance issues.

    Burst alone doesn't explain the performance discrepancies you are seeing with the different approaches. Screenshots of the profiler or profiler dumps help a lot.
     
    calabi likes this.
  6. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    I did the above changes as recommended by Razmat, and it didn't bring much of a difference to be honest. These are the profiler results I don't know if these help. But the first image is done using Run(), second is Schedule() and third is ScheduleParrallel().


     
  7. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,222
    None of those are running with Burst. That's probably because you are profiling the first execution of the job in the Editor, before Burst has had a chance to compile the job.
     
    calabi likes this.
  8. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    That's kind of confusing I had no idea that's how it worked its hard to find things in the profiler.

    I think I've found them?




    The schedule() seems much better, but the parallel() still seems a bit long(if that is the correct one).
     
  9. jasons-novaleaf

    jasons-novaleaf

    Joined:
    Sep 13, 2012
    Posts:
    181
    I find it pretty hard to read your code, but if I'm understanding correctly, you are doing some accumulation of all neighboring cells into your cell.

    If that's true, you are blowing cache, and probably not going to get very accurate results due to race conditions.

    Also an
    int2[]
    isn't going to be burst compileable is it? as it's a c# array (object) not a struct.

    so I stared at your code a bit longer.

    What does
    tempGetNeigbourindices.Execute()
    do? it's not shown in your snippet.

    Also how many times does
    Entities.WithName("FindNeigbourCell").ForEach()
    loop every frame? You are doing an
    Allocator.Temp
    inside it. Maybe use Allocator.TempJob? Better would be to move the allocation outside of your ForEach() entirely. I bet that allocation is your problem.
     
    Last edited: Jan 26, 2021
  10. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,222
    Alright. That is strange behavior.

    So what version of Burst are you using? And how many entities do you have? Also are safety checks, leaks detection, and the jobs debugger enabled while you are performance-testing? Turn those off if they are.

    I'm suspicious that temp allocator is using a fallback allocator at a high frequency, and that fallback allocator is locking resources which is causing thread contention. Since there is a very small number of elements in your readonly array, I would try using FixedList64<int2> instead of allocating a NativeList. This will allocate a list on the stack that can store up to 7 items, which is more than enough for your code.
     
    Timboc and calabi like this.
  11. Razmot

    Razmot

    Joined:
    Apr 27, 2013
    Posts:
    346
    you need to do WithBurst().Run()
     
  12. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,222
    Not true. Burst is assumed when possible, and the second set of profile captures clearly show Burst running the job.
     
    Razmot, calabi and Lukas_Kastern like this.
  13. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    954
    Run being faster than Schedule with a low entity amount was acknowledged by the DOTS team a long time ago and they are working on it.

    Bursted ForEach lambda jobs are not showing up with a (Burst) tag in the profiler but I trust that it is bursted code.

    For a better analysis what's going on you'd have to "Deep Profile" and see where the time is going. If it's just a bunch of internal Entities code for getting the data there's not much you can do than wait for thing to get better.
    I have only skimmed the code, I don't think allocation is a problem but it can surely be optimized further in case the allocation is indeed triggering the fallback allocator like @DreamingImLatios mentioned.
     
    Last edited: Jan 26, 2021
  14. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232

    Wow! That made a difference I just did DreamingImLatios recommendation and changed Nativelist to FixedList64 and there is the result. I guess I'll have to be careful with how I allocate data inside a job from now on. Thanks a lot @DreamingImLatios.
     
    DreamingImLatios, Enzi and Timboc like this.
  15. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    954
    Wow, that's a pretty drastic jump. I personally never used a NativeList in jobs, so good to know it's pretty much a no-go.
     
  16. jdtec

    jdtec

    Joined:
    Oct 25, 2017
    Posts:
    302
    I know that math.select is faster when you can use it, however, I thought ternary operators would just produce similar assembly to an if statement anyway, is that not the case?
     
  17. Razmot

    Razmot

    Joined:
    Apr 27, 2013
    Posts:
    346
    nope, I can confirm the ternary produces the same burst code than select - it was in the release notes of burst at a point, and I use it.

    It does evaluate both sides of the ? like with select, but I've yet to see a case where evaluating both statements is less performant than branching !

    btw burst does make use of
    [MethodImpl(MethodImplOptions.AggressiveInlining)] too and it can make a huge difference in a core loop - burst inlines a lot without it but I had a 8 lines method doing some maths that was not inlined without the attribute

    and use ref / in parameters for anything bigger than an float with a 64bit cpu ( like float2 , int3 ... )
     
    charleshendry and jdtec like this.
  18. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,222
    Just to clear some things up:
    Using NativeList by itself wasn't the issue. The issue was allocating a new NativeList for each entity for a large amount of entities. This was a very technical issue and by no means a rookie mistake. There is still room for improvement, but the code is now in "Performance by Default" territory which may be "good enough".

    As for ternary operators, sometimes Burst compiles them the same as math.select, sometimes it doesn't. It really depends on the generated IL and Burst version used. math.select is the most surefire tool when it applies.
     
    SenseEater, charleshendry and Razmot like this.
  19. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    I finally graduated from non rookie mistakes, awesome, :D thanks DreamingImLatios. (until I make the next rookie mistake) . :(
     
  20. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,754
    jasons-novaleaf likes this.
  21. jdtec

    jdtec

    Joined:
    Oct 25, 2017
    Posts:
    302
    Well that was nice, I rewrote some if() style branching code for a GetGridNeighbours function to use math.select instead and got a surprising (to me) 10* speed-up in burst!
     
    Razmot likes this.
  22. Micz84

    Micz84

    Joined:
    Jul 21, 2012
    Posts:
    448
    calabi and jdtec like this.
  23. MartinTilo

    MartinTilo

    Unity Technologies

    Joined:
    Aug 16, 2017
    Posts:
    2,431
    Hi there, you might want to try the script I posted as a workaround to a more integrated search and filter workflow (which is on our feature request backlog) into this thread.

    Also, Flow Event visualization should help find things belonging to the same schedule - execute - sync flow.
     
    calabi likes this.
  24. calabi

    calabi

    Joined:
    Oct 29, 2009
    Posts:
    232
    I just tried the script it works great, thanks a lot. I can't wait to see the new stuff.
     
    MartinTilo likes this.