Could Unity add in-editor tips and hints on how best to write Burst code?

Discussion in 'Burst' started by Arowx, Dec 16, 2018.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    CPUs have batch based SIMD instruction sets that can literally run multiple vector ops a cycle, and in theory the Burst compiler should be able to take advantage of these features.

    The thing is, as a programmer I have no idea what Burst can or cannot do with the code I give it...

    So if the Burst compiler gave me, the programmer, hints or tips based on how well my code compiled, or on how it could be changed to take advantage of SIMD instruction sets and other optimizations, wouldn't that massively improve what I can do with ECS?

    Burst is a black box to most of us. A few clever programmers will dig into the code it generates and learn to tweak their code to get the best from it, but for the rest of us OK programmers, context-relevant hints or tips in the editor would be ideal feedback.

    Also, as ECS is so complex, would some ECS tips on common misconceptions and mistakes be useful for getting people up to speed?

    At least until you bring out the Visual ECS programming where we just present the problem we want ECS to solve in graphical form and you do the rest...
     
  2. 5argon

    5argon

    Joined:
    Jun 10, 2013
    Posts:
    1,555
    I would like to see some examples of fixable bad ECS code patterns, and the tips that would be given for each pattern.

    All I can think of already exists in the form of library limitations that prevent you from going that way in the first place, like HPC# enforcement or aliasing prevention, which allow good assembly to be generated.

    Burst can also vectorize a NativeArray iteration loop automatically, which basically applies SIMD to the linear access for you. What other kind of automatic SIMD optimization would you like to have? For example, if you have 3 floats each being added to the same value, a tip to combine them into a float3 in the first place? (It would be quite bizarre if it told me to change my data structure design.)
     
  3. Arowx

    Check out this thread https://forum.unity.com/threads/branch-misprediction-in-systems.594241/

    Here a developer is looking into how to optimize their ECS system; they get some feedback and benchmark a few different approaches. I asked them to share their code and they did. They are getting the distance between two points and then normalizing the direction vector, so they are doing four calculations when they only need three (direction, magnitude, normalize by division).

    This is in the inner loop of an ECS system, and their original code had function calls within this inner loop.
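
    A minimal sketch of the redundancy described above. It is not the code from the linked thread: plain System.Numerics types stand in for Unity.Mathematics so it runs outside Unity, and the class and method names are hypothetical. An optimizing compiler may already merge the repeated work, so treat this as an illustration of the idea rather than a guaranteed speed-up.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
public static class DirectionExample
{
    // Redundant form: Distance and Normalize each build the delta vector
    // and compute its length, so that work is written out twice.
    public static (Vector2 dir, float dist) Redundant(Vector2 from, Vector2 to)
    {
        float dist = Vector2.Distance(from, to);
        Vector2 dir = Vector2.Normalize(to - from);
        return (dir, dist);
    }

    // Shared form: direction (delta), magnitude, then normalize by division.
    public static (Vector2 dir, float dist) Shared(Vector2 from, Vector2 to)
    {
        Vector2 delta = to - from;          // direction, not yet normalized
        float dist = delta.Length();        // magnitude, computed once
        // Guard against dividing by zero when both points coincide.
        Vector2 dir = dist > 0f ? delta / dist : Vector2.Zero;
        return (dir, dist);
    }
}
```

    Both methods return the same direction and distance for distinct points; only the second reuses the length it has already computed.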

    What if ECS warned you that calling a function has a big overhead that can reduce the performance of your system, or that repeating the same calculations is less than optimal?

    These are just good optimization tips and tricks that take time to learn, and more time to test and profile to make sure they improve the compiled code's performance. The developers of the Burst compiler probably know every trick and tip in the book. The trouble is that we don't, so we can end up giving the Burst compiler messy, slow code that it struggles to optimize.

    If you are a good or great programmer you won't need any hints or tips, as you will provide the Burst compiler with streamlined data and inline vectorised code that it will turn into great SIMD code. For the rest of us, a few tips and hints could mean the difference between thousands of cool effects and millions.
     
  4. 5argon

    OK, I have been to that thread. I think those optimizations are pretty specific to the problem? We understand them because we are human and know the overall context. How can a machine tell whether a function call (which could get an unknown amount of variable data to work on) is big or small before running it, other than by preventing function calls altogether, which is very drastic? Then we couldn't use properties either, and indexers are properties, and properties are functions.

    Plus, Burst already inlines things, so the fact that function calls are not excluded from the HPC# subset seems to be intended behaviour that is still performant by default.

    How do you determine the workload of code without actually running it? What definition of "inner/tight loop" can be detected from the assembly without a human looking at it?

    And if it were possible to develop an algorithm that could tell, for any algorithm, which repeated calculations (not consisting of constant values) are a wasteful mistake rather than intended behaviour, that alone could be a big thesis paper. Imagine throwing that at everything to optimize it... Even within the HPC# constraints we can still get outside data from native containers, so it remains uncertain until runtime whether a calculation is wasted or not.

    I am interested to know which of "every trick and tip in the book" are detectable by pattern/machine, so I can at least avoid them manually first. I believe this area still requires human support. For a machine to do it, another possibility may be to train a neural network on bad patterns until it magically finds one for us, but NNs are good at results and not good at explaining why.
     
    Last edited: Dec 16, 2018
  5. sngdan

    sngdan

    Joined:
    Feb 7, 2014
    Posts:
    1,154
    I would be happy if the Burst inspector could print the CPU cycles for the generated assembly lines.
     
  6. Arowx

    That kind of feedback summary, together with the hardware SIMD options/instructions that Burst could use if the program took a different approach or used a Burst optimization.
     
  7. sngdan

    I expect the compiler to do all the optimizations for me without much feedback. There will of course be a limit to this (I am not expecting magic).

    If we could see the CPU cycles (intelligently summarized for branches / code paths), we could, as an alternative to profiling, assess the effect of changes to our C# code on the final assembly (if we don't want to, or can't, write the assembly directly).
     
  8. 5argon

    Since C# jobs are self-contained to some degree, I think Unity-specific compiler optimization that allows rapid iteration on job design is very possible. Something like a super-incremental compiler that understands not just the assembly level but the job/struct level. Imagine editing job code and seeing the assembly update instantly, just as you can hot-edit a shader file and see the changes instantly. Dock a small panel locked to that job struct, set a reference compile, and see before/after CPU cycles. That would be a dream to work with. Currently I have to take screenshots and compare. A cycle count would be much more objective.
     
  9. sngdan

    Exactly this.
     
  10. sngdan

    @Joachim_Ante

    Is adding CPU cycles to the assembly something you could consider? I found myself today again in a situation where, like @5argon, I had to copy code into a text editor and go line by line...
     
  11. sngdan

    Here is another example where it would be nice to get some form of simple guidance on which variant translates to the most optimized assembly...

    Code (CSharp):
    using Unity.Mathematics;

    // Box is assumed to expose float2 Center and float2 Extends (half-size per axis).

    // Variant 1: vector compare, reduced with math.all.
    static bool IsColliding1 (Box b1, Box b2) {
        float2 d = math.abs(b1.Center - b2.Center);
        float2 e = b1.Extends + b2.Extends;
        return math.all(d <= e);
    }

    // Variant 2: vector math, per-component compare with short-circuit &&.
    static bool IsColliding2 (Box b1, Box b2) {
        float2 d = math.abs(b1.Center - b2.Center);
        float2 e = b1.Extends + b2.Extends;
        return (d.x <= e.x) && (d.y <= e.y);
    }

    // Variant 3: fully scalar, both axes evaluated before combining.
    static bool IsColliding3 (Box b1, Box b2) {
        bool x = math.abs(b1.Center.x - b2.Center.x) <= (b1.Extends.x + b2.Extends.x);
        bool y = math.abs(b1.Center.y - b2.Center.y) <= (b1.Extends.y + b2.Extends.y);
        return x && y;
    }

    // Variant 4: fully scalar, early return on the first separating axis.
    static bool IsColliding4 (Box b1, Box b2) {
        if (math.abs(b1.Center.x - b2.Center.x) > (b1.Extends.x + b2.Extends.x)) return false;
        if (math.abs(b1.Center.y - b2.Center.y) > (b1.Extends.y + b2.Extends.y)) return false;
        return true;
    }

    edit: corrected a typo - thx to @M_R (math.any to math.all)
     
    Last edited: Jan 25, 2019
  12. Deleted User

    Deleted User

    Guest

    The only things I have to go on for specifics are general knowledge and ReSharper's "clean code" extension.

    While there might be places where the compiler recognises a possible fault and should tell you at compile time rather than at runtime, it's not exactly something a system should be weighed down with. The only thing you could wait for is an actual presentation or document that explains why something is not optimal. Until then, sticking with standard rules helps your code stay efficient, even if it's not 1 MHz more efficient.
    From what I remember hearing, the Burst compiler is optimised for specific types (such as the new Unity.Mathematics).

    Of those, it looks like the 1st option would give the more optimised assembly, going by an imagined instruction count rather than properly writing it out. Of course that comes down to the machine as well, since JMP instructions can be expensive (where you are returning).
    Then there's accessing references, and accessing values when passing objects through other object references. Doing that in a return statement is pretty big. Additionally <= will create two instructions... Greater... and Equal, where the result needs to be greater AND equal.
     
  13. sngdan

    Thank you for your reply. The last time I did something in assembler was on the C64. The world was simple there, because the hardware (CPU) was always the same and you could go by instruction cycles (edit: no SIMD, single core, etc.). There was no difference in CPU cycles between < and >= branching (BCC / BCS), and I doubt there is today. (Edit: and even if there was, this would be such a simple optimization that I would expect the compiler to take care of it.)

    Since we are not writing the assembly directly but have Burst do it for us, my question is whether the Burst inspector could isolate the "execute" loop and either tell us a "measure" (like CPU cycles) for all branches, or simply for the executed one, leaving us to write a test case for each branch ourselves.

    I know it is micro-optimization, and maybe wishful thinking given the different architectures we can compile to. But it feels like the Burst inspector is so close already...

    Edit: Another viewpoint would be to say: use Mathematics and the SIMD-friendly floatX types and let Burst do its magic, or learn assembly.
     
    Last edited: Jan 25, 2019
  14. M_R

    M_R

    Joined:
    Apr 15, 2015
    Posts:
    559
    @sngdan you can inspect the Burst-compiled assembly in the Burst window and count the lines yourself (or write a script that does it; the window source is available in the Burst package (BurstInspectorGUI), and maybe there is an API to get the parsed assembly)
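
    A rough sketch of such a counting script. It assumes the assembly text has been copied out of the inspector by hand, since no API for fetching it is confirmed here. The class name is hypothetical, and the rules for what counts as a label, directive or comment are assumptions about typical x86 listings rather than anything Burst-specific. An instruction count is only a crude proxy for cost.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: tallies instruction mnemonics in an assembly listing.
public static class AsmCounter
{
    public static Dictionary<string, int> CountMnemonics(string asm)
    {
        var counts = new Dictionary<string, int>();
        foreach (var raw in asm.Split('\n'))
        {
            var line = raw.Trim();
            if (line.Length == 0) continue;
            // Assumed conventions: '.' starts a directive, '#' or ';' a comment.
            if (line[0] == '.' || line[0] == '#' || line[0] == ';') continue;
            // Assumed convention: a trailing ':' marks a label.
            if (line.EndsWith(":")) continue;

            // The mnemonic is everything up to the first space or tab.
            int end = line.IndexOfAny(new[] { ' ', '\t' });
            string mnemonic = end < 0 ? line : line.Substring(0, end);

            counts.TryGetValue(mnemonic, out int n);   // n stays 0 if the key is absent
            counts[mnemonic] = n + 1;
        }
        return counts;
    }
}
```

    Comparing the tallies for two variants shows, for example, whether one uses packed instructions where the other uses scalar ones, but it says nothing about latency or throughput.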
     
  15. sngdan

    @M_R yes, that is what I am doing now (diffnow), but I am afraid counting lines is not the answer.

    It seems that on my machine (Mac) "IsColliding2" outperforms the other variants with both IJob and IJobParallelFor, just measuring time.
     
  16. M_R

    just noticed you use math.any. That is an OR; you should use math.all instead
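
    A small sketch of why the choice matters for an overlap test. The Any and All helpers below are plain-C# stand-ins for the two reductions (true if any component is true, versus true only if all are), not the Unity.Mathematics functions themselves. The class and parameter names are hypothetical.

```csharp
using System;

// Hypothetical illustration of an "any" versus an "all" reduction.
public static class AnyVsAll
{
    // Stand-ins for reducing a two-component comparison result.
    static bool Any(bool x, bool y) => x || y;
    static bool All(bool x, bool y) => x && y;

    // dx, dy: absolute distance between the box centers on each axis.
    // ex, ey: summed half-extents of the two boxes on each axis.
    public static bool OverlapWithAny(float dx, float dy, float ex, float ey)
        => Any(dx <= ex, dy <= ey);   // wrong: overlap on one axis is enough

    public static bool OverlapWithAll(float dx, float dy, float ex, float ey)
        => All(dx <= ex, dy <= ey);   // right: the boxes must overlap on both axes
}
```

    For two boxes that line up on x but are far apart on y (dx = 0.5, dy = 5, ex = ey = 2), the "any" version reports a collision while the "all" version correctly reports none.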
     
  17. Deleted User

    @sngdan interesting, I thought the first one would have had fewer instructions, unless math.any expands into some mess. Might have to investigate; math.all will probably do the same but has escape clauses, since the left side of && could be false.
     
  18. sngdan

    @M_R - yes, math.any should have been math.all (I simplified the example for the forum and typed it wrong). My test above does use math.all though. I will edit my post above to fix this.
     
  19. sngdan

    So I ran the test again in a build (not the editor). The findings:
    - IsColliding1 (the math.all version) seems to have a significant performance penalty (+60%)
    - IsColliding2, 3 and 4 run at almost the same speed (measured with Stopwatch ticks)
    - IsColliding2 seems to be best with IJob and IsColliding4 best with IJobParallelFor (2-5% vs. 2nd place; not sure if this is really isolated to the jobs or influenced by other side effects)
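
    A minimal sketch of a Stopwatch-tick measurement of this kind. It is not the harness used for the numbers above. The class name, the warm-up and run counts, and the choice of taking the best of several runs are all assumptions. For a Burst job, the body passed in would schedule the job and wait for it to complete.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper based on System.Diagnostics.Stopwatch.
public static class TickTimer
{
    // Calls body `warmup` times untimed, then `runs` times timed,
    // and returns the lowest tick count seen across the timed runs.
    public static long BestTicks(Action body, int warmup = 3, int runs = 10)
    {
        for (int i = 0; i < warmup; i++) body();

        long best = long.MaxValue;
        var sw = new Stopwatch();
        for (int i = 0; i < runs; i++)
        {
            sw.Restart();
            body();
            sw.Stop();
            best = Math.Min(best, sw.ElapsedTicks);
        }
        return best;
    }
}
```

    Taking the best of several runs reduces noise from other work on the machine, but differences of a few percent between variants can still fall within measurement error.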
     
  20. pvloon

    pvloon

    Joined:
    Oct 5, 2011
    Posts:
    591
    Something akin to BenchmarkDotNet output for Burst jobs (or, really, any function) integrated in Unity would be a dream
     
  21. M_R

    there is a com.unity.test-framework.performance package
     
  22. xoofx

    xoofx

    Unity Technologies

    Joined:
    Nov 5, 2016
    Posts:
    417
    Indeed, we plan to provide exactly the kind of user-friendly optimization guidance you are looking for in the Burst inspector/compiler. This is work we definitely want to do in a future release.
     
  23. sngdan

    @xoofx great, good to know. This will be helpful for the final performance squeeze.

    @M_R Thanks. I tried to use the framework just now, but I got tons of namespace errors, etc., and did not want to experiment beyond the instructions, which did not work for me :(