Could Unity add in-editor tips and hints on how best to write Burst code?

Arowx · Dec 16, 2018

CPUs have batch-based SIMD instruction sets that can run multiple vector ops per cycle, and in theory the Burst compiler should be able to take advantage of these features.

The thing is, as a programmer I don't have any idea what Burst can or cannot do with the code I give it...

So if the super clever Burst compiler gave me, the programmer, some hints or tips based on how well my code compiled, or on how it could be improved to take advantage of SIMD instruction sets and other optimizations, wouldn't that massively improve what I can do with ECS?

Burst is a black-box technology to most of us. There will be a few clever programmers who dig into the code it generates and learn to tweak their code to get the best from it, but for the rest of us OK programmers, some context-relevant hints or tips in the editor as feedback would be ideal.

Also, as ECS is so complex, would some ECS tips for common misconceptions/mistakes also be useful for getting people up to speed?

At least until you bring out the Visual ECS programming where we just present the problem we want ECS to solve in graphical form and you do the rest...

5argon · Dec 16, 2018

I would like to see some examples of convertible bad ECS code patterns, and what tips would be given for each pattern.

All I can think of already exist in the form of library limitations that prevent you from going that way in the first place, like HPC# enforcement or aliasing prevention, which allow good assembly to be generated.

Then Burst can vectorize a NativeArray iteration loop automatically, which basically applies SIMD to the linear access for you. What other kind of automatic SIMD optimization would you like to have? For example, if you have 3 floats each being added to the same value, would you be given a tip to combine them into a float3 in the first place? (It would be quite bizarre if it told me to change my data structure design.)

Arowx · Dec 16, 2018

5argon said: ↑

I would like to see some examples of convertible bad ECS code patterns, and what tips would be given for each pattern.

All I can think of already exist in the form of library limitations that prevent you from going that way in the first place, like HPC# enforcement or aliasing prevention, which allow good assembly to be generated.

Then Burst can vectorize a NativeArray iteration loop automatically, which basically applies SIMD to the linear access for you. What other kind of automatic SIMD optimization would you like to have? For example, if you have 3 floats each being added to the same value, would you be given a tip to combine them into a float3 in the first place? (It would be quite bizarre if it told me to change my data structure design.)

Check out this thread https://forum.unity.com/threads/branch-misprediction-in-systems.594241/

Here a developer is looking into how to optimize their ECS system; they get some feedback and benchmark a few different approaches. I asked them to share their code and they did. They are getting the distance between two points and then normalizing the direction vector, so they are doing four calculations when they only need three (direction, magnitude, normalize by division).

This is in the inner loop of an ECS system, and their original code had function calls within this inner loop.
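
A minimal sketch of the redundancy described above. It is written against System.Numerics.Vector3 rather than Unity.Mathematics so that it runs outside Unity, and the class and method names are hypothetical; it is not the code from the linked thread.

```csharp
using System;
using System.Numerics;

public static class DirectionSketch
{
    // Redundant version: Vector3.Distance computes the length of (to - from),
    // and Vector3.Normalize then has to compute that same length again.
    public static (Vector3 dir, float dist) Redundant(Vector3 from, Vector3 to)
    {
        float dist = Vector3.Distance(from, to);
        Vector3 dir = Vector3.Normalize(to - from);
        return (dir, dist);
    }

    // Three steps: direction, magnitude, normalize by division.
    // Assumes from != to; a zero-length delta would divide by zero.
    public static (Vector3 dir, float dist) Shared(Vector3 from, Vector3 to)
    {
        Vector3 delta = to - from;    // direction (unnormalized)
        float dist = delta.Length();  // magnitude, computed once
        Vector3 dir = delta / dist;   // normalize by division
        return (dir, dist);
    }
}
```

Both methods return the same direction and distance; the second simply reuses the magnitude instead of recomputing it. Whether a compiler would already merge the two length computations is exactly the kind of thing a hint could tell you.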

What if ECS warned you that calling a function has a big overhead that can reduce the performance of your system, or that repeating the same calculations is less than optimal?

These are just good optimization tips and tricks that take time to learn, and more time to test and profile to make sure they actually improve the compiled code's performance. The developers of the Burst compiler probably know every trick and tip in the book; the trouble is that we don't, so we can end up giving the Burst compiler messy, slow code that it struggles to optimize.

If you are a good or great programmer then you won't need any hints or tips, as you will give the Burst compiler streamlined data and inline vectorised code that it can turn into great SIMD code. For the rest of us, a few tips and hints could mean the difference between thousands of cool effects and millions.

5argon · Dec 16, 2018

Ok, I have been to that thread. I think those optimizations are pretty specific to the problem? And we understand them because we are human and we know the overall context. How can a machine tell whether a function call (which could get an unknown amount of variable data to work on) is big or small before running, other than by preventing function calls from happening at all, which is very drastic? Then we couldn't use properties either, and indexers are properties and properties are functions.

Plus Burst already inlines things, so function calls not being excluded from the HPC# subset seems to be intended behaviour that is still performant by default.

How do you determine the workload of code without actually running it? What definition of "inner/tight loop" can be detected from assemblies without a human looking at it?

And if it were possible to develop an algorithm that could tell, for any algorithm, which repeated calculations (those not made up of constant values) are a wasteful mistake rather than intended behaviour, that alone could be a big thesis paper. Imagine throwing that at everything to optimize it... Even within the HPC# constraints we can still get outside data from native containers, so it remains uncertain until runtime whether a calculation is wasted or not.

I am interested to know some of the "every trick and tip in the book" items that are detectable by pattern/machine, so I can at least avoid them manually first. I believe this area still requires human support. For a machine to be able to do it, another possibility might be to train a neural network on bad patterns until it magically finds one for us, but a NN is good for results and not good for explaining why.

sngdan · Dec 17, 2018

I would be happy if the Burst inspector could print the CPU cycles for the generated assembly lines

Arowx · Dec 17, 2018

sngdan said: ↑

I would be happy if the Burst inspector could print the CPU cycles for the generated assembly lines

Yes, this kind of feedback summary, together with the hardware SIMD options/instructions that Burst could use if the program took a different approach or used a Burst optimization.

sngdan · Dec 18, 2018

I am expecting that the compiler does all the optimizations for me without much feedback. There will be of course a limit to this (I am not expecting magic).

If we could see the CPU cycles (intelligently summarized for branches / code paths) we could, as an alternative to profiling, assess the effect of changes to our C# code on the final assembly (if we don't want to, or can't, go there directly).

5argon · Dec 18, 2018

Since C# jobs are self-contained to some degree, I think Unity-specific compiler optimization to allow rapid iteration on designing jobs is very possible. Like a super-incremental compiler which knows not just the assembly level but the job/struct level. Imagine editing job code and the assembly updates instantly, just like you can hot-edit a shader file and see the changes instantly. Dock a small panel locked to that job struct, set a reference compile and see before/after CPU cycles. That would be a dream to work with. Currently I have to take screenshots and compare. A number of cycles would look much more objective.

sngdan · Dec 18, 2018

Exactly this.

sngdan · Jan 11, 2019

@Joachim_Ante

Is adding CPU cycles to the assembly something you could consider? I found myself today again in a situation where, like @5argon, I have to copy code into a text editor and go line by line....

sngdan · Jan 25, 2019

Here is another example where it would be nice to get some form of simple guidance on which one translates to the most optimized assembly...

Code (CSharp):

// Box is assumed to hold two float2 fields: Center and Extends.

// Variant 1: vector compare, reduced with math.all.
static bool IsColliding1 (Box b1, Box b2) {
    float2 d = math.abs(b1.Center - b2.Center);
    float2 e = b1.Extends + b2.Extends;
    return (math.all(d <= e));
}

// Variant 2: vector math, then a per-component compare.
static bool IsColliding2 (Box b1, Box b2) {
    float2 d = math.abs(b1.Center - b2.Center);
    float2 e = b1.Extends + b2.Extends;
    return ((d.x <= e.x) && (d.y <= e.y));
}

// Variant 3: fully scalar, both axes evaluated before combining.
static bool IsColliding3 (Box b1, Box b2) {
    bool x = math.abs(b1.Center.x - b2.Center.x) <= (b1.Extends.x + b2.Extends.x);
    bool y = math.abs(b1.Center.y - b2.Center.y) <= (b1.Extends.y + b2.Extends.y);
    return x && y;
}

// Variant 4: fully scalar, with an early return per axis.
static bool IsColliding4 (Box b1, Box b2) {
    if (math.abs(b1.Center.x - b2.Center.x) > (b1.Extends.x + b2.Extends.x)) return false;
    if (math.abs(b1.Center.y - b2.Center.y) > (b1.Extends.y + b2.Extends.y)) return false;
    return true;
}

edit: corrected a typo - thx to @M_R (math.any to math.all)

Deleted User · Jan 25, 2019

The only thing I have to go on for specifics is general knowledge and ReSharper's "clean code" extension.

While there might be places where the compiler could recognise a possible fault and tell you at compile time rather than at runtime, it's not exactly something a system should be weighed down with. The only thing you could wait for is an actual presentation or document that explains why something is not optimal. Until then, sticking with standard rules and laws helps your code stay efficient, even if it's not 1 MHz more efficient.
From what I remember hearing, the Burst compiler is optimised for specific types (such as the new Unity.Mathematics).

Out of those, it looks like the 1st option would be the most optimised assembly, just trying to imagine the instruction count without properly writing it out. Of course that comes down to the machine speed as well, since JMP instructions can be expensive (where you are returning).
Then there's accessing references, and accessing values when passing objects through other object references. Doing that in a return statement is pretty big. Additionally <= will create two instructions... Greater... and Equal, where the result needs to be greater AND equal.

sngdan · Jan 25, 2019

Thank you for your reply - the last time I did something in assembler was on the C64. The world was simple there, because the hardware (CPU) was the same and you could go by instruction cycles (edit: no SIMD, single core, etc.). There was no difference in CPU cycles for < or >= branching (bcc / bcs) and I doubt there is today. (edit: and even if there was, this would be such a simple optimization that I would expect the compiler to take care of it)

Since we are not writing the assembly directly but have Burst do it for us, my question is whether there is a way the Burst inspector could isolate the "execute" loop and either tell us a "measure" (like CPU cycles) for all branches, or simply for the executed one, and we have to write a test case for the branches ourselves.

I know it is micro optimization and maybe wishful thinking given the different architectures we can compile to. But it feels the burst inspector is so close already...

Edit: A viewpoint could also be to say, use mathematics and floatX SIMD friendly instructions and then let Burst do its magic - or learn assembly.

M_R · Jan 25, 2019

@sngdan you can inspect the Burst-compiled assembly in the Burst window and count the lines yourself (or write a script that does it; the window source is available in the Burst package (BurstInspectorGUI), and maybe there is an API to get the parsed assembly)

sngdan · Jan 25, 2019

@M_R yes, that is what I am doing now (diffnow) - but I am afraid counting lines is not the answer.

It seems that on my machine (Mac) "IsColliding2" outperforms the other variants on both IJob & IJobParallelFor - just measuring time

M_R · Jan 25, 2019

just noticed you use math.any. that is an OR, you should use math.all instead
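
A small sketch of why the difference matters for the box test. The Any2 and All2 helpers are hypothetical stand-ins for math.any and math.all on a two-component bool vector, written in plain C# so the example runs without Unity; the parameter names are illustrative only.

```csharp
using System;

public static class AnyAllSketch
{
    // Hypothetical stand-ins for math.any / math.all on two components.
    public static bool Any2(bool x, bool y) => x || y;  // OR: at least one is true
    public static bool All2(bool x, bool y) => x && y;  // AND: both are true

    // 2D box overlap: dx, dy are the centre deltas, ex, ey the summed extents.
    // Two boxes overlap only when they overlap on BOTH axes.
    public static bool Overlaps(float dx, float dy, float ex, float ey, bool useAny)
    {
        bool xOk = Math.Abs(dx) <= ex;
        bool yOk = Math.Abs(dy) <= ey;
        return useAny ? Any2(xOk, yOk) : All2(xOk, yOk);
    }
}
```

For two boxes that line up on x but are far apart on y (dx = 0, dy = 10, ex = ey = 1), the OR version reports a collision and the AND version does not.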

Deleted User · Jan 25, 2019

@sngdan interesting, I thought the first one would have had fewer instructions, unless math.any expands into some mess. Might have to investigate; math.all will probably do the same but has escape clauses since the left side of && could be false.

sngdan · Jan 25, 2019

@M_R - yes, math.any should have been math.all (I simplified the example for the forum and typed it wrong) - my test above works with math.all though ---- I will edit my post above to fix this

sngdan · Jan 26, 2019

So I ran the test again in a build (not the editor). The findings are:
- IsColliding1 (the math.all version) seems to have a significant performance penalty (+60%)
- IsColliding2, 3 and 4 run at almost the same speed (measured with stopwatch ticks)
- IsColliding2 seems to be best with IJob and IsColliding4 best with IJobParallelFor (2-5% vs. 2nd place; not sure if this is really isolated to the jobs or influenced by other side effects)
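
For reference, a minimal sketch of a stopwatch-tick measurement like the one described above, in plain C# using System.Diagnostics.Stopwatch. The TimingSketch class and its method names are hypothetical, and this is not the harness that produced the numbers in this post; in Unity, the measured body would presumably schedule and complete the job.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs `body` `iterations` times after a warm-up pass and
    // returns the elapsed Stopwatch ticks for the timed part only.
    public static long MeasureTicks(Action body, int iterations, int warmup = 1000)
    {
        for (int i = 0; i < warmup; i++) body();  // warm-up, not timed

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) body();
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Repeats the measurement and keeps the smallest result,
    // which tends to be less affected by unrelated background work.
    public static long BestOf(Action body, int iterations, int repeats)
    {
        long best = long.MaxValue;
        for (int r = 0; r < repeats; r++)
            best = Math.Min(best, MeasureTicks(body, iterations));
        return best;
    }
}
```

Differences of only a few percent between variants may well sit inside run-to-run noise, so repeating the measurement several times before ranking them seems prudent.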

pvloon · Jan 26, 2019

Something akin to BenchmarkDotNet output for Burst jobs (or, really, any function) integrated in Unity would be a dream

M_R · Jan 28, 2019

there is a com.unity.test-framework.performance package

xoofx · Jan 28, 2019

Arowx said: ↑

So if the super clever Burst compiler gave me, the programmer, some hints or tips based on how well my code compiled, or on how it could be improved to take advantage of SIMD instruction sets and other optimizations, wouldn't that massively improve what I can do with ECS?

Indeed, we plan to provide exactly the kind of user-friendly optimization guidance you are looking for in the Burst inspector/compiler. This is work we definitely want to do in a future release.

sngdan · Jan 28, 2019

@xoofx great - good to know - this will be helpful for the final performance squeeze

@M_R Thx - I tried to use the framework just now, but I got tons of namespace errors, etc... and did not want to experiment beyond the instructions, which did not work for me
