Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

Discussion in 'Burst' started by nxrighthere, Jul 23, 2019.

Thread Status: Not open for further replies.

Page 1 of 4

nxrighthere

Joined:

Mar 2, 2014

Posts:

567

I was curious how well Burst/IL2CPP optimizes C# code compared with GCC/Clang compiling C, so I ported five famous benchmarks, plus a raytracer, a minified flocking simulation, particle kinematics, a stream cipher, and radix sort, with different workloads, and made them identical between the two languages. The C code was compiled with aggressive optimizations using the -DNDEBUG -Ofast -march=native -flto compiler options. Benchmarks were done on Windows 10 w/ Ryzen 5 1400 using a standalone build. Mono JIT and RyuJIT are included for fun.

Source code and benchmark results are available on GitHub.

These wonderful people make open-source better:

This project is sponsored by JetBrains.

Last edited: Mar 17, 2020

nxrighthere, Jul 23, 2019

#1

bit-master, Krajca, PutridEx and 17 others like this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

nxrighthere said: ↑

Burst 1.1.1
GCC 8.1.0

(Burst) Fibonacci: 95,657,085 ticks
(GCC) Fibonacci: 103,578,985 ticks
(Mono JIT) Fibonacci: 195,152,736 ticks

Burst is slightly faster in the recursive Fibonacci.

(Burst) Mandelbrot: 65,528,410 ticks
(GCC) Mandelbrot: 28,788,322 ticks
(Mono JIT) Mandelbrot: 116,309,579 ticks

GCC is significantly leading in the fractal Mandelbrot.

(Burst) NBody: 122,715,369 ticks
(GCC) NBody: 203,429,182 ticks
(Mono JIT) NBody: 330,834,294 ticks

Burst is significantly leading in the NBody simulation.

The numbers are just easier to read with thousands separators.

Arowx, Jul 23, 2019

#2

tigerleapgorge likes this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Ah, thanks. I was a bit too lazy to format the numbers manually.

nxrighthere, Jul 23, 2019

#3
Arowx

Joined:

Nov 12, 2009

Posts:

8,194
Code (CSharp):

private struct MandelbrotBurst : IJob {

public uint width;

public uint height;

public uint iterations;

public float result;

public void Execute() {

result = Mandelbrot(width, height, iterations);

}

private float Mandelbrot(uint width, uint height, uint iterations) {

float data = 0.0f;

for (int i = 0; i < iterations; i++) { // ideally you should have for loops broken up into jobs

float // should declare variables outside of loops (*1)

left = -2.1f,

right = 1.0f,

top = -1.3f,

bottom = 1.3f,

deltaX = (right - left) / width, //invert this e.g. invWidth = 1f / width; outside of the loop and multiply.

deltaY = (bottom - top) / height, //ditto for height.

coordinateX = left;

for (int x = 0; x < width; x++) { // ideally you should have for loops broken up into jobs

float coordinateY = top; // (*1)

for (int y = 0; y < height; y++) { // ideally you should have for loops broken up into jobs

float workX = 0; // (*1)

float workY = 0;

int counter = 0;

// should use Unity.Mathematics math.sqrt, not Math.Sqrt

while (counter < 255 && Math.Sqrt((workX * workX) + (workY * workY)) < 2.0f) {

counter++;

// recalculating workx * workx and worky multiple times in the loop and conditional test.

float newX = (workX * workX) - (workY * workY) + coordinateX; //(*1)

workY = 2 * workX * workY + coordinateY;

workX = newX;

}

data = workX + workY;

coordinateY += deltaY;

}

coordinateX += deltaX;

}

}

return data;

}

}

Just some thoughts on how you can optimise this code.

Note you really should break down the for loops into jobs for maximum throughput.
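As a rough illustration of the "reuse the squares" suggestion, here is a minimal C sketch of the per-point escape loop. The function name escape_count and its parameters are made up for this example and are not taken from the benchmark source. It also swaps the sqrt test for a comparison against 4, which is a common rewrite and not something the benchmark itself does.

```c
#include <assert.h>

/* Number of iterations before the point (cx, cy) escapes |z| >= 2,
   capped at max_iter. Hypothetical helper, not from the benchmark. */
int escape_count(float cx, float cy, int max_iter) {
    assert(max_iter >= 0);
    float x = 0.0f, y = 0.0f;
    float x2 = 0.0f, y2 = 0.0f; /* cached squares, shared by test and update */
    int counter = 0;
    /* sqrt(x2 + y2) < 2 holds exactly when x2 + y2 < 4 */
    while (counter < max_iter && x2 + y2 < 4.0f) {
        counter++;
        y = 2.0f * x * y + cy; /* uses the old x and y */
        x = x2 - y2 + cx;      /* uses the old squares */
        x2 = x * x;
        y2 = y * y;
    }
    return counter;
}
```

Each pass then does two multiplications for the squares instead of four, and no square root. Whether that matters after the compiler has done its own common-subexpression elimination is something to measure, not assume.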
Arowx, Jul 23, 2019

#4

tigerleapgorge likes this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Yes, such changes will have an impact. My main goal was to keep the code the same as the C version, using only plain methods and loops with standard math, to see how well the compiler itself can optimize it for me without spending additional energy. Burst is optimized for tight loops in general. @xoofx has already tried a Mandelbrot test as far as I know; I would like to hear his thoughts on this, by the way.

nxrighthere, Jul 23, 2019

#5
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Just checked the source code of Unity.Mathematics and math.sqrt() is the same standard System.Math.Sqrt().

nxrighthere, Jul 23, 2019

#6
elcionap

Joined:

Jan 11, 2016

Posts:

138
nxrighthere said: ↑

Just checked the source code of Unity.Mathematics and math.sqrt() is the same standard System.Math.Sqrt().

Just to add more information:

Intrinsics
System.Math
Burst provides an intrinsic for all methods declared by System.Math except for the following methods that are not supported:

double IEEERemainder(double x, double y)

Round(double value, int digits)


https://docs.unity3d.com/Packages/com.unity.burst@1.1/manual/index.html

TBH, I think that even if it weren't implemented as intrinsics, it would be great to show how Burst deals with already existing code/libs.

Thanks for benchmarking it.

[]'s
elcionap, Jul 23, 2019

#7

nxrighthere likes this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Thanks, good to know this.

After inspecting the assembly code produced by GCC, I found that to make it faster than Burst in the Fibonacci test, we can use -fno-asynchronous-unwind-tables to suppress the generation of static unwind tables for exception handling. That brings 103,578,985 ticks down to 84,983,484, while the other tests remain unaffected. I will add this as a note.
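For reference, a minimal sketch of what a doubly recursive Fibonacci looks like in C. The name fibonacci and the 32-bit types are assumptions made for this example; the actual benchmark source is on GitHub and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Naive doubly recursive Fibonacci. It is exponential on purpose, so the
   run time is dominated by call overhead and branching. */
uint32_t fibonacci(uint32_t n) {
    assert(n <= 47); /* fib(47) is the largest value that fits in 32 bits */
    if (n <= 1)
        return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}
```

Since nearly all the time goes into calls and returns, per-function overhead such as unwind information is a plausible place for a flag like -fno-asynchronous-unwind-tables to show up. The numbers above are the only evidence for that here.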

Last edited: Jul 23, 2019

nxrighthere, Jul 23, 2019

#8
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

Can Burst optimise Sqrt?

Arowx, Jul 23, 2019

#9
Arowx

Joined:

Nov 12, 2009

Posts:

8,194
A -> (Burst) Mandelbrot: 33,005,292 ticks
B -> (Burst) Mandelbrot: 6,495,352 ticks

Version B

Code (CSharp):

private float Mandelbrot(uint width, uint height, uint iterations)

{

float data = 0.0f;

float

left = -2.1f,

right = 1.0f,

top = -1.3f,

bottom = 1.3f,

deltaX = (right - left) / width,

deltaY = (bottom - top) / height,

coordinateX = left; // hoisted above the iterations loop, so unlike the original it is not reset to left on each pass

float coordinateY; // moved variables outside of loops

float workX = 0;

float workY = 0;

int counter = 0;

int x;

int y;

float newX;

float workX2 = 0; // added variable for square values that are reused.

float workY2 = 0;

for (int i = 0; i < iterations; i++)

{

for (x = 0; x < width; x++)

{

coordinateY = top;

for (y = 0; y < height; y++)

{

workX = 0;

workY = 0;

counter = 0;

workX2 = 0;

workY2 = 0;

while (counter < 255 && Math.Sqrt((workX2 + workY2)) < 2.0f)

{

counter++;

newX = (workX2) - (workY2) + coordinateX;

workY = 2 * workX * workY + coordinateY;

workX = newX;

workX2 = workX * workX;

workY2 = workY * workY;

}

data = workX + workY;

coordinateY += deltaY;

}

coordinateX += deltaX;

}

}

return data;

}

}
Arowx, Jul 23, 2019

#10

webik150 and tigerleapgorge like this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

Actually I wonder if the Mandelbrot and NBody could gain from working with float2/double3's as it might give the Burst compiler more SIMD options than just using basic variables.

Last edited: Jul 24, 2019

Arowx, Jul 24, 2019

#11
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Arowx said: ↑

Can Burst optimise Sqrt?

Since it's implemented as an intrinsic, Burst uses the compiler's built-in function for square root.

Arowx said: ↑

A -> (Burst) Mandelbrot: 33,005,292 ticks
B -> (Burst) Mandelbrot: 6,495,352 ticks

Interesting, so nested stack allocations make it significantly slower. I'm going to check how it works on my machine and compare it to GCC.

Arowx said: ↑

Actually I wonder if the Mandelbrot and NBody could gain from working with float2/double3's as it might give the Burst compiler more SIMD options than just using basic variables.

Absolutely, this will have an impact as well.

nxrighthere, Jul 24, 2019

#12
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

Apparently recursive Fibonacci is the slowest way to do it; you might want to try the fastest Fibonacci algorithm: https://www.nayuki.io/page/fast-fibonacci-algorithms
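For comparison, a minimal C sketch of the fast doubling approach, which uses the identities F(2k) = F(k) * (2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2. The function name fast_fibonacci is made up for this example and the code is not taken from the linked page.

```c
#include <assert.h>
#include <stdint.h>

/* Fast doubling: walks the bits of n from most to least significant,
   keeping the pair (F(k), F(k+1)) and doubling k at every step. */
uint64_t fast_fibonacci(uint32_t n) {
    assert(n <= 93); /* F(93) is the largest Fibonacci number below 2^64 */
    uint64_t a = 0;  /* F(k)   */
    uint64_t b = 1;  /* F(k+1) */
    for (int bit = 31; bit >= 0; --bit) {
        uint64_t d = a * (2 * b - a); /* F(2k)   */
        uint64_t e = a * a + b * b;   /* F(2k+1) */
        if ((n >> bit) & 1u) {
            a = e;     /* k becomes 2k + 1 */
            b = d + e; /* F(2k+2); may wrap for n = 93, but is then unused */
        } else {
            a = d;     /* k becomes 2k */
            b = e;
        }
    }
    return a;
}
```

This takes a fixed 32 steps instead of an exponential number of calls, so it would measure something very different from the recursive version.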

Arowx, Jul 24, 2019

#13
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

Also, there is a SIMD-optimised version of the Mandelbrot set algorithm here -> https://martin-ueding.de/articles/mandelbrot-performance/index.html

The developer reports a doubling of performance between the C++ implementation and the SIMD one; in theory, Burst should match or beat the SIMD version.

Arowx, Jul 24, 2019

#14
Joachim_Ante

Unity Technologies

Joined:

Mar 16, 2005

Posts:

5,203

Doing fair benchmarks is hard. The first thing to be clear about is what the benchmark is trying to measure.

1. Taking completely unoptimised scalar code and seeing how well the auto-vectoriser and code optimisations work out.
2. You want to write general-purpose, cross-platform, performant code, and are willing to spend the time to force the compiler to generate good code. As much as possible you are using explicit float4 SOA vectorised code. You want control.
3. You are a SIMD performance expert, you understand the target platform and its assembly instructions, and you want to manually lay out the exact instructions you want to run.

#1 Burst is competitive against C. There is still lots we can and will do. As you found, there are benchmarks that are slower and ones that are faster. Generally it's not a very predictable model, and it is very hard to find out why something is slow or fast. This is also the domain that most optimisation research in the C/C++ compiler space goes into. The biggest issue in this space for games is that a tiny change to the code can result in massive differences in performance, due to the non-predictable nature of this approach.

#2 is what Burst is best at. That's what we currently focus on. To a large extent this comes down to the usability of the math library: making it easy to write SOA SIMD code.

#3 is something for which we want to add architecture-specific intrinsics to make it possible.

It seems like the benchmark is all about #1. So essentially it's a benchmark for code that is not written to be optimised. Don't get me wrong, most code out there is written exactly like that, so there is value in a compiler making it as fast as it can. But if you care about performance, that's not how you write code. So a benchmark purely focused on #1 doesn't seem right.
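To make the "SOA" in #2 concrete, here is a minimal C sketch contrasting an array-of-structures layout with a structure-of-arrays layout for the same computation. All type and function names are hypothetical, and it uses plain C arrays instead of the float4 types from the Burst math library.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures (AoS): the fields of one point sit next to each other. */
struct PointAoS { float x, y, z; };

/* Structure of arrays (SoA): each field is its own contiguous array. */
struct PointsSoA { const float *x; const float *y; const float *z; };

/* Squared lengths from AoS data: each iteration strides over whole structs. */
void length_sq_aos(const struct PointAoS *points, size_t n, float *out) {
    assert(points != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = points[i].x * points[i].x
               + points[i].y * points[i].y
               + points[i].z * points[i].z;
}

/* Squared lengths from SoA data: unit-stride loads from three arrays. */
void length_sq_soa(struct PointsSoA points, size_t n, float *out) {
    assert(points.x != NULL && points.y != NULL && points.z != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = points.x[i] * points.x[i]
               + points.y[i] * points.y[i]
               + points.z[i] * points.z[i];
}
```

Both functions produce the same values. In the SoA version, four consecutive x values can be loaded into one SIMD register without shuffling, which is the usual argument for that layout.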

Last edited: Jul 24, 2019

Joachim_Ante, Jul 24, 2019

#15

Velctor, SuperRaffles, l33t_P4j33t and 1 other person like this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

@Arowx I intentionally used expensive algorithms for the tests, similar to how compiler engineers do. The goal is to get the bare numbers that a compiler can give you without spending human energy on optimizations.

@Joachim_Ante Yes, I absolutely agree. Burst itself is not a general-purpose compiler; at the tip of the iceberg it's essentially a transpiler designed for specific use cases. I'll add more benchmarks covering various cases very soon, just as experiments. Thanks.

nxrighthere, Jul 24, 2019

#16
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Added Sieve of Eratosthenes, GCC is 1% faster than Burst:

(Burst) Sieve of Eratosthenes: 43,449,732 ticks
(GCC) Sieve of Eratosthenes: 42,965,656 ticks
(Mono JIT) Sieve of Eratosthenes: 55,741,659 ticks

nxrighthere, Jul 24, 2019

#17
slime73

Joined:

May 14, 2017

Posts:

107

IL2CPP versions of those benchmarks might be interesting. I guess you'd probably just need to prevent burst compilation of the job code, and build an il2cpp version of the player.

slime73, Jul 24, 2019

#18
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Yes, I was thinking about this. I'm going to provide IL2CPP results very soon after adding tiny Pixar Raytracer.

nxrighthere, Jul 24, 2019

#19
Arowx

Joined:

Nov 12, 2009

Posts:

8,194
Attempt at vectorising the NBody simulation...

Code (CSharp):

// NBody

private struct NBody

{

public double3 xyz, vxyz;

public double mass;

}

[BurstCompile]

private unsafe struct NBodyBurst : IJob

{

public uint advancements;

public double result;

public void Execute()

{

result = NBody(advancements);

}

private double NBody(uint advancements)

{

NBody* sun = stackalloc NBody[5];

NBody* end = sun + 4;

InitializeBodies(sun, end);

Energy(sun, end);

while (advancements-- > 0)

{

Advance(sun, end, 0.01d);

}

Energy(sun, end);

return sun[0].xyz.x + sun[0].xyz.y;

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]

private void InitializeBodies(NBody* sun, NBody* end)

{

const double pi = 3.141592653589793;

const double solarmass = 4 * pi * pi;

const double daysPerYear = 365.24;

unchecked

{

sun[1] = new NBody

{ // Jupiter

xyz = new double3( 4.84143144246472090e+00,

-1.16032004402742839e+00,

-1.03622044471123109e-01 ),

vxyz = new double3( 1.66007664274403694e-03 * daysPerYear,

7.69901118419740425e-03 * daysPerYear,

-6.90460016972063023e-05 * daysPerYear),

mass = 9.54791938424326609e-04 * solarmass

};

sun[2] = new NBody

{ // Saturn

xyz = new double3(8.34336671824457987e+00,

4.12479856412430479e+00,

-4.03523417114321381e-01),

vxyz = new double3(

-2.76742510726862411e-03 * daysPerYear,

4.99852801234917238e-03 * daysPerYear,

2.30417297573763929e-05 * daysPerYear),

mass = 2.85885980666130812e-04 * solarmass

};

sun[3] = new NBody

{ // Uranus

xyz = new double3(1.28943695621391310e+01,

-1.51111514016986312e+01,

-2.23307578892655734e-01),

vxyz = new double3(2.96460137564761618e-03 * daysPerYear,

2.37847173959480950e-03 * daysPerYear,

-2.96589568540237556e-05 * daysPerYear),

mass = 4.36624404335156298e-05 * solarmass

};

sun[4] = new NBody

{ // Neptune

xyz = new double3(1.53796971148509165e+01,

-2.59193146099879641e+01,

1.79258772950371181e-01),

vxyz = new double3(2.68067772490389322e-03 * daysPerYear,

1.62824170038242295e-03 * daysPerYear,

-9.51592254519715870e-05 * daysPerYear),

mass = 5.15138902046611451e-05 * solarmass

};

double3 v = new double3();

for (NBody* planet = sun + 1; planet <= end; ++planet)

{

double mass = planet->mass;

v += planet->vxyz * mass;

}

sun->mass = solarmass;

sun->vxyz = v / -solarmass;

}

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]

private void Energy(NBody* sun, NBody* end)

{

unchecked

{

double e = 0.0;

double imass;

double3 ixyz, ivxyz;

NBody* bj;

NBody* bi;

double jmass;

double3 dxyz;

for (bi = sun; bi <= end; ++bi)

{

imass = bi->mass;

ixyz = bi->xyz;

ivxyz = bi->vxyz;

e += 0.5 * imass * (math.length(ivxyz) * math.length(ivxyz));

for (bj = bi + 1; bj <= end; ++bj)

{

jmass = bj->mass;

dxyz = ixyz - bj->xyz;

e -= imass * jmass / Math.Sqrt(math.dot(dxyz,dxyz));

}

}

}

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]

private double GetD2(double dx, double dy, double dz)

{

double d2 = dx * dx + dy * dy + dz * dz;

return d2 * Math.Sqrt(d2);

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]

private void Advance(NBody* sun, NBody* end, double distance)

{

unchecked

{

double3 ixyz, ivxyz, dxyz;

double jmass, imass, mag;

NBody* bi, bj;

for (bi = sun; bi < end; ++bi)

{

ixyz = bi->xyz;

ivxyz = bi->vxyz;

imass = bi->mass;

for (bj = bi + 1; bj <= end; ++bj)

{

dxyz = bj->xyz - ixyz;

jmass = bj->mass;

mag = distance / math.lengthsq(dxyz); // divides by d^2; GetD2 above returns d^2 * sqrt(d^2) but is never called

bj->vxyz = bj->vxyz - dxyz * imass * mag;

ivxyz = ivxyz + dxyz * jmass * mag;

}

bi->vxyz = ivxyz;

bi->xyz = ixyz + ivxyz * distance;

}

end->xyz = end->xyz + end->vxyz * distance;

}

}

}

My theory is that the use of double3 should allow Burst to vectorise more operations?
Arowx, Jul 24, 2019

#20
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

It's not guaranteed; vectorization may fail in various circumstances. Usually, a programmer has to follow particular criteria to achieve auto-vectorization within the capabilities of a compiler (regarding nested function calls, data dependencies, and conditional statements), and that goes beyond a simple replacement of the data type. Burst has its own rules for this as well, and it should give a verbose indication of failures.
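A small C sketch of the data-dependency criterion, with hypothetical function names: the first loop has independent iterations and is the kind compilers usually vectorize, while the second carries a value from one iteration to the next.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations, and restrict promises that dst and src do not
   overlap: a typical candidate for auto-vectorization. */
void scale_add(float *restrict dst, const float *restrict src, float k, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}

/* Each iteration reads the value written by the previous one. This
   loop-carried dependency usually blocks straightforward vectorization. */
void prefix_sum(float *data, size_t n) {
    assert(data != NULL);
    for (size_t i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```

With GCC, -fopt-info-vec-missed reports loops that were not vectorized; with Clang, -Rpass-missed=loop-vectorize does the same. What each compiler reports for these two loops depends on version and flags, so treat it as something to check, not a given.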

nxrighthere, Jul 24, 2019

#21
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

(Burst) NBodyV: 69,718,514 ticks -- double3 and moving variables out of loops.
(Burst) NBody: 84,495,822 ticks
(Mono JIT) NBody: 457,873,937 ticks

Arowx, Jul 24, 2019

#22
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Can you post numbers without moving variables out of loops?

nxrighthere, Jul 24, 2019

#23
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

nxrighthere said: ↑

Can you post numbers without moving variables out of loops?

I'm only moving the variable declarations, it's a basic optimisation all programmers should learn and adopt.

Arowx, Jul 24, 2019

#24

MadeFromPolygons likes this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Sure, so what about numbers?

nxrighthere, Jul 24, 2019

#25

JesOb likes this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Added Pixar Raytracer:

(Burst) Pixar Raytracer: 245,079,625 ticks
(GCC) Pixar Raytracer: 71,429,484 ticks
(Mono JIT) Pixar Raytracer: 1,388,610,887 ticks

nxrighthere, Jul 25, 2019

#26
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

nxrighthere said: ↑

Sure, so what about numbers?

(Burst) NBodyV: 69,718,514 ticks -- double3 and moving variables out of loops.
(Burst) NBody: 84,495,822 ticks
(Mono JIT) NBody: 457,873,937 ticks

Arowx, Jul 25, 2019

#27
Wokky

Joined:

Apr 3, 2014

Posts:

11
nxrighthere said: ↑

Can you post numbers without moving variables out of loops?

I was very skeptical about this, so I put together a simple comparison based on the Mandelbrot benchmark.

Here's a version that moves unnecessary work out of the loop but keeps variable declarations within their relevant scopes:

Code (CSharp):

float MandelbrotDeclareWithinLoops(uint width, uint height, uint iterations) {

float data = 0.0f;

const float LEFT = -2.1f;

const float RIGHT = 1.0f;

const float TOP = -1.3f;

const float BOTTOM = 1.3f;

float deltaX = (RIGHT - LEFT) / width;

float deltaY = (BOTTOM - TOP) / height;

for (int i = 0; i < iterations; i++) {

// Declaring within loops

float coordinateX = LEFT;

for (int x = 0; x < width; x++) {

float coordinateY = TOP;

for (int y = 0; y < height; y++) {

float workX = 0;

float workY = 0;

float counter = 0;

while (counter < 255 && math.sqrt((workX * workX) + (workY * workY)) < 2.0f) {

counter++;

float workXSquared = workX * workX;

workX = workXSquared - (workY * workY) + coordinateX;

workY = 2 * workXSquared + coordinateY;

}

data = workX + workY;

coordinateY += deltaY;

}

coordinateX += deltaX;

}

}

return data;

}

And here's a version where all variables are hoisted out of the loop:

Code (CSharp):

float MandelbrotHoistedDeclarations(uint width, uint height, uint iterations) {

float data = 0.0f;

const float LEFT = -2.1f;

const float RIGHT = 1.0f;

const float TOP = -1.3f;

const float BOTTOM = 1.3f;

float deltaX = (RIGHT - LEFT) / width;

float deltaY = (BOTTOM - TOP) / height;

// Variable declarations hoisted out of loop

float coordinateX;

float coordinateY;

float workX;

float workY;

float counter;

float workXSquared;

for (int i = 0; i < iterations; i++) {

coordinateX = LEFT;

for (int x = 0; x < width; x++) {

coordinateY = TOP;

for (int y = 0; y < height; y++) {

workX = 0;

workY = 0;

counter = 0;

while (counter < 255 && math.sqrt((workX * workX) + (workY * workY)) < 2.0f) {

counter++;

workXSquared = workX * workX;

workX = workXSquared - (workY * workY) + coordinateX;

workY = 2 * workXSquared + coordinateY;

}

data = workX + workY;

coordinateY += deltaY;

}

coordinateX += deltaX;

}

}

return data;

}

Burst produces functionally identical assembly output for both versions, which is what I would expect from the majority of compilers in use.
Wokky, Jul 25, 2019

#28

RunninglVlan, neonblitzer and nxrighthere like this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

Wokky said: ↑

Burst produces functionally identical assembly output for both versions, which is what I would expect from the majority of compilers in use.

Could you please post the assembly outputs and benchmark the two for comparison?

Last edited: Jul 25, 2019

Arowx, Jul 25, 2019

#29
Wokky

Joined:

Apr 3, 2014

Posts:

11
On an additional note, shouldn't you be using
CompileSynchronously = true
with these benchmarks?
Wokky, Jul 25, 2019

#30

nxrighthere likes this.
Wokky

Joined:

Apr 3, 2014

Posts:

11

Arowx said: ↑

Could you post please and benchmark the two for comparison?

It's a lot of text to embed in a forum post, but if you don't believe me it's pretty easy to copy it into a project, open the Burst Inspector and see for yourself.

Wokky, Jul 25, 2019

#31

eizenhorn likes this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

Wokky said: ↑

It's a lot of text to embed in a forum post, but if you don't believe me it's pretty easy to copy it into a project, open the Burst Inspector and see for yourself.

That's what the spoiler forum tags are for.

Arowx, Jul 25, 2019

#32

Nothke likes this.
Joachim_Ante

Unity Technologies

Joined:

Mar 16, 2005

Posts:

5,203
Wokky said: ↑

On an additional note, shouldn't you be using
CompileSynchronously = true
with these benchmarks?

CompileSynchronously = true needs to be there for sure. On top of that, there should be a warm-up iteration that does not get measured. There is a real cost to running something for the first time: we have to extract and cache job reflection data on the first run. CompileSynchronously = true plus one warm-up iteration fixes that, and it is what we use in our own benchmarks.

Why are you using doubles instead of floats in the NBody simulation? If you care about performance, double is used very rarely in games, and only in places that are not performance-sensitive.
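The warm-up idea is independent of Burst. A minimal C sketch of a timing helper that runs the workload once unmeasured before timing it; measure_seconds, benchmark_fn and count_call are names invented for this example, and clock() stands in for whatever timer the real benchmarks use.

```c
#include <assert.h>
#include <time.h>

typedef void (*benchmark_fn)(void *state);

/* Runs fn once without timing it, then returns the mean CPU seconds per
   call over `runs` timed calls. */
double measure_seconds(benchmark_fn fn, void *state, int runs) {
    assert(fn != NULL && runs > 0);
    fn(state); /* warm-up: first-run costs are paid here, outside the timing */
    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn(state);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / runs;
}

/* Sample workload: counts how often it was called. */
void count_call(void *state) {
    ++*(int *)state;
}
```

Calling measure_seconds(count_call, &calls, 5) invokes the workload six times in total, but only the last five contribute to the result.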
Last edited: Jul 25, 2019

Joachim_Ante, Jul 25, 2019

#33

MNNoxMortem, hippocoder, nxrighthere and 1 other person like this.
Wokky

Joined:

Apr 3, 2014

Posts:

11

.text
.intel_syntax noprefix
.file "main"
.section .rodata.cst4,"aM",@progbits,4
.p2align 2
.LCPI0_0:
.long 1078355558
.LCPI0_1:
.long 1076258406
.LCPI0_2:
.long 3221644902
.LCPI0_3:
.long 3215353446
.LCPI0_4:
.long 3204448256
.LCPI0_5:
.long 3225419776
.LCPI0_6:
.long 1073741824
.text
.globl "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB"
.p2align 4, 0x90
.type "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB",@function
"Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB":
push rsi
sub rsp, 176
movaps xmmword ptr [rsp + 160], xmm15
movaps xmmword ptr [rsp + 144], xmm14
movaps xmmword ptr [rsp + 128], xmm13
movaps xmmword ptr [rsp + 112], xmm12
movaps xmmword ptr [rsp + 96], xmm11
movaps xmmword ptr [rsp + 80], xmm10
movaps xmmword ptr [rsp + 64], xmm9
movaps xmmword ptr [rsp + 48], xmm8
movaps xmmword ptr [rsp + 32], xmm7
movaps xmmword ptr [rsp + 16], xmm6
mov r8d, dword ptr [rcx + 8]
test r8, r8
xorps xmm1, xmm1
je .LBB0_19
mov r10d, dword ptr [rcx]
test r10d, r10d
je .LBB0_2
mov eax, dword ptr [rcx + 4]
test eax, eax
je .LBB0_15
cvtsi2ss xmm0, r10
movabs rdx, offset .LCPI0_0
movss xmm1, dword ptr [rdx]
divss xmm1, xmm0
movss dword ptr [rsp + 12], xmm1
xorps xmm0, xmm0
cvtsi2ss xmm0, rax
movabs rdx, offset .LCPI0_1
movss xmm11, dword ptr [rdx]
divss xmm11, xmm0
xor r9d, r9d
movabs rdx, offset .LCPI0_2
movss xmm0, dword ptr [rdx]
movss dword ptr [rsp + 8], xmm0
movabs rdx, offset .LCPI0_3
movss xmm10, dword ptr [rdx]
movabs rdx, offset .LCPI0_4
movss xmm13, dword ptr [rdx]
movabs rdx, offset .LCPI0_5
movss xmm14, dword ptr [rdx]
xorps xmm12, xmm12
movabs rdx, offset .LCPI0_6
movss xmm15, dword ptr [rdx]
.p2align 4, 0x90
.LBB0_6:
movss xmm7, dword ptr [rsp + 8]
xor r11d, r11d
.p2align 4, 0x90
.LBB0_7:
xor edx, edx
movaps xmm4, xmm10
.p2align 4, 0x90
.LBB0_8:
xorps xmm1, xmm1
mov esi, -1
xorps xmm8, xmm8
.p2align 4, 0x90
.LBB0_9:
movaps xmm0, xmm1
mulss xmm0, xmm0
movaps xmm6, xmm8
mulss xmm6, xmm6
movaps xmm5, xmm6
addss xmm5, xmm0
xorps xmm3, xmm3
rsqrtss xmm3, xmm5
movaps xmm2, xmm5
mulss xmm2, xmm3
movaps xmm9, xmm2
mulss xmm9, xmm13
mulss xmm2, xmm3
addss xmm2, xmm14
mulss xmm2, xmm9
cmpeqss xmm5, xmm12
andnps xmm5, xmm2
ucomiss xmm5, xmm15
jae .LBB0_11
movaps xmm1, xmm7
subss xmm1, xmm6
addss xmm1, xmm0
addss xmm0, xmm0
addss xmm0, xmm4
inc esi
cmp esi, 253
movaps xmm8, xmm0
jbe .LBB0_9
.LBB0_11:
addss xmm4, xmm11
inc edx
movsxd rsi, edx
cmp rsi, rax
jl .LBB0_8
addss xmm7, dword ptr [rsp + 12]
inc r11d
movsxd rdx, r11d
cmp rdx, r10
jl .LBB0_7
inc r9d
movsxd rdx, r9d
cmp rdx, r8
jl .LBB0_6
addss xmm1, xmm8
jmp .LBB0_19
.LBB0_2:
mov eax, 1
.p2align 4, 0x90
.LBB0_3:
movsxd rdx, eax
lea eax, [rdx + 1]
cmp rdx, r8
jl .LBB0_3
jmp .LBB0_19
.LBB0_15:
xor eax, eax
.p2align 4, 0x90
.LBB0_16:
mov edx, 1
.p2align 4, 0x90
.LBB0_17:
movsxd rsi, edx
lea edx, [rsi + 1]
cmp rsi, r10
jl .LBB0_17
inc eax
movsxd rdx, eax
cmp rdx, r8
jl .LBB0_16
.LBB0_19:
movss dword ptr [rcx + 12], xmm1
movaps xmm6, xmmword ptr [rsp + 16]
movaps xmm7, xmmword ptr [rsp + 32]
movaps xmm8, xmmword ptr [rsp + 48]
movaps xmm9, xmmword ptr [rsp + 64]
movaps xmm10, xmmword ptr [rsp + 80]
movaps xmm11, xmmword ptr [rsp + 96]
movaps xmm12, xmmword ptr [rsp + 112]
movaps xmm13, xmmword ptr [rsp + 128]
movaps xmm14, xmmword ptr [rsp + 144]
movaps xmm15, xmmword ptr [rsp + 160]
add rsp, 176
pop rsi
ret
.Lfunc_end0:
.size "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB", .Lfunc_end0-"Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandlebrotA>.Execute(ref Benchmarks.MandlebrotA data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_401C57F7CC7F16AB"

.globl burst.initialize
.p2align 4, 0x90
.type burst.initialize,@function
burst.initialize:
ret
.Lfunc_end1:
.size burst.initialize, .Lfunc_end1-burst.initialize

.section ".note.GNU-stack","",@progbits

.text
.intel_syntax noprefix
.file "main"
.section .rodata.cst4,"aM",@progbits,4
.p2align 2
.LCPI0_0:
.long 1078355558
.LCPI0_1:
.long 1076258406
.LCPI0_2:
.long 3221644902
.LCPI0_3:
.long 3215353446
.LCPI0_4:
.long 3204448256
.LCPI0_5:
.long 3225419776
.LCPI0_6:
.long 1073741824
.text
.globl "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184"
.p2align 4, 0x90
.type "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184",@function
"Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184":
push rsi
sub rsp, 176
movaps xmmword ptr [rsp + 160], xmm15
movaps xmmword ptr [rsp + 144], xmm14
movaps xmmword ptr [rsp + 128], xmm13
movaps xmmword ptr [rsp + 112], xmm12
movaps xmmword ptr [rsp + 96], xmm11
movaps xmmword ptr [rsp + 80], xmm10
movaps xmmword ptr [rsp + 64], xmm9
movaps xmmword ptr [rsp + 48], xmm8
movaps xmmword ptr [rsp + 32], xmm7
movaps xmmword ptr [rsp + 16], xmm6
mov r8d, dword ptr [rcx + 8]
test r8, r8
xorps xmm1, xmm1
je .LBB0_19
mov r10d, dword ptr [rcx]
test r10d, r10d
je .LBB0_2
mov eax, dword ptr [rcx + 4]
test eax, eax
je .LBB0_15
cvtsi2ss xmm0, r10
movabs rdx, offset .LCPI0_0
movss xmm1, dword ptr [rdx]
divss xmm1, xmm0
movss dword ptr [rsp + 12], xmm1
xorps xmm0, xmm0
cvtsi2ss xmm0, rax
movabs rdx, offset .LCPI0_1
movss xmm11, dword ptr [rdx]
divss xmm11, xmm0
xor r9d, r9d
movabs rdx, offset .LCPI0_2
movss xmm0, dword ptr [rdx]
movss dword ptr [rsp + 8], xmm0
movabs rdx, offset .LCPI0_3
movss xmm10, dword ptr [rdx]
movabs rdx, offset .LCPI0_4
movss xmm13, dword ptr [rdx]
movabs rdx, offset .LCPI0_5
movss xmm14, dword ptr [rdx]
xorps xmm12, xmm12
movabs rdx, offset .LCPI0_6
movss xmm15, dword ptr [rdx]
.p2align 4, 0x90
.LBB0_6:
movss xmm7, dword ptr [rsp + 8]
xor r11d, r11d
.p2align 4, 0x90
.LBB0_7:
xor edx, edx
movaps xmm2, xmm10
.p2align 4, 0x90
.LBB0_8:
xorps xmm1, xmm1
mov esi, -1
xorps xmm8, xmm8
.p2align 4, 0x90
.LBB0_9:
movaps xmm0, xmm1
mulss xmm0, xmm0
movaps xmm6, xmm8
mulss xmm6, xmm6
movaps xmm5, xmm6
addss xmm5, xmm0
xorps xmm3, xmm3
rsqrtss xmm3, xmm5
movaps xmm4, xmm5
mulss xmm4, xmm3
movaps xmm9, xmm4
mulss xmm9, xmm13
mulss xmm4, xmm3
addss xmm4, xmm14
mulss xmm4, xmm9
cmpeqss xmm5, xmm12
andnps xmm5, xmm4
ucomiss xmm5, xmm15
jae .LBB0_11
movaps xmm1, xmm7
subss xmm1, xmm6
addss xmm1, xmm0
addss xmm0, xmm0
addss xmm0, xmm2
inc esi
cmp esi, 253
movaps xmm8, xmm0
jbe .LBB0_9
.LBB0_11:
addss xmm2, xmm11
inc edx
movsxd rsi, edx
cmp rsi, rax
jl .LBB0_8
addss xmm7, dword ptr [rsp + 12]
inc r11d
movsxd rdx, r11d
cmp rdx, r10
jl .LBB0_7
inc r9d
movsxd rdx, r9d
cmp rdx, r8
jl .LBB0_6
addss xmm1, xmm8
jmp .LBB0_19
.LBB0_2:
mov eax, 1
.p2align 4, 0x90
.LBB0_3:
movsxd rdx, eax
lea eax, [rdx + 1]
cmp rdx, r8
jl .LBB0_3
jmp .LBB0_19
.LBB0_15:
xor eax, eax
.p2align 4, 0x90
.LBB0_16:
mov edx, 1
.p2align 4, 0x90
.LBB0_17:
movsxd rsi, edx
lea edx, [rsi + 1]
cmp rsi, r10
jl .LBB0_17
inc eax
movsxd rdx, eax
cmp rdx, r8
jl .LBB0_16
.LBB0_19:
movss dword ptr [rcx + 12], xmm1
movaps xmm6, xmmword ptr [rsp + 16]
movaps xmm7, xmmword ptr [rsp + 32]
movaps xmm8, xmmword ptr [rsp + 48]
movaps xmm9, xmmword ptr [rsp + 64]
movaps xmm10, xmmword ptr [rsp + 80]
movaps xmm11, xmmword ptr [rsp + 96]
movaps xmm12, xmmword ptr [rsp + 112]
movaps xmm13, xmmword ptr [rsp + 128]
movaps xmm14, xmmword ptr [rsp + 144]
movaps xmm15, xmmword ptr [rsp + 160]
add rsp, 176
pop rsi
ret
.Lfunc_end0:
.size "Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184", .Lfunc_end0-"Unity.Jobs.IJobExtensions.JobStruct`1<Benchmarks.MandelbrotB>.Execute(ref Benchmarks.MandelbrotB data, System.IntPtr additionalPtr, System.IntPtr bufferRangePatchData, ref Unity.Jobs.LowLevel.Unsafe.JobRanges ranges, int jobIndex)_45973D4CD709E184"

.globl burst.initialize
.p2align 4, 0x90
.type burst.initialize,@function
burst.initialize:
ret
.Lfunc_end1:
.size burst.initialize, .Lfunc_end1-burst.initialize

.section ".note.GNU-stack","",@progbits

Spot the difference

Wokky, Jul 25, 2019

#34

nxrighthere likes this.
xoofx

Unity Technologies

Joined:

Nov 5, 2016

Posts:

417
nxrighthere said: ↑

I was curious how well Burst optimizes C# code against GCC with C, so I ported four famous benchmarks and a raytracer with different workloads and made them identical between the two. C code compiled with all possible optimizations using -DNDEBUG -Ofast -march=native -flto compiler options. Benchmarks were done on Windows 10 w/ AMD FX-4300 (4GHz) using standalone build. Mono JIT is included for fun.

Thanks for this benchmark @nxrighthere, that's cool!

So without looking deep into the code, I just have two preliminary remarks:

1. You should make sure that your job in Burst is using fast floating-point calculation (as you set it up for GCC); otherwise, codegen-wise, I believe it is going to be a significant disadvantage for Burst (so set it up via

Code (CSharp):

[BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]

)
2. The comparison between GCC 8.1.0 (a relatively recent version) can also scramble a bit the results with Burst which is using Clang 6.0 (old clang, but we have a plan to upgrade to 8.x in the coming weeks)
xoofx, Jul 25, 2019

#35

eizenhorn and nxrighthere like this.
xoofx

Unity Technologies

Joined:

Nov 5, 2016

Posts:

417

(Related to point 2, it would be interesting to see how Clang performs on the same C code, for example.)

xoofx, Jul 25, 2019

#36

nxrighthere likes this.
xoofx

Unity Technologies

Joined:

Nov 5, 2016

Posts:

417

Also, as stated above by @Joachim_Ante, you should do a warm-up run of your Burst job without measuring it, to make sure the test doesn't count Burst compilation time. So usually prefer compiling synchronously, plus a warm-up.
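The warm-up pattern carries over to the C side of a harness too. C has no JIT step, but an untimed first run still keeps one-time costs (cold caches, page faults) out of the measurement. A minimal sketch, where `sample_job`, `job_calls` and `time_after_warmup` are hypothetical names that are not part of the benchmark:

```c
#include <time.h>

/* Hypothetical stand-in for a benchmarked job; not taken from the thread. */
int job_calls = 0;
volatile unsigned long sink = 0;

void sample_job(void) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++) {
        acc += i * i;            /* unsigned arithmetic wraps, which is fine here */
    }
    sink = acc;                  /* keep the loop from being optimized away */
    job_calls++;
}

/* Run the job once untimed (warm-up), then time a second run in seconds. */
double time_after_warmup(void (*job)(void)) {
    job();                       /* warm-up: duration deliberately discarded */
    clock_t start = clock();
    job();                       /* measured run */
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_after_warmup(sample_job)` executes the job twice and reports only the second run.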

xoofx, Jul 25, 2019

#37

kalineh and nxrighthere like this.
xoofx

Unity Technologies

Joined:

Nov 5, 2016

Posts:

417

Just to emphasize why "fast" mode (remark 1 above) can make a huge difference:

Fast allows the compiler to reorder floating-point instructions, loosening floating-point strictness in favor of speed. This alone, in Burst/LLVM or GCC, brings a lot more optimization opportunities for the compiler: collapsing many instructions, including vectorization across instructions (called the SLP vectorizer in LLVM) and potentially loop vectorization (but I believe that in the case of the raytracer, without [NoAlias] hints on pointer arguments in the code, you should not get any).
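To illustrate why reordering needs an explicit opt-in: floating-point addition is not associative, so a strict compiler must keep the order written in the source. A small C sketch (the function names and the input values are chosen purely for illustration):

```c
/* Same three operands, two groupings. Under strict IEEE semantics the
   compiler may not turn one into the other, because the results can differ. */
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

With `a = 1e20f`, `b = -1e20f`, `c = 1.0f`, the left grouping cancels the large values first and returns 1, while the right grouping loses the 1 when it is added to -1e20 and returns 0. Fast mode tells the compiler it may treat both forms as interchangeable, which is what lets it regroup sums into vector lanes.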

Last edited: Jul 25, 2019

xoofx, Jul 25, 2019

#38

kalineh and nxrighthere like this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

@Wokky @Joachim_Ante @xoofx Ah, thanks for all the notes! Indeed, CompileSynchronously = true, warm-up, and float mode are things I wasn't aware of (I'm not very familiar with Burst options at the moment); I will address this quickly.

As for using double in the NBody: yes, I agree. I just did that out of curiosity; performance-wise, float should be used instead for sure.

Clang and IL2CPP will be added very soon for the sake of completeness.

nxrighthere, Jul 25, 2019

#39

JesOb and Shinyclef like this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Done, you can find the diff with numbers here. Noticeable changes: FloatPrecision.Standard with FloatMode.Fast improved the Mandelbrot performance by 52%, and it's now very close to GCC. Fast float mode makes the raytracer up to 91% slower than the default one on my machine, so I haven't used it there. Everything else remains almost the same. NBody with double-precision floating-point is unaffected.

Last edited: Jul 25, 2019

nxrighthere, Jul 25, 2019

#40

Shinyclef and Deleted User like this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194

I'm seeing if I can improve the Raytracing results, as its code base is designed to be minimal and the opposite of a good benchmark IMHO.

Initial results
(Burst) Pixar Raytracer: 167,786,775 ticks
After some tweaks
(Burst) Pixar Raytracer: 160,050,339 ticks

Tried converting to float4 for maximum Burst but actually got a surprise...
(Burst) Pixar Raytracer: 161,807,136 ticks

It was slower than float3...?

Arowx, Jul 25, 2019

#41

Shinyclef likes this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194
xoofx said: ↑

Thanks for this benchmark @nxrighthere, that's cool!

So without looking deep into the code, I just have two preliminary remarks:

1. You should make sure that your job in Burst is using fast calculation (as you set it up for GCC), otherwise codegen-wise I believe it is going to be a significant disadvantage for Burst. You can set it up via:

Code (CSharp):

[BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]

2. Comparing against GCC 8.1.0 (a relatively recent version) can also skew the results a bit, since Burst is using Clang 6.0 (an old Clang, but we plan to upgrade to 8.x in the coming weeks).

Code (CSharp):

[BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]

Produces a big drop in performance for me e.g.
(Burst) Pixar Raytracer
149,860,110 ticks [BurstCompile]
>222,000,000 ticks [BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]

(Burst) Pixar Raytracer: 138,261,278 ticks [Pre-warmed]
Last edited: Jul 25, 2019

Arowx, Jul 25, 2019

#42

Shinyclef likes this.
Arowx

Joined:

Nov 12, 2009

Posts:

8,194
It looks like this is a bug in your code as it attempts to address the -1 index of the letters array in the Raytracer...

Code (CSharp):

Vector begin = MultiplyFloat(new Vector { x = letters[i] - 79.0f, y = letters[i - 1] - 79.0f, z = 0.0f }, 0.5f);
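For comparison, here is a C sketch of reading the start point of a segment so that no index ever drops below zero. It assumes the letter data is packed four bytes per segment as x0, y0, x1, y1, and that the intended second index is `i + 1`; the thread only states that the line was corrected, so both are assumptions. `Point2` and `segment_begin` are hypothetical names.

```c
#include <stddef.h>

typedef struct { float x, y; } Point2;

/* Decode the start point of segment `seg` from a packed byte array.
   The -79 offset and 0.5 scale mirror the quoted line; the x0,y0,x1,y1
   layout is an assumption made for this sketch. */
Point2 segment_begin(const unsigned char *letters, size_t seg) {
    size_t i = seg * 4;
    Point2 p;
    p.x = (letters[i]     - 79.0f) * 0.5f;
    p.y = (letters[i + 1] - 79.0f) * 0.5f;  /* i + 1: with i - 1, segment 0 would read index -1 */
    return p;
}
```

For the bytes `{81, 83, 85, 87}`, segment 0 starts at (1, 2).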
Arowx, Jul 25, 2019

#43

nxrighthere likes this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

It was a happy little accident, now it's correct.

nxrighthere, Jul 26, 2019

#44
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

Added Clang (LLVM 8) and IL2CPP (Unity 2019.1.11f1).

So, Burst's LLVM brother is slower than Burst in some tests but significantly faster in the raytracer, since fast-math optimizations work properly there; it's almost as fast as GCC.

@xoofx Are you guys using the original SLEEF for vectorized math? I wonder whether it's worth linking it natively and testing against Unity's built-in math?

Last edited: Jul 26, 2019

nxrighthere, Jul 26, 2019

#45
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

I think it would be better to sort benchmarks by integer/float/double categories and make 5 different algorithms/simulations for each (except double-precision).

nxrighthere, Jul 26, 2019

#46
Arowx

Joined:

Nov 12, 2009

Posts:

8,194
My attempt at optimising the PixarRaytracer...

Code (CSharp):

[BurstCompile(CompileSynchronously = true)]
//[BurstCompile(FloatPrecision.Standard, FloatMode.Fast)] // this is slower for me than just BurstCompile
private unsafe struct PixarRaytracerBurst : IJob
{
    public uint width;
    public uint height;
    public uint samples;
    public float result;
    public NativeArray<float4> lettersBegin;
    public NativeArray<float4> lettersE;

    public void Execute()
    {
        result = PixarRaytracer(width, height, samples, lettersBegin, lettersE);
    }

    private uint marsagliaZ, marsagliaW;

    private float PixarRaytracer(uint width, uint height, uint samples, NativeArray<float4> lb, NativeArray<float4> le)
    {
        lettersBegin = lb;
        lettersE = le;
        marsagliaZ = 666;
        marsagliaW = 999;

        float4 position = new float4 { x = -22.0f, y = 5.0f, z = 25.0f };
        float4 goal = new float4 { x = -3.0f, y = 4.0f, z = 0.0f };
        goal = Inverse(goal) + (position * -1.0f);
        float4 left = new float4 { x = goal.z, y = 0, z = goal.x };
        left = Inverse(left) * (1.0f / width);
        float4 up = new float4(math.cross(goal.xyz, left.xyz), 0f);
        float4 color = new float4();
        float4 o = new float4();
        uint y;
        uint x;
        uint p;
        float colorPlus = 14.0f / 241.0f;
        float invSamples = 1.0f / samples;
        float4 goalleft = goal + left;
        float xwidth;
        float halfwidth = width / 2f;
        float yheight;
        float halfheight = height / 2f;

        for (y = height; y > 0; y--)
        {
            yheight = y - halfheight;
            for (x = width; x > 0; x--)
            {
                xwidth = x - halfwidth;
                for (p = samples; p > 0; p--)
                {
                    color = (color + Trace(position, (Inverse(((goalleft) * (xwidth + Random()))) + (up * (yheight + Random())))));
                }
                color = (color * invSamples + colorPlus);
                o = color + 1.0f;
                color = color / o;
                color = color * 255.0f;
            }
        }

        return color.x + color.y + color.z;
    }

    /*
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float3 Multiply(float3 left, float3 right)
    {
        left.x *= right.x;
        left.y *= right.y;
        left.z *= right.z;
        return left;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float3 MultiplyFloat(float3 vector, float value)
    {
        vector.x *= value;
        vector.y *= value;
        vector.z *= value;
        return vector;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float Modulus(float3 left, float3 right)
    {
        return left.x * right.x + left.y * right.y + left.z * right.z;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float ModulusSelf(float3 vector)
    {
        return vector.x * vector.x + vector.y * vector.y + vector.z * vector.z;
    }
    */

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float4 Inverse(float4 vector)
    {
        return vector * 1f / math.length(vector);
    }

    /*
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float3 Add(float3 left, float3 right)
    {
        left.x += right.x;
        left.y += right.y;
        left.z += right.z;
        return left;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float3 AddFloat(float3 vector, float value)
    {
        vector.x += value;
        vector.y += value;
        vector.z += value;
        return vector;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float3 Cross(float3 to, float3 from)
    {
        to.x *= from.z - to.z * from.y;
        to.y *= from.x - to.x * from.z;
        to.z *= from.y - to.y * from.x;
        return to;
    }
    */

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float Min(float left, float right)
    {
        return left < right ? left : right;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float BoxTest(float4 position, float4 lowerLeft, float4 upperRight)
    {
        lowerLeft = (position + lowerLeft) * -1;
        upperRight = ((upperRight + position) * -1);
        return -Min(Min(Min(lowerLeft.x, upperRight.x), Min(lowerLeft.y, upperRight.y)), Min(lowerLeft.z, upperRight.z));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float Random()
    {
        marsagliaZ = 36969 * (marsagliaZ & 65535) + (marsagliaZ >> 16);
        marsagliaW = 18000 * (marsagliaW & 65535) + (marsagliaW >> 16);
        return (((marsagliaZ << 16) + marsagliaW) + 1.0f) * 3.141592653589793f;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float Sample(float4 position, int* hitType)
    {
        const int size = 15 * 4;
        float distance = 1e9f;
        float4 f = position;
        f.z = 0.0f;
        float4 begin;
        float4 e;
        float4 o;

        for (int i = 0; i < size; i += 4)
        {
            begin = lettersBegin[i];
            e = lettersE[i];
            o =((f + ((begin + e) * Min(-Min(math.dot(((begin + f) * -1.0f), e) / math.dot(e,e), 0.0f), 1.0f)))* -1.0f);
            //Vector o = MultiplyFloat(Add(f, MultiplyFloat(Add(begin, e), Min(-Min(Modulus(MultiplyFloat(Add(begin, f), -1.0f), e) / ModulusSelf(e), 0.0f), 1.0f))), -1.0f);
            distance = Min(distance, math.dot(o,o));
        }

        distance = math.sqrt(distance);

        float4* curves = stackalloc float4[2];
        curves[0] = new float4 { x = -11.0f, y = 6.0f, z = 0.0f };
        curves[1] = new float4 { x = 11.0f, y = 6.0f, z = 0.0f };
        float m;

        for (int i = 2; i > 0; i--)
        {
            o = (f + (curves[i] * -1.0f));
            m = 0.0f;
            if (o.x > 0.0f)
            {
                m = math.abs(math.length(o) - 2.0f);
            }
            else
            {
                if (o.y > 0.0f)
                    o.y += -2.0f;
                else
                    o.y += 2.0f;
                o.y += math.length(o);
            }
            distance = Min(distance, m);
        }

        distance = math.pow(math.pow(distance, 8.0f) + math.pow(position.z, 8.0f), 0.125f) - 0.5f;
        *hitType = (int)Hit.Letter;

        float roomDistance = Min(-Min(
            BoxTest(position,
                new float4 { x = -30.0f, y = -0.5f, z = -30.0f },
                new float4 { x = 30.0f, y = 18.0f, z = 30.0f }),
            BoxTest(position,
                new float4 { x = -25.0f, y = -17.5f, z = -25.0f },
                new float4 { x = 25.0f, y = 20.0f, z = 25.0f })),
            BoxTest(new float4 { x = math.fmod(math.abs(position.x), 8), y = position.y, z = position.z },
                new float4 { x = 1.5f, y = 18.5f, z = -25.0f },
                new float4 { x = 6.5f, y = 20.0f, z = 25.0f }));

        if (roomDistance < distance)
        {
            distance = roomDistance;
            *hitType = (int)Hit.Wall;
        }

        float sun = 19.9f - position.y;

        if (sun < distance)
        {
            distance = sun;
            *hitType = (int)Hit.Sun;
        }

        return distance;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private int RayMarching(float4 origin, float4 direction, float4* hitPosition, float4* hitNormal)
    {
        int hitType = (int)Hit.None;
        int noHitCount = 0;
        float distance = 0.0f;
        float4 offsetX = new float4 { x = 0.01f, y = 0.0f, z = 0.0f };
        float4 offsetY = new float4 { x = 0.0f, y = 0.01f, z = 0.0f };
        float4 offsetZ = new float4 { x = 0.0f, y = 0.0f, z = 0.01f };

        for (float i = 0; i < 100; i += distance)
        {
            *hitPosition = (origin + direction)* i;
            distance = Sample(*hitPosition, &hitType);
            if (distance < 0.01f || ++noHitCount > 99)
            {
                *hitNormal = Inverse(new float4 {
                    x = Sample((*hitPosition + offsetX), &noHitCount) - distance,
                    y = Sample((*hitPosition + offsetY), &noHitCount) - distance,
                    z = Sample((*hitPosition + offsetZ), &noHitCount) - distance });
                return hitType;
            }
        }

        return (int)Hit.None;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private float4 Trace(float4 origin, float4 direction)
    {
        float4 sampledPosition = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
        float4 normal = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
        float4 color = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
        float4 attenuation = new float4 { x = 1.0f, y = 1.0f, z = 1.0f };
        float4 lightDirection = Inverse(new float4 { x = 0.6f, y = 0.6f, z = 1.0f });
        float incidence;
        float p;
        float c;
        float s;
        float g;
        float u;
        float v;
        Hit hitType;

        for (int bounceCount = 3; bounceCount > 0; bounceCount--)
        {
            hitType = (Hit)RayMarching(origin, direction, &sampledPosition, &normal);
            switch (hitType)
            {
                case Hit.None:
                    break;
                case Hit.Letter:
                {
                    direction = ((direction + normal) * math.dot(normal, direction) * -2.0f);
                    origin = ((sampledPosition + direction)* 0.1f);
                    attenuation = (attenuation * 0.2f);
                    break;
                }
                case Hit.Wall:
                {
                    incidence = math.dot(normal, lightDirection);
                    p = 6.283185f * Random();
                    c = Random();
                    s = math.sqrt(1.0f - c);
                    g = normal.z < 0 ? -1.0f : 1.0f;
                    u = -1.0f / (g + normal.z);
                    v = normal.x * normal.y * u;
                    direction = ((new float4 { x = v, y = g + normal.y * normal.y * u, z = -normal.y * (math.cos(p) * s) }+
                        new float4 { x = 1.0f + g * normal.x * normal.x * u, y = g * v, z = -g * normal.x })
                        + (normal* math.sqrt(c)));
                    origin = ((sampledPosition+ direction)* 0.1f);
                    attenuation = (attenuation * 0.2f);
                    if (incidence > 0 && RayMarching(((sampledPosition + normal)* 0.1f), lightDirection, &sampledPosition, &normal) == (int)Hit.Sun)
                        color = (((color + attenuation) * new float4 { x = 500.0f, y = 400.0f, z = 100.0f })* incidence);
                    break;
                }
                case Hit.Sun:
                {
                    color = ((color + attenuation) * new float4 { x = 50.0f, y = 80.0f, z = 100.0f });
                    goto escape;
                }
            }
        }

        escape:
        return color;
    }
}

You now need to launch it like this:

Code (CSharp):

{
    NativeArray<float4> lb;
    NativeArray<float4> le;

    SetupLetters(out lb, out le); // pre-calculating letter data to vector information.

    var pixarRaytracerBurst = new PixarRaytracerBurst
    {
        width = 720,
        height = 480,
        samples = pixarRaytracer,
        lettersBegin = lb,
        lettersE = le
    };

    pixarRaytracerBurst.Run();

    stopwatch.Restart();
    pixarRaytracerBurst.Run();
    time = stopwatch.ElapsedTicks;

    Debug.Log("(Burst) Pixar Raytracer: " + time + " ticks");
}

Note the changes: Vector swapped out for float4, a lot of the inline functions converted to math.<functions>, and, probably the largest change, the addition of NativeArrays to store the pre-calculated vectors for the letters used by the Sample function.

Score before changes:

(Burst) Pixar Raytracer: 167,867,050 ticks

Score after changes:

(Burst) Pixar Raytracer: 138,667,282 ticks

Even with these changes I still think this is a poor benchmarking algorithm due to the Sample function constantly re-calculating 'ray' to letter proximity without any spatial optimisation.
Arowx, Jul 26, 2019

#47

tigerleapgorge likes this.
xoofx

Unity Technologies

Joined:

Nov 5, 2016

Posts:

417

Arowx said: ↑

Even with these changes I still think this is a poor benchmarking algorithm due to the Sample function constantly re-calculating 'ray' to letter proximity without any spatial optimisation.

This is not important for comparing codegen between compilers on the same codebase. The point here is less about making the fastest ray calculation code and more about comparing how simple code behaves in Burst and other compilers. I would like to compare (almost) apples to apples here, where we don't change the Burst C# code relative to the C version. That way, we can track anything that could be different between our codegen for C# and C++.

nxrighthere said: ↑

Done, you can find the diff with numbers here. Noticeable changes: FloatPrecision.Standard with FloatMode.Fast improved the Mandelbrot performance by 52%, and it's now very close to GCC. Fast float mode makes the raytracer up to 91% slower than the default one on my machine, so I haven't used it there. Everything else remains almost the same. NBody with double-precision floating-point is unaffected.

This is surprising. We will have a look. It could be that we might have a bug here in burst

We have a benchmark suite in the Burst codebase for comparing C# vs C++ code, but we never had a chance to work on practical cases, so @nxrighthere, if you don't mind, we will gladly integrate your benchmarks as part of our suite. This is exactly the kind of benchmarks that were meant to be put in our suite.

xoofx, Jul 26, 2019

#48

Saiffyros, Ofx360, MadeFromPolygons and 5 others like this.
nxrighthere

Joined:

Mar 2, 2014

Posts:

567

xoofx said: ↑

if you don't mind, we will gladly integrate your benchmarks as part of our suite. This is exactly the kind of benchmarks that were meant to be put in our suite.

It would be a pleasure, I'm happy to provide even more tests.

nxrighthere, Jul 26, 2019

#49

MadeFromPolygons, RaL and Shinyclef like this.
Shinyclef

Joined:

Nov 20, 2013

Posts:

505

These tests are fantastic. I'm going to have to find some time at some point to read through more carefully and understand the key learnings about what Burst likes and doesn't like.

Shinyclef, Jul 26, 2019

#50

nxrighthere likes this.
