Bug Vectorized Trig Functions Results Differ

chadfranklin47 · May 9, 2023

Hello, I am doing some noise generation that takes advantage of the Burst compiler, and it usually works great for batch workloads. I process 32 floats per loop iteration and usually get nicely vectorized code. I also have a fallback scalar evaluation function that processes 1 float per iteration for algorithms that aren't suited to batching. A noise function should return the same output for the same input, and I would like to ensure that the result is identical whether an input is passed to the vectorized batch evaluation function or to the scalar evaluation function. It is usually not too difficult to get equal results from the two. It is a different story with sin & cos.

The values returned from sin & cos in the batch evaluation function differ slightly from those returned by the scalar function. The only way I have found to get equal results from both functions is to compile with FloatMode.Deterministic. The problem with this solution is that calls to sin & cos are no longer auto-vectorized by Burst; the batch function instead falls back to calling scalar sin/cos 32 times. With manual vectorization (using float4) the results stay identical to the scalar function while being faster, but unfortunately there doesn't seem to be a way to manually vectorize sin & cos up to Avx width (8 floats). This leaves me with a few questions.

Is it expected behavior that the results from sin & cos differ when vectorized vs not vectorized? Is there a better way to achieve equal results from the scalar and batch functions? Would it be possible to add Avx intrinsic functions for sin, cos, etc.? Is there something I am overlooking? Thanks.

tim_jones · May 9, 2023

Hi @chadfranklin47!

chadfranklin47 said:

Is it expected behavior that the results from sin & cos differ when vectorized vs not vectorized?

Yes, that is expected. When not using FloatMode.Deterministic, the vectorized and non-vectorized code paths use slightly different algorithms.
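To illustrate why "slightly different algorithms" can disagree in the last bits, here is a minimal Python sketch. It is not Burst or SLEEF code: the two functions (`sin_horner` and `sin_terms`, both hypothetical names) evaluate the same degree-7 Taylor polynomial for sin in float32, but in a different order of operations, so their rounding errors can accumulate differently. The sketch assumes inputs in roughly [-pi/4, pi/4] and emulates float32 by rounding after every operation.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Taylor coefficients of sin(x) = x - x^3/3! + x^5/5! - x^7/7!, stored as float32.
C3, C5, C7 = f32(-1.0 / 6.0), f32(1.0 / 120.0), f32(-1.0 / 5040.0)


def sin_horner(x: float) -> float:
    """Stand-in for one code path: Horner form, rounded after every step."""
    x = f32(x)
    x2 = f32(x * x)
    p = f32(C5 + f32(C7 * x2))
    p = f32(C3 + f32(p * x2))
    return f32(x + f32(f32(p * x2) * x))


def sin_terms(x: float) -> float:
    """Stand-in for another code path: same polynomial, summed term by term."""
    x = f32(x)
    x2 = f32(x * x)
    x3 = f32(x2 * x)
    x5 = f32(x3 * x2)
    x7 = f32(x5 * x2)
    acc = f32(x + f32(C3 * x3))
    acc = f32(acc + f32(C5 * x5))
    return f32(acc + f32(C7 * x7))


def same_bits(a: float, b: float) -> bool:
    """True if a and b have the same float32 bit pattern."""
    return struct.pack("<f", a) == struct.pack("<f", b)


def count_mismatches(samples: int = 1000) -> int:
    """Count sample inputs in [-pi/4, pi/4] where the two paths differ bitwise."""
    mismatches = 0
    for i in range(samples):
        x = -math.pi / 4 + (math.pi / 2) * i / (samples - 1)
        if not same_bits(sin_horner(x), sin_terms(x)):
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("inputs with different bits:", count_mismatches())
```

Both functions are close approximations of sin on this range, yet nothing forces them to round identically for every input. The same reasoning applies to any pair of scalar and vector implementations that do not share the exact same sequence of operations.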

chadfranklin47 said:

Is there a better way to achieve equal results from the scalar and batch functions?

The underlying SLEEF library that we use does have deterministic vector versions of e.g. sin and cos. We'd have to investigate why Burst isn't using those and is instead falling back to a scalar codepath when using FloatMode.Deterministic. If you're in a position to create a small project with code that reproduces this problem, and log a bug via Help > Report a Bug, that would help us.

chadfranklin47 said:

Would it be possible to add Avx intrinsic functions for sin, cos, etc.?

That's something that's been requested before: exposing the low-level SLEEF vectorized trigonometry functions. At this point it's just something that's on our radar, but it isn't planned.

MarcoPersson · May 9, 2023

Hi @chadfranklin47, to add to what Tim said:
Have you tried compiling your code with FloatPrecision.High?
That should make the resulting code use higher-precision vectorized sin calls (1.0 ULP instead of 3.5 ULP in this case). You might still see a difference between the Bursted and non-Bursted versions, though.
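For anyone who wants to put a number on such differences, here is a minimal Python sketch of a float32 ULP-distance check. It is an illustration, not part of Burst or Unity.Mathematics, and all function names are hypothetical. Note that it measures how many representable float32 steps separate two results (for example a batch output and a scalar output), which is related to, but not the same as, the ULP error bound of a function against the exact mathematical value. It assumes finite inputs.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ordered(x: float) -> int:
    """Map a float32 onto an integer line where adjacent floats differ by 1.

    +0.0 and -0.0 both map to 0, and negative values map to negative ints.
    """
    bits = float32_bits(x)
    magnitude = bits & 0x7FFFFFFF
    return -magnitude if bits & 0x80000000 else magnitude


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b (finite inputs)."""
    return abs(ordered(a) - ordered(b))


def max_ulp_distance(batch, scalar) -> int:
    """Largest per-element ULP distance between two equally long result lists."""
    return max((ulp_distance(a, b) for a, b in zip(batch, scalar)), default=0)
```

A distance of 0 for every element means the two result sets are bit-identical (treating +0.0 and -0.0 as equal), which is the property the original question is after.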

chadfranklin47 · May 10, 2023

Thank you both for your replies.

tim_jones said:

That's something that's been requested before: exposing the low-level SLEEF vectorized trigonometry functions. At this point it's just something that's on our radar, but it isn't planned.

@tim_jones What I had in mind wasn't exposing the SLEEF functions themselves, as that would put the burden on the caller to decide which version to use, but rather a "simple" Avx intrinsic function that does what float4 sin/cos does: call whichever version Burst deems fit.

tim_jones said:

If you're in a position to create a small project with code that reproduces this problem, and log a bug via Help > Report a Bug, that would help us.

Will be done asap.

@MarcoPersson I just tried it, and it does reduce the average difference between the batch and scalar functions by a factor of ~10, but it also reduces performance. I am actually looking to use FloatPrecision.Low here, as I am more concerned about performance. I just need the fast code to return the same results in the batch and scalar versions. Also, to clarify, both the scalar and batch evaluation functions are Burst-compiled but, of course, only the batch version is vectorized.

chadfranklin47 · May 28, 2023

I have gone ahead and filed a bug report for this issue: IN-42241

Participants: chadfranklin47, tim_jones (Unity Technologies), MarcoPersson (Unity Technologies)
