
Bug Vectorized Trig Functions Results Differ

Discussion in 'Burst' started by chadfranklin47, May 9, 2023.

  1. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    229
    Hello, I am doing some noise generation with the Burst compiler, and it (usually) works great for batch workloads. I process 32 floats per loop and usually get nice vectorized code. I also have a fallback scalar evaluation function that processes 1 float per loop, for algorithms that aren't suitable for batching. A noise function should return the same output for the same input, and I would like to ensure that the result is the same whether the input is passed to the vectorized batch evaluation function or to the scalar evaluation function. Usually it is not too difficult to get equal results from the two. It is a different story with sin & cos.

    The values returned from sin & cos in the batch evaluation function differ slightly from those returned by the scalar function. The only method I have found that produces equal results from both is compiling with FloatMode.Deterministic. The issue with this solution is that calls to sin & cos are then no longer auto-vectorized by Burst; the code instead reverts to calling scalar sin/cos 32 times. With manual vectorization (using float4), the results remain equal to the scalar function's while being more performant, though unfortunately there doesn't seem to be a way to manually vectorize sin & cos up to Avx width (8 floats). This leaves me with a few questions.

    Is it expected behavior that the results from sin & cos differ when vectorized vs not vectorized? Is there a better way to achieve equal results from the scalar and batch functions? Would it be possible to add Avx intrinsic functions for sin, cos, etc.? Is there something I am overlooking? Thanks.
     
    Last edited: May 9, 2023
  2. tim_jones

    tim_jones

    Unity Technologies

    Joined:
    May 2, 2019
    Posts:
    287
    Hi @chadfranklin47!

    Yes, that is expected. The vectorized / not-vectorized code paths use slightly different algorithms when not using FloatMode.Deterministic.
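    A minimal C sketch of why two code paths can disagree. The two polynomials below are hypothetical stand-ins chosen for illustration; they are not the algorithms SLEEF or Burst actually use. Both approximate sin on [-pi/4, pi/4], yet they do not return bit-identical floats for the same input.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical "path A": degree-7 Taylor polynomial in Horner form.
   Illustration only; not the SLEEF/Burst implementation. */
float sin_poly7(float x)
{
    assert(fabsf(x) <= 0.7853982f);   /* only valid on [-pi/4, pi/4] */
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
              + x2 * (1.0f / 120.0f
              + x2 * (-1.0f / 5040.0f))));
}

/* Hypothetical "path B": cheaper degree-5 polynomial.
   Slightly less accurate, so its last bits differ from path A. */
float sin_poly5(float x)
{
    assert(fabsf(x) <= 0.7853982f);   /* only valid on [-pi/4, pi/4] */
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
              + x2 * (1.0f / 120.0f)));
}
```

    For x = 0.7f the two functions agree to about four decimal places but are not bit-identical, which is the kind of mismatch described in this thread.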

    The underlying SLEEF library that we use does have deterministic vector versions of e.g. sin and cos. We'd have to look more into it to investigate why it's not using those, and instead falling back to a scalar codepath when using FloatMode.Deterministic. If you're in a position to create a small project with code that reproduces this problem, and log a bug via Help > Report a Bug, that would help us.

    Exposing the low-level SLEEF vectorized trigonometry functions is something that's been requested before. At this point it's on our radar, but isn't planned.
     
  3. MarcoPersson

    MarcoPersson

    Unity Technologies

    Joined:
    Jul 21, 2021
    Posts:
    53
    Hi @chadfranklin47, to add to what Tim said:
    Have you tried compiling your code with FloatPrecision.High?
    That should make the resulting code use higher-precision vectorized sin calls (10 ULP instead of 35 ULP in this case). Though you might still see a difference between the Bursted and non-Bursted version?
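    For readers unfamiliar with the unit: ULP means "unit in the last place", so an error of N ULP means the result is N representable floats away from the reference. A minimal C sketch of measuring that distance between two results follows; the function names are hypothetical helpers, not part of Burst or SLEEF.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto an ordered integer line so that
   adjacent representable floats map to adjacent integers. */
static int64_t ordered_bits(float x)
{
    int32_t i;
    memcpy(&i, &x, sizeof i);   /* well-defined way to read the bits */
    /* Floats are sign-magnitude: mirror negative values onto the
       negative integer axis (this also maps -0.0f and +0.0f to 0). */
    return (i < 0) ? (int64_t)INT32_MIN - (int64_t)i : (int64_t)i;
}

/* Number of representable floats between a and b (0 if equal). */
int64_t ulp_distance(float a, float b)
{
    assert(!isnan(a) && !isnan(b));   /* NaN has no meaningful distance */
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

    Calling ulp_distance(batch_result, scalar_result) on matching inputs is one way to quantify how far the batch and scalar outputs drift apart; a result of 0 means they are bit-identical (apart from the sign of zero).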
     
  4. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    229
    Thank you both for your replies.

    @tim_jones What I was thinking here wouldn't be exposing the SLEEF functions themselves, as that would put the burden on the caller to decide which version to use, but rather a "simple" Avx intrinsic function that does the same thing as float4 sin/cos in calling whichever version Burst deems fit.

    Regarding the bug report: will be done asap.

    @MarcoPersson Just tried it now, and it does lessen the average difference between the batch and scalar functions by a factor of ~10, but it also reduces performance. I am actually looking to use FloatPrecision.Low here, as I am more concerned about performance. I just need the performant code to return the same results in the batch and scalar versions. Also, to clarify, both the scalar and batch evaluation functions are Burst compiled but, of course, only the batch version is vectorized.
     
    Last edited: May 10, 2023
  5. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    229
    I have gone ahead and filed a bug report for this issue: IN-42241
     