
Question Reinterpret NativeArray<int> to <v128>?

Discussion in 'Burst' started by Trindenberg, Oct 16, 2022.

  1. Trindenberg

    Joined: Dec 3, 2017 | Posts: 398
    Code (CSharp):
    numbers = numbersIn.Reinterpret<v128>(sizeof(int) * 4);

    // numbers - NativeArray<v128>
    // numbersIn - NativeArray<int>
    InvalidOperationException: Type System.Int32 was expected to be 16 but is 4 bytes

    I thought this would be a walk in the park, just changing the element size to 16 bytes. I hope I don't have to load the ints into v128 manually!
     
  2. Spy-Master

    Joined: Aug 4, 2022 | Posts: 638
    https://docs.unity3d.com/ScriptReference/Unity.Collections.NativeArray_1.Reinterpret.html
     
  3. Trindenberg

    Joined: Dec 3, 2017 | Posts: 398
    I read that more closely and changed the argument to sizeof(int), but:

    On a test array of 24 ints, numbers now has a length of 24 when it needs to be 6 x v128.

    Assuming a NativeArray simply holds a pointer and a length, how do I change the element size from 4 to 16 bytes and the length from 24 to 6? That's easily done with void pointers, but I'm not sure how to do it in Unity. I'm trying to do things the safe way!
     
  4. Spy-Master

    Joined: Aug 4, 2022 | Posts: 638
    Code (CSharp):
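    // Pass the size of the source element type (int), not of the target type (v128).
    NativeArray<int> array1 = new NativeArray<int>(24, Allocator.Temp);
    NativeArray<v128> array2 = array1.Reinterpret<v128>(sizeof(int));
    Debug.Log($"{array1.Length} {array2.Length}"); // lengths before and after reinterpreting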
    [Attached image: upload_2022-10-16_3-33-7.png]
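
The length arithmetic behind the reinterpretation can be sketched in plain C#. This is only an illustration, assuming the reinterpreted length is the total byte count divided by the target element size, as the question's "24 ints to 6 x v128" expectation implies. `ReinterpretMath` and `ReinterpretedLength` are hypothetical names, not part of Unity's API.

```csharp
using System;

public static class ReinterpretMath
{
    // Hypothetical helper: element count after viewing the same bytes
    // as a different element type (e.g. 4-byte int -> 16-byte v128).
    public static int ReinterpretedLength(int length, int sourceSize, int targetSize)
    {
        long byteLength = (long)length * sourceSize;

        // The byte count must divide evenly into target-sized elements.
        if (byteLength % targetSize != 0)
            throw new InvalidOperationException(
                $"{byteLength} bytes cannot be split into elements of {targetSize} bytes");

        return (int)(byteLength / targetSize);
    }
}
```

For the test array in this thread, `ReinterpretMath.ReinterpretedLength(24, sizeof(int), 16)` gives 6, while a length of 25 would be rejected because 100 bytes is not a multiple of 16.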
     
  5. Trindenberg

    Joined: Dec 3, 2017 | Posts: 398
    OK, I got further, but here's what I don't understand. With line 27 commented out, as shown, the job takes 300 ticks. With line 27 enabled instead of line 26, so I can get the sum out, it takes 5,000,000 ticks.

    I'm assuming 0+1 is much faster than 1000000+1; that seems like the only possible reason, but why? More bits to add up?

    Code (CSharp):
    1.         public void Execute(int i)
    2.         {
    3.             v128
    4.                 lefty = numbersV[i],
    5.                 right = new(),
    6.                 tally = new(),
    7.                 tTally = new();
    8.  
    9.             for (int idx = numbersV.Length-1; idx > -1; idx--)
    10.             {
    11.                 right = numbersV[idx];
    12.  
    13.                 v128 s2 = SSE2.shuffle_epi32(right, _1230);
    14.                 v128 s3 = SSE2.shuffle_epi32(right, _2301);
    15.                 v128 s4 = SSE2.shuffle_epi32(right, _3012);
    16.  
    17.                 v128 c1 = SSE2.cmpgt_epi32(lefty, right);
    18.                 v128 c2 = SSE2.cmpgt_epi32(lefty, s2);
    19.                 v128 c3 = SSE2.cmpgt_epi32(lefty, s3);
    20.                 v128 c4 = SSE2.cmpgt_epi32(lefty, s4);
    21.  
    22.                 v128 t1 = SSE2.add_epi32(c1, c2);
    23.                 v128 t2 = SSE2.add_epi32(c3, c4);
    24.                 v128 t3 = SSE2.add_epi32(t1, t2);
    25.                
    26.                 tally = SSE2.add_epi32(tTally, t3);   // Overwrites tally: tTally stays zero
    27.                 //tally = SSE2.add_epi32(tally, t3);  // Accumulates: tally += t3
    28.             }
    29.  
    30.             numbersOut[i] = tally;
    31.         }
     
  6. Trindenberg

    Joined: Dec 3, 2017 | Posts: 398
    Or is it some crazy safety check based on the size? I thought 0+1 was no different from 1000000+1 in speed (well, actually these comparisons output -1, so it would be -1000000 + -1).
     
  7. Neto_Kokku

    Joined: Feb 15, 2018 | Posts: 1,751
    When you use line 26 instead of line 27, each iteration of your loop completely replaces the value of tally without using anything from the previous iteration. This means only the last iteration does any useful work; the rest is wasted.

    The massive difference in timings suggests the Burst compiler noticed this too and optimized the loop away: it only needs to run the last iteration (idx = 0) to obtain the same result. The actual work is then done only once per Execute call, so n times in total for an input array of size n.

    With line 27 instead of line 26, the amount of work for n items is n², so the cost grows quadratically (not exponentially) as your list grows.
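
The difference between the two lines can be sketched with a scalar stand-in. This is only an illustration in plain C#, not the original job: `LoopWork`, `Overwrite`, `OverwriteLastOnly`, `Accumulate` and `Compare` are hypothetical names, and `Compare` imitates a single lane of cmpgt_epi32 (-1 when the left value is greater, otherwise 0).

```csharp
public static class LoopWork
{
    // Scalar stand-in for one lane of cmpgt_epi32: -1 if a > b, else 0.
    static int Compare(int a, int b) => a > b ? -1 : 0;

    // Mirrors line 26: tally is rebuilt from a constant base every iteration,
    // so only the final iteration (idx = 0) decides the result.
    public static int Overwrite(int[] values, int left)
    {
        const int baseTally = 0;
        int tally = 0;
        for (int idx = values.Length - 1; idx > -1; idx--)
            tally = baseTally + Compare(left, values[idx]);
        return tally;
    }

    // What an optimizer is free to reduce Overwrite to: no loop at all.
    public static int OverwriteLastOnly(int[] values, int left)
        => values.Length == 0 ? 0 : Compare(left, values[0]);

    // Mirrors line 27: every iteration feeds the next, so the loop must run.
    public static int Accumulate(int[] values, int left)
    {
        int tally = 0;
        for (int idx = values.Length - 1; idx > -1; idx--)
            tally += Compare(left, values[idx]);
        return tally;
    }
}
```

`Overwrite` and `OverwriteLastOnly` always return the same value, which is why the loop can legally disappear. `Accumulate` has to visit every element, and calling it once per element is what produces the n² total.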
     
  8. Trindenberg

    Joined: Dec 3, 2017 | Posts: 398
    OK, that makes sense. Well, I have no idea how intrinsics can make things faster here; it seems pretty slow unless I'm missing something (safety checks are off). I'm also realising now that the inputs/outputs are backwards. Intrinsics are one thing, thinking backwards is another! But at least I made a load of backwards consts for shuffling (after realising they need a backwards control). Maybe I can use that part for re-arranging.
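
The "backwards" feel of the shuffle control may come from how the 8-bit immediate is packed: the lowest two bits select the source for destination lane 0, and the highest two bits select the source for lane 3. Here is a scalar sketch of that encoding in plain C#. `ShuffleSketch`, `Control` and `Shuffle` are hypothetical names for illustration, and no claim is made that they match the `_1230`-style constants used in the job above.

```csharp
public static class ShuffleSketch
{
    // Packs four 2-bit source-lane indices into one control byte,
    // written highest destination lane first (the usual SSE convention).
    public static int Control(int lane3, int lane2, int lane1, int lane0)
        => (lane3 << 6) | (lane2 << 4) | (lane1 << 2) | lane0;

    // Scalar emulation of a 4 x 32-bit shuffle:
    // destination lane i takes source lane ((control >> (2 * i)) & 3).
    public static int[] Shuffle(int[] source, int control)
    {
        var result = new int[4];
        for (int i = 0; i < 4; i++)
            result[i] = source[(control >> (2 * i)) & 3];
        return result;
    }
}
```

With this packing, `Control(3, 2, 1, 0)` is the identity (0xE4), and `Control(0, 3, 2, 1)` rotates the lanes so that `{10, 20, 30, 40}` becomes `{20, 30, 40, 10}`.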
     
  9. vectorized-runner

    Joined: Jan 22, 2018 | Posts: 398