Discussion Burst inspiration

Discussion in 'Burst' started by Trindenberg, Oct 14, 2022.

  1. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
    Hi,

Love messing about with code trying to make things fast; just looking for inspiration on what I can try next.

Currently doing a Burst Mandelbrot (without Compute), but there must be some other things I can try, so reaching out for inspiration. :)
     
  2. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,769
Make a small game.
    Your goal: animate 1 million things at a minimum of 60 FPS.

    Show us the performance before and after Burst and SIMD optimization.
    Show us what you have changed to increase performance.
    Ideally, show the Burst-compiled code for the optimized lines.
     
    Trindenberg likes this.
  3. mgear

    mgear

    Joined:
    Aug 3, 2010
    Posts:
    9,408
    Niter88 and Trindenberg like this.
  4. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
Disregarding the stupid things in projects that bring everything to a crawl, the biggest performance culprit in real projects is going to be algorithms with greater than O(n) complexity. Those are especially valuable to optimize.

    So rather than making a million things move, think about 10,000 things each interacting with the 9999 others. XPBD simulations, or even just a general purpose collision detection broadphase are good candidates.

    Sorting algorithms are also a great candidate. They are especially important for recovering determinism after running algorithms in parallel jobs.

    You could also get into really data-heavy algorithms such as audio DSP work. Or pathfinding.

    I guess the biggest question I have is how practical do you want to be?
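As a tiny illustration of the determinism point (plain C#, with `Parallel.For` standing in for parallel jobs; the names are illustrative): results land in whatever order threads finish, and a sort by key restores a stable order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

static class DeterminismDemo
{
    // Parallel workers finish in arbitrary order, so results collected
    // into a shared bag come back shuffled differently on every run.
    public static int[] ComputeSorted(int count)
    {
        var bag = new ConcurrentBag<(int index, int value)>();
        Parallel.For(0, count, i => bag.Add((i, i * i)));

        // Sorting by the original index recovers a deterministic order,
        // no matter how the threads interleaved.
        return bag.OrderBy(p => p.index).Select(p => p.value).ToArray();
    }
}
```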
     
    xVergilx, Antypodish and Trindenberg like this.
  5. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
    I decided to work on a sorting algorithm. Also tried intrinsics with that, but I think I'll give that a miss for now!

Audio DSP, not sure how you would do real time without making native plugins, and that's beyond me at the moment! One of my older projects was actually lossless audio compression; maybe that's something I could update to Burst one day. I suppose Burst would be great for a lot of things: de/compression and de/serialisation?
     
  6. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Always trying to crunch items/fps, but would this be something to work with ECS, or can performance be achieved drawing indirect? I've little experience with ECS but will get into it when it's a stable release.

The project I've been stuck in for ages (it's turning into a test ground for many things) was originally about using audio to define a landscape, and then creating a per-frame localised collider based on position (the collider only needs to exist where there are physical objects). One of those random ideas that turns into many others! I suppose, simply, I could use quads as platforms (rather than a whole mesh, but maybe). I'm just trying to think how to make audio look good in three dimensions, which is kind of tricky really. I did crunch Keijiro's code on BurstFFT and make it faster. I could either make a slice per frame (which is 2-dimensional) or think of some way to make a spectrum 3D. I suppose octave is another dimension. I'll keep these formulating in my brain anyway!
     
  7. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Here was my original messing about with 3D audio, but this is with a 2D slice/construct (also based on AudioSource.GetSpectrumData, which is not native). Maybe Burst has the capacity to update a lot of verts, but not sure (the cost is in the reload); probably a ComputeShader thing really. There's no lighting in this, but I got a nice effect just with colour! With more complexity, it might look quite interesting.

     
  8. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    No plugin necessary. If you wanted, you can play with this code right here. It is a simple linear interpolator that initially looks easy to vectorize, but is actually incredibly difficult. I don't remember which version Burst started auto-vectorizing it, but it was one of the more recent versions. I was really impressed when Burst started doing it, but despite using AVX, it only sped things up by 2X.

    Anyways, if you want to explore audio more, I can definitely help you get started.

    Visualization looks awesome by the way!

    Edit: What happened to the link? Oh well: https://github.com/Dreaming381/Latios-Framework/blob/v0.5.7/MyriAudio/Jobs/Sampling.cs#L105-L121
     
    Last edited: Oct 18, 2022
    Trindenberg likes this.
  9. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
    In regards to my sorting job; I have some questions on multi-threaded tasks.

Is it efficient to compare/copy in a multithreaded way? As in, if I'm comparing 4 massive separate arrays (each only with itself) across threads, is the read/write going to be any faster?

If these are single-index + for-loop jobs, how can I schedule them in parallel, and is it worth it (since they would be fighting for fast cache rather than sequenced in one row)?

    t1 t2 t3 t4
    [ A1 ] [ A2 ] [ A3 ] [ A4 ]
    [ B1 ] [ B2 ] [ B3 ] [ B4 ]

I have a theory that might just work (it probably already exists!): comparing bits 32 at a time, although with floats it will be a little complex given the structure.

    upload_2022-10-18_22-33-59.png

Also wondering why this reads backwards; I thought a float/int would go the other way? To get the first byte of the float, I have to refer to byte[3].

    upload_2022-10-18_22-40-43.png
     
  10. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    If the parallel scheduling cost is negligible relative to the workload, then the worst that can happen is you become 100% memory bound and your multi-threaded reads have nearly the same performance as a single thread. It is only writing where you have to worry about false sharing, but in practice, I have yet to see a parallel job choke on this in Unity even with some crazy access patterns. I suspect due to the safety rules Unity imposes that pending writes on shared cache lines just get stacked in the store buffer, and since those cache lines are usually only written to and not read from, it doesn't slow things down at all.

    Best case scenario is each thread works on a little smaller than L2-sized block and every read hits L2 cache.

    It is called "endianness".
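To make that concrete, here's a minimal plain-C# check (independent of Unity) showing why byte index 3 of a float holds the sign and high exponent bits on a little-endian machine, which covers x86 and typical ARM targets:

```csharp
using System;

static class EndianDemo
{
    // 1.0f has the bit pattern 0x3F800000. On a little-endian machine
    // the least significant byte is stored first, so reading memory in
    // order gives 00 00 80 3F: the "first" (most significant) byte of
    // the float ends up at index 3.
    public static byte[] FloatBytes(float f) => BitConverter.GetBytes(f);
}
```

BitConverter.IsLittleEndian can be checked at runtime if you ever target an exotic platform.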

    So perhaps I should have elaborated more on this problem when I proposed sorting to you.

    In practice, very rarely is it valuable to sort one large list with a million elements. At max I think I have ever needed to sort about 50,000 for a command buffer that was written to in parallel and used an entityInQueryIndex sortKey (this is an ECS thing). What is a lot more common is having to sort results after a parallel bucketing algorithm. For example, you might need to sort all the values associated with a given key in a NativeParallelMultiHashMap. One of my use cases is for collision detection I use a coarse-resolution spatial hash and then sort objects in each hash bucket.

    So in real use cases, I am much more likely to sort 50 buckets of 1000 objects each or 1000 buckets of 50 objects each (or some mix of sizes) rather than one large list of 50000 objects. The problem is already naturally parallelizable so the biggest performance wins will come from the algorithm and instructions generated.

    Usually for very small counts, a sorting network is optimal. For slightly larger, there's insertion sort. For larger still, up to 300 - 1000 items depending on your key size, quick sort typically wins. But then above that there's radix sort. Then one has to consider the overhead of dynamically choosing the appropriate algorithm, how to manage a payload associated with the key if there is one and if so how big, and then some situations require sorting to be stable which can invalidate certain techniques for certain size ranges.
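As a sketch of the smallest case above, here is the optimal 5-comparator sorting network for 4 elements (a generic illustration, not any particular library's implementation):

```csharp
using System;

static class SortingNetwork
{
    static void CompEx(Span<float> a, int i, int j)
    {
        // Branch-free compare-exchange: min/max lets the compiler avoid
        // unpredictable branches, which is what makes networks fast for
        // tiny fixed sizes.
        float lo = Math.Min(a[i], a[j]);
        float hi = Math.Max(a[i], a[j]);
        a[i] = lo; a[j] = hi;
    }

    // The optimal 5-comparator network for 4 elements.
    public static void Sort4(Span<float> a)
    {
        CompEx(a, 0, 1); CompEx(a, 2, 3);
        CompEx(a, 0, 2); CompEx(a, 1, 3);
        CompEx(a, 1, 2);
    }
}
```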

    This is the reason why I asked "how practical do you want to be?" because a lot of parameters worth considering can only be derived from the use case. And picking real use cases is hard if you don't have a real project with real performance bottlenecks. That's also one of the many reasons why I try to make my projects and experiments open source so that people who don't have such a project but still want to experiment have something to work off of.
     
  11. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
@DreamingImLatios I did check out the various sorting algorithms that exist but didn't take time to really understand them; I like going on my own tangent of discovery even if it leads to failure. Always learning things along the way.

    I try to keep code simple and English. C# code is often split up so much that it reminds me of why C++ macros are apparently bad. The difference, though, is you can write your own English macros. I so wish C# had a #define feature.

    I like to be practical; applying it to use cases is another thing. But maybe I'll make something useful one day! Guess that is why I was looking for inspiration. Going with the flow of change is another thing too. Trying to embrace what exists at the same time as taking my own stubborn route, hehe!
     
  12. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Well, here are my current sorting algorithm results, roughly (slightly less than my predictions, hehe), which I'm quite happy with. For me, messing around is learning more! Based on pre-warm-up, 1000 loops.

    upload_2022-10-28_0-53-46.png

One interesting factor is that pointer++/-- doesn't seem to work the same way as it would normally, although normally it would be inside a fixed statement. I get some items not sorting, or crashes. Not sure what the difference is compared to assigning the pointer directly?

Code (CSharp):

        for (int j = 1; j < size + 1; j++)
        {
            ptr = &number[Ascender + j];
            // ptr++; **Unstable**
            if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
        }
     
  13. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
Well, compared to the code snippet you have now, it seems your loop control logic must have changed as well, so it was probably something else in your loop control that was off by one.

    Though on x86, not using increment operators may actually be beneficial, as the way the code is written now, I could see Burst caching the address of &number[Ascender] and then using [A + kB] addressing. Hard to say without seeing full code though.
     
  14. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
@DreamingImLatios It may be due to just running the code. My results above were wrong because I was using Execute for everything (not just the lower array sizes). I've now split into Quick, and the longer arrays (16+) use Run, giving 3-4x.

    upload_2022-10-28_8-51-57.png

    Here's the Quick code. Length 16 seems to be the tricky one to enhance using either.

Code (CSharp):

        public unsafe void Quick()
        {
            float* number = (float*)NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(numbers);

            float* ptr = number;

            int size = numbers.Length - 2;

            if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
            for (int i = 0; i < size; i++)
            {
                for (int j = 1; j < size + 1; j++)
                {
                    ptr = &number[j];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }
                for (int k = size; k > -1; k--)
                {
                    ptr = &number[k];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }
            }
        }
     
  15. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
I think I can optimise this further; it was a work in progress on top of my main code, which uses bitshifts and branches.
     
  16. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    Is Native Sort Unity's sorting implementation? And is all of this running in Burst without safety checks?

I guess I'm just surprised to see a cocktail sort beat out a quick sort. Though for 16 elements or fewer, Unity's sort uses insertion sort, which is a more competitive sort, though I suspect Unity's implementation may have some unnecessary overhead.
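For reference, a generic insertion sort looks like this (a sketch for comparison, not Unity's actual implementation):

```csharp
using System;

static class SmallSort
{
    // Classic insertion sort: for small arrays the quadratic worst case
    // doesn't matter, and the inner loop is extremely cache- and
    // branch-predictor-friendly on nearly sorted data.
    public static void InsertionSort(Span<float> a)
    {
        for (int i = 1; i < a.Length; i++)
        {
            float key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key)
            {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }
}
```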
     
  17. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Well, I found NativeSort exists so worked with that; not sure what else exists. All safety checks off. Not really following any algorithm, just using my head and figuring out how different ideas work. Currently on about idea 20!

    Also nice to know about the tuple swap, so tidy.
     
  18. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
    upload_2022-10-28_23-17-8.png

Here are my current results. Would love to understand why 256-512 gives almost 6x performance; must be something to do with the overall size. Will share my branching code when I'm happy with it. But the 0-16 code (and also the 0-8 branch code) is below. If this can be enhanced further, let me know! :)

Code (CSharp):

        public unsafe void Quick()
        {
            float* number = (float*)NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(numbers);

            float* ptr = number;

            int lhs = -1, rhs = numbers.Length - 2;
            int check = rhs / 2; // Rounds down

            while (lhs < check)
            {
                for (int up = ++lhs; up != rhs + 1; up++)
                {
                    ptr = &number[up];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }

                for (int down = --rhs; down != lhs - 1; down--)
                {
                    ptr = &number[down];
                    if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
                }
            }
            if (rhs - lhs != 0)
            {
                ptr = &number[lhs + 1];
                if (ptr[0] > ptr[1]) (ptr[0], ptr[1]) = (ptr[1], ptr[0]);
            }
        }
     
  19. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    Trindenberg likes this.
  20. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Will have to give it a proper read. I did attempt intrinsics but it was very slow, most likely my error! Figured Burst probably does a better job of optimising than I can, unless I really understood how to use intrinsics in a more optimal way than the compiler can with plain code. Not sure where my code fits into O(n) as it depends on branches. Basically, per bit depth, sort left and right; left becomes an additional branch until right becomes a dead end and jumps to the next branch. Took me 2 weeks to try and translate it from my brain! Just need to add the initial split for signed types, as they would be reversed. Maybe for larger arrays I could Schedule 4 in parallel, but no idea if there would be a performance gain from doing that (it would be operating on 4 separate parts of the array).
     
  21. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
I did give that a read and went back to one of my initial ideas based on counting positions, and updated that for 16-32. It also seems to run faster using Span compared to NativeArray for the temp numbers. Not sure how I could make that work better without Span (but I assume NativeArray uses Span somewhere).


Code (CSharp):

        public unsafe void Quick_16_32()
        {
            int
                index,
                length = numbers.Length,
                idx = 0;

            float*
                number = (float*)NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(numbers),
                end = number + length,
                numA = number - 1,
                numB,
                tNum = stackalloc float[length];

            Span<float>
                tNum2 = new Span<float>(tNum, length),
                tNum3 = new Span<float>(number, length);

            while (++numA != end)
            {
                index = idx;
                numB = number;

                while (numB != numA) if (*numA < *numB++) index--;
                numB++;
                while (numB != end) if (*numA > *numB++) index++;

                tNum2[index] = *numA;
                idx++;
            }

            tNum2.CopyTo(tNum3);
        }
Also managed to parallelise my main sort at 2048+, but only used 4 branches. Might try 8, but not sure it will make much difference; definitely great on larger arrays though! The overall code is slower now as I wrapped it in a method. Whenever I break things down into methods I get a slowdown, which is presumably why NativeSort looks like it has many nested methods within methods.

    upload_2022-10-30_0-54-58.png

    Just for testing at the moment. I still need to complete my branching job and tidy it up and then I'll share that.

Code (CSharp):

        [BurstCompile]
        public static unsafe void TSort(NativeArray<float> na)
        {
            var len = na.Length;

            if (len == 2)
            {
                float* n = (float*)NativeArrayUnsafeUtility.GetUnsafeBufferPointerWithoutChecks(na);
                if (*n > n[1]) (*n, n[1]) = (n[1], *n);
            }
            else if (len < 9)
            {
                new FastSortJob5e()
                {
                    numbers = na
                }
                .Quick_8();
            }
            else if (len < 33)
                new FastSortJob5e()
                {
                    numbers = na
                }
                .Quick_16_32();
            else if (len < 1025)
                new FastSortJob5e()
                {
                    numbers = na
                }
                .Run();
            else
            {
                NativeArray<BitBranch2> br = new NativeArray<BitBranch2>(4, Allocator.Temp);
                new FastSplitterJob5e()
                {
                    numbers = na,
                    branches = br
                }.Run();

                new FastSortJob5eParallel()
                {
                    numbers = na,
                    branchesIn = br
                }.Schedule(4, 1).Complete();
            }
        }
     
    DreamingImLatios likes this.
  22. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
It wasn't too much work to implement 8 sort arrays; a marginal improvement considering 2 extra cores are used. So, 131072 numbers in 0.52 ms; no idea what that compares to!

    upload_2022-10-30_1-30-18.png
     
  23. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
No idea if this has any uses, but this is based on the number of unique integers (non-parallel at this stage):

    2 items:

    upload_2022-11-3_20-27-38.png

    32 items:

    upload_2022-11-3_20-31-54.png
     
  24. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    If by that you mean that all integers in the array have to be unique, and you are exclusively sorting integers, then there is one use case I know of. Freelists contain a sequence of indices in another array which are "unused". If you need to perform a compaction on the element array, sorting the freelist in ascending order makes the compaction process a linear operation. Additionally, if you sort in descending order, then your freelist will temporarily prioritize lower free indices for allocation which can be advantageous.

A second use case is for sorting a 16 bit key and 16 bit value pair. This can be accomplished by shifting the key left by 16 and concatenating with the value, then sorting. The value bits shouldn't affect the end result (though depending on the algorithm, sorting them could be unnecessary work). Though if your algorithm could be extended to support a 32 bit key and 32 bit value, that covers a lot more use cases.
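As a sketch of the shift-and-concatenate idea (the names here are mine, not from any framework): packing the key into the high bits means an ordinary unsigned integer sort orders pairs by key.

```csharp
static class KeyValuePack
{
    // Key goes in the high bits, so plain unsigned comparison orders
    // pairs by key first; the value rides along in the low bits.
    public static uint Pack16(ushort key, ushort value) => ((uint)key << 16) | value;
    public static ushort Key16(uint packed) => (ushort)(packed >> 16);
    public static ushort Value16(uint packed) => (ushort)packed;

    // Same idea widened to a 32 bit key and 32 bit value in a ulong.
    public static ulong Pack32(uint key, uint value) => ((ulong)key << 32) | value;
    public static uint Key32(ulong packed) => (uint)(packed >> 32);
    public static uint Value32(ulong packed) => (uint)packed;
}
```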
     
    Trindenberg likes this.
  25. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
@DreamingImLatios By unique I mean the number of unique values (multiples of each). So for example, maybe you have red, green, blue, black, white items, and you want to group them together in some order. This concept seems great for that! I removed the unnecessary bits to check per branch. So if you mean ordering with, say, index 0/unused at the start/end, I guess that would work.

You will have to explain the scenario more; I'm more visual in my thinking really! I did think about the concept of ordering closest points by sorting x y z, and using that as a starting point for checks. But not sure if that would work. :) Also wondering if it's possible to Run a single-thread job and Schedule other threads from the job, and whether there is a likely cost per Schedule, rather than sorting say 8 segments and then Scheduling 8 items in one call at the end. No idea!
     
  26. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
Oh. That's a very different use case, though one I have encountered in practice. I have a coarse-grained 3D grid with usually somewhere between 10 and 258 cells. In a parallel job, I assign each element a cell index (multiple elements can share an index). Then I use a single-threaded counting sort to get the elements all grouped together.

    Code is here. You'll also notice I am doing a value -> rank conversion, as the payload size for this algorithm is costly to copy around. https://github.com/Dreaming381/Lati...ders/BuildCollisionLayerInternal.cs#L362-L393
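The counting-sort grouping described above can be sketched in plain C# (a generic illustration, not the framework code at the link): count per cell, prefix-sum the counts into start offsets, then scatter.

```csharp
using System;

static class CountingSortDemo
{
    // Groups element indices by their cell index in O(n + cells).
    // The scatter preserves the original relative order, so this is
    // also a stable sort.
    public static int[] SortByCell(int[] cellOfElement, int cellCount)
    {
        var counts = new int[cellCount + 1];
        foreach (int c in cellOfElement) counts[c + 1]++;

        // Prefix sum: counts[c] becomes the start offset for cell c.
        for (int i = 1; i <= cellCount; i++) counts[i] += counts[i - 1];

        var order = new int[cellOfElement.Length]; // element indices, grouped by cell
        for (int e = 0; e < cellOfElement.Length; e++)
            order[counts[cellOfElement[e]]++] = e;
        return order;
    }
}
```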
     
  27. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
@DreamingImLatios So if I understand correctly, the collision layers are increasingly smaller containers of object/entity indices? Axis layers? You have more experience with all this than me! I just like messing around with numbers and speed!
     
  28. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
So for more context, a CollisionLayer is an acceleration structure of collider data, usually with a layer representing a particular type of object, like bullets. Inside one of these layers is a low-resolution grid that colliders get binned into, and then inside each cell of the grid I do additional sorting of elements by their floating-point x-axis value using a radix sort. That gets used for a single-axis sweep-and-prune algorithm which I've spent a lot of time optimizing.
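As a rough illustration of single-axis sweep and prune (a naive sketch, not the optimized version described here): sort intervals by their minimum x, then each interval only scans forward while the next minimum is still inside its maximum, so densely separated scenes cost far less than O(n^2).

```csharp
using System;
using System.Collections.Generic;

static class SweepAndPruneDemo
{
    public struct Interval { public float Min, Max; public int Id; }

    // Note: sorts the caller's array in place for simplicity.
    public static List<(int, int)> FindOverlaps(Interval[] boxes)
    {
        Array.Sort(boxes, (a, b) => a.Min.CompareTo(b.Min));

        var pairs = new List<(int, int)>();
        for (int i = 0; i < boxes.Length; i++)
            // Stop scanning as soon as the next interval starts past
            // this one's end; everything after it starts even later.
            for (int j = i + 1; j < boxes.Length && boxes[j].Min <= boxes[i].Max; j++)
                pairs.Add((boxes[i].Id, boxes[j].Id));
        return pairs;
    }
}
```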
     
  29. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
I did check out your post on this; it looked like an interesting and efficient concept, although my knowledge of AABB stuff is lacking. I was thinking about your key/value setup; I assume you would want 32-bit for more object indices. But I'm wondering why you need to shift them into a 64-bit. Wouldn't you just interpret it as 2 x 32-bit with a 64-bit offset, i.e. a struct union or such? Say a struct { [0] int key, [4] float value, [0] ulong pair }, or would that be 4, 0? I was somewhat confused about the endianness you mentioned before, and where it comes into play (I know so far that memory writes bytes backwards, something like that!). Or is it due to 2 alignments of 32, added together into a new location/process?

Also, I was going to share some Burst Inspector output. While this is very efficient in code, this seems like a lot going on for a small task! Some of the instructions are 128-bit, so does that mean something is being vectorised, or did it just use them anyway? Sorry, a few questions here!

bptr is a uint* to each float item; end is array+1*. I stuck to float as it seemed faster even for whole numbers, although that's probably weighed out by casting int to float outside of the test.

Basically it isolates only the bits that change, so it doesn't need to look at the other bits. My earlier version (with a more basic setup) is now much faster: 60x on 2 unique items, about 10x on 256 unique items.

At the moment this part happens initially, and then again while sorting branches (lower/upper + A/B). But this is just 4 lines of code in this inspector view, crazy!

    upload_2022-11-4_21-27-20.png
     
  30. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    4,264
    If your sort function accepted a NativeArray<ulong>, then I could use the shift and pack strategy to sort a 32 bit key and 32 bit value pair. In some cases, I could just pack them in a struct union, but the shifting strategy is platform-agnostic and Burst can usually see through it and convert it into two move operations if that is optimal.

    That is Burst doing a very Burst thing, which is autovectorizing your code using AVX to perform operations on 8 elements at a time, and then unrolling 4 of those AVX operations in sequence for a total of 32 elements per loop iteration with just 19 instructions. Then for the remainder it takes 7 instructions per element. Burst did you a solid, which is a huge part of why you are seeing such good performance.
     
    Trindenberg likes this.
  31. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
What I've learnt so far is to compact as much as possible, and that more local variables doing less is better than fewer variables doing more. At least I think so, anyway. As for data size, it shouldn't be an issue. What I haven't bothered with at this stage is negative numbers, which sort in the opposite order. For now I'll focus on positive sorting until I'm happy with the way it works.
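For when negative numbers do get tackled: a common trick (standard for radix-sorting floats, not specific to this project) maps each float's bit pattern to a uint whose unsigned order matches the float's numeric order.

```csharp
using System;

static class FloatKey
{
    // Negatives get every bit flipped (which reverses their order and
    // clears the sign bit below positives); positives just get the sign
    // bit set. Unsigned comparison of the results matches float order.
    public static uint ToSortableBits(float f)
    {
        uint bits = (uint)BitConverter.SingleToInt32Bits(f);
        uint mask = (bits >> 31) != 0 ? 0xFFFFFFFFu : 0x80000000u;
        return bits ^ mask;
    }
}
```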
     
  32. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Am I right in saying that job.Run() essentially makes a copy of the struct? I spent some time befuzzled over how, say, I have an int X as part of the struct, I create a new job with X = 5, and within Execute() I do X = 10. If after the job I check what X is, it's still 5!

Of course the way around that is to use a NativeReference, but how come the job doesn't change the variable in the struct? I spent some time trying to figure it out and came to the assumption that a copy is created. It also seemed to be the case when creating a persistent Native<> at the point of construction (rather than a reference). Or am I missing something here!
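This behaviour can be reproduced without Unity at all; it's plain C# value-type semantics, since the scheduler receives the job struct by value. A hypothetical stand-in (FakeJob and RunByValue are illustrative names, not Unity API):

```csharp
static class JobCopyDemo
{
    struct FakeJob
    {
        public int X;
        public void Execute() => X = 10; // mutates this copy only
    }

    // Mimics what a scheduler does: it receives the struct by value,
    // so Execute mutates the scheduler's copy, not the caller's.
    static void RunByValue(FakeJob job) => job.Execute();

    public static int AfterRun()
    {
        var job = new FakeJob { X = 5 };
        RunByValue(job); // the copy is mutated, the original is untouched
        return job.X;    // still 5
    }
}
```

This is why results have to come back through memory the job points at (NativeReference, NativeArray) rather than through fields of the struct itself.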
     
  33. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Another thing bugging me: if I somehow make some broken code that compiles into Burst and it freezes up (it could be a minor change where it makes no sense why it breaks), then even if I revert, reload Unity etc., it still freezes up like it's the broken version. I read somewhere about versions not recompiling every time, but not sure if that's the case. Other than that, I should probably try using fewer pointers, lol.
     
  34. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Well, today it's working again. It might be one of 2 things. I was trying to tidy up repetitive code with some inlining; maybe I wasn't doing things properly using a static/AggressiveInlining. Then again, it might not be that. It might also be that it doesn't like pre-decrement pointers, e.g. *--ptr, but that's just a hunch. Does save writing extra code though.

Also want to try a new idea. My current one compares the array start with the end and swaps, meeting somewhere in the middle. But I'm wondering if memory works faster going left to right to some point in the middle, rather than left and right towards the middle? I think it will, so that's my next experiment!
     
  35. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    396
Thought I would test my BitSort at runtime with IL2CPP. What I don't get is that NativeSortExtension.Sort performs much worse at runtime compared to the editor. What causes some things to run slower at runtime than in the editor? Is there another sort I should be using?

    (65536 ints)

    Editor:
    upload_2022-11-20_21-14-20.png

    Runtime:
    19.055156 ms, 1.018265 ms
    19.198761 ms, 1.004608 ms
    17.796991 ms, 0.971410 ms
    17.770999 ms, 1.042940 ms
    18.400099 ms, 1.013794 ms
    18.998333 ms, 1.007333 ms
    17.632993 ms, 0.974922 ms
    17.312093 ms, 0.970910 ms
    17.294114 ms, 0.977325 ms
    17.259330 ms, 0.970193 ms
    17.365828 ms, 0.978546 ms