
For super fast processing of data is there any way to ensure data is stored in cache?

Discussion in 'Entity Component System' started by Arowx, Oct 9, 2020.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Let's say you want to write a super fast game and the core data elements in that game can all be stored in cache.

    For example, your game could have 1,000 moving things (float3: 4 bytes × 3 × 1,000 = 12 KB) that need to be updated and processed by a set of systems: physics, AI, navmesh, input, collision, audio.

    Each system may bring additional data to work with, but the core game entity data is common to all the systems. Therefore, if that data could be maintained in cache it could reduce the RAM I/O bandwidth needed to run the game, in theory boosting the game's performance significantly.

    Your DOTS systems then just have to keep a reference to the cached native array and pass it between them, processing the data at super fast speeds, with only the code cache and additional data having to be queued up with each system as it is triggered.

    So is there a way to store data only in the cache between system calls, and therefore reduce read/write-to-memory delays, and what level of performance could this provide?

    Note: this is termed prefetching (https://en.wikipedia.org/wiki/Cache_prefetching) - can DOTS do this?
     
    Last edited: Oct 9, 2020
    Egad_McDad and Krajca like this.
  2. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    For modern platforms (x86 and ARM), outside of a couple of prefetching instructions which are merely hints, the cache hierarchy is completely abstracted from the executing code. The memory controller may choose to evict cache lines for whatever reason it chooses. If you wanted the address range in cache to be persistent, you would need something like TCM, which you will only find in custom hardware architectures.

    If you access cache lines linearly, the hardware prefetcher will catch on and start "running ahead" to reduce latency. If you need random access like iterating through an array of indices, then you can try using software prefetching, but it is very difficult to get right.

    Burst has software prefetching intrinsics.
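
    To make the access-pattern point concrete, here is a minimal sketch contrasting the two cases (the job and field names are illustrative; assumes using Unity.Burst; using Unity.Collections; using Unity.Jobs;):
    Code (CSharp):
    // Illustrative sketch only.
    [BurstCompile]
    public struct GatherSumJob : IJob
    {
        [ReadOnly] public NativeArray<float> Values;
        [ReadOnly] public NativeArray<int> Indices;
        public NativeArray<float> Sums; // length 2: [0] = linear, [1] = gathered

        public void Execute()
        {
            // Linear walk: consecutive addresses, so the hardware prefetcher can run ahead of the loads.
            float linearSum = 0f;
            for (int i = 0; i < Values.Length; i++)
                linearSum += Values[i];

            // Index-driven walk: each load address depends on Indices[i], which the hardware prefetcher
            // cannot predict. This is the case where software prefetch hints can help, and where getting
            // the prefetch distance right is the hard part.
            float gatheredSum = 0f;
            for (int i = 0; i < Indices.Length; i++)
                gatheredSum += Values[Indices[i]];

            Sums[0] = linearSum;
            Sums[1] = gatheredSum;
        }
    }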
     
    Occuros, NotaNaN, SenseEater and 2 others like this.
  3. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I'm not sure you're correct here; it looks like the PREFETCHh instructions are designed to bring RAM data into the CPU's cache. The term 'hint' refers to the fact that the instruction does not affect the program's behaviour. It is there to move data into the cache, and the Intel x86 instruction even has four different cache-level 'hints'.

    source - https://www.felixcloutier.com/x86/prefetchh

    The ARM instruction set has the PRFM (immediate) op code (source)

    GCC even has a __builtin_prefetch() function (source) designed to ensure the memory is pulled into cache, which I presume uses the PREFETCHh instructions.

    _mm_prefetch for Microsoft and Intel C++ compilers.

    Maybe Unity could add this to their Burst / IL2CPP / DOTS build system where it could have the most impact e.g. maintaining a cache of positions/transforms for the main moving game elements between systems.

    Or implement it within the Unity engine itself under the hood?
     
    Last edited: Oct 11, 2020
  4. Joachim_Ante

    Joachim_Ante

    Unity Technologies

    Joined:
    Mar 16, 2005
    Posts:
    5,203
    Arowx, please actually take the time to read posts from others on the forum when they reply.
     
    Neonlyte, Occuros, OndrejP and 6 others like this.
  5. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    OK, but would they ensure core component position data is maintained in cache between systems, as this data is in theory core to most games?

    Or would the system process flow need to be managed to ensure cache/core 'affinity' is maintained?

    Are the software prefetch intrinsics using PREFETCH op codes, or could they be better implemented with other CPU op codes?
     
    Last edited: Oct 11, 2020
  6. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    ;)
    There's this magical tool in DOTS called "Burst". With it comes another magical tool called the "Burst Inspector". The Burst Inspector is magical because it shows you the exact assembly instructions it generates for any C# code it compiles. That means that this tool combined with VTune contains not just the answer to this question, but perhaps all related questions in the context of x86. While I have used Valgrind with ARM in a non-Unity context, I don't actively target ARM in my Unity projects, so I can't say nearly as much there.

    Anyways, a lot of your questions have been theoretical in a domain that is very application-specific. You'll get a lot more answers testing stuff out on your own rather than asking on the forums. If you need some real-world use cases to experiment and test your theories on, I have some. PM me.
     
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It's just an idea that's forming:
    • Latency of any kind is a bad thing in games.
    • CPU processing is fastest on things in L1 cache.
    • We seem to always pass a single rendering buffer to the GPU (I think?).
    What if we could change/use these to boost game engine performance and reduce latency...
    • Send at least two rendering layers to the GPU, one with static scenery the other with dynamic game elements.
    • Keep fast twitch latency related data/code in the L1 cache e.g. controller/inputs, players, camera.
    With a layered, latency-centric approach, in theory we could have much more performant games that utilise the raw speed of the CPU where it is needed most.

    Side note: in theory we could fit an entire retro 8-bit game into a modern L1 cache, as well as the 8-bit system's display RAM and OS ROM.
     
  8. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    This is how Hybrid Renderer V1 worked. Hybrid Renderer V2 has one buffer but only sends up parts of it each frame based on component change filters. This is even more granular because now if a chunk of entities only change colors every three frames, those colors (and only the colors, not necessarily the other properties) will be uploaded every three frames.

    Once stuff gets dragged into L1 (which is every time you touch the data), it doesn't leave until something needs to take its place. So if most of your game fits in L1, then the data you care about is going to be in L1 most of the time, automatically. That's how the hardware works. Trying to change that behavior will more likely make things slower because the engineers that design these systems are crazy smart and are regularly hitting "speed of light" limits in the hardware design.

    Fitting a game into L1 cache mostly matters for battery-sensitive platforms. That's why for Project Tiny using smaller data types is typically better than optimizing instruction count. But for other platforms, if the game fits in L1, it probably isn't that taxing on the system even if L1 were ten times slower. That's just not a lot of data to compute. No, it is when you scale the simulation up way past the size of L1, L2, and even L3 that performance starts to matter. But at that point, especially with ECS as-is, you are almost always going to run into other bottlenecks before you run into cache efficiency issues. I've run into an issue with too much trig in a shader, an issue with microtriangles, and an issue where the scale of the data actually made EntityCommandBuffer expensive while I had threads idling. You never know what your next bottleneck is going to be. That's why this is so application-specific and why talking purely "theoretically" about "ideas" is nonsensical without the context of an actual problem.

    As I said before, if you need a real-world use case with actual performance problems (at scale) to apply your "ideas" to, I have one. Even if all you do is run it in the profiler, look at the code, and ask questions about the performance problems, you will learn far more than any answer to any theoretical question will ever teach you.
     
    Arnold_2013, slims, Occuros and 6 others like this.
  9. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    What about system core affinity, where latency-related code and data on a multi-core system could be limited to one core, ensuring that the code and data stay in cache?
     
  10. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Is this an actual issue you are encountering?
    If not, STOP! And reread what you just replied to over and over until it makes sense why I am asking you to reread it.
     
    Last edited: Oct 14, 2020
  11. WAYNGames

    WAYNGames

    Joined:
    Mar 16, 2019
    Posts:
    939
    First make your code work; then, IF it's too slow, analyze it and figure out optimizations.
     
    FakeByte likes this.
  12. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I'm referring to the inherent latency in our game dev tech, game engines, graphics APIs, architectures and even the hardware itself.

    For instance, Unity just recently analysed their rendering pipeline's buffering system and found there was room for optimisation -> https://blogs.unity3d.com/2020/10/0...020-2-for-smoother-gameplay-what-did-it-take/

    Input latency is a big issue with games and a recurring issue with Unity games. The thing is, our hardware is so amazingly fast - we have 1000 Hz mice, 240 Hz monitors and 2-5 GHz CPUs and GPUs - so why should it take 30-60 ms for user input to show on a display that is updating every ~4.17 ms?

    In VR games/tech they need to allow for dynamic rendering to provide a smooth experience with minimum input lag, so they have techniques that could be used in regular games to improve input latency.

    In a way, monitor manufacturers are piling pressure on game engines and CPUs/GPUs to keep up with ever-expanding, faster and faster displays.

    PS: The ideal resolution for VR is above 6K @ 240 Hz and for monitors over 8K @ 240 Hz; after that we are limited by human vision. So we are approaching the upper human limits of 2D display technology.
     
  13. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Mind you it looks like tech companies are building latency testing systems into hardware...



    In this video a latency testing system by Nvidia shows that even at 360 Hz refresh rates the input latency of an AAA title is still over 10 ms.
     
  14. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    Sorry to resurrect this thread.

    I have a trivial job I am trying to optimize and, looking at the Burst Inspector, the auto-vectorizer seems to have done a great job:
    [screenshot: Burst Inspector assembly output]

    Now I have a different version of that job that uses NativeSlice instead of NativeArray and that version uses all scalar operations. I expected to see a performance increase but I am seeing no change in performance between the two. My first thought was that the code is memory bound as the job is not very complex.

    Would that be an accurate assessment here? I am reading linearly from an array so I would expect that cache is being used effectively. I am not sure about writing back to the output array.

    Here is the job in question:
    Code (CSharp):
    [Unity.Burst.BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]
    public struct RemapFloatsToByteJob : IJobParallelFor
    {
        [ReadOnly] private NativeArray<float> input;
        [WriteOnly] private NativeArray<byte> output;
        private readonly float2 inputMinMax;

        public RemapFloatsToByteJob(NativeArray<float> input, NativeArray<byte> output, float2 inputMinMax)
        {
            this.input = input;
            this.output = output;
            this.inputMinMax = inputMinMax;
        }

        public unsafe void Execute(int index)
        {
            output[index] = (byte)math.round(math.lerp(0f, 255f, math.clamp(math.unlerp(inputMinMax.x, inputMinMax.y, input[index]), 0f, 1f))); // remap clamped
        }
    }
    Also, are there any examples of correct usage of the prefetching intrinsics? I haven't been able to find any, not sure if it would be of any use here anyway.

    Is it usual to see no performance gain from vectorizing small jobs like this or is there something I'm misunderstanding? Thanks.
     
    Last edited: Oct 27, 2022
  15. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    That is indeed strange. First, how are you measuring the performance? It could be that what you are measuring is not actually the code you are seeing in the Burst Inspector.

    But if you have a heavy-enough workload focused on the task, with safety checks and all disabled, I personally would resort to VTune to see what it knows. I don't have a great answer if VTune isn't an option for you.
     
    chadfranklin47 likes this.
  16. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    @DreamingImLatios thanks for the help. I am simply using a Stopwatch and calling complete on the job and measuring the time to finish (in both the editor with checks and all disabled as well as in a build). Both jobs take about the same time regardless of the workload size.

    I am not familiar with VTune, but I'll look into using that.
    Edit: I have an AMD processor so it looks like VTune isn't an option after all. May look at trying Superluminal.
     
    Last edited: Oct 4, 2022
  17. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    Some more things I tested:

    • Set the OptimizeFor field of the BurstCompile attribute on the scalar job to "FastCompilation" (as that is supposed to disable auto-vectorization, though using NativeSlice seemed to do the trick already)
    • Set OptimizeFor for the vectorized job to "Performance"
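    For reference, roughly what I mean (assuming a Burst version that exposes the OptimizeFor option; the struct names here are placeholders):
    Code (CSharp):
    // Placeholder structs, only to show where the option goes.
    // Assumes: using Unity.Burst; using Unity.Jobs;
    [BurstCompile(OptimizeFor = OptimizeFor.FastCompilation)] // scalar variant
    public struct ScalarRemapJob : IJobParallelFor
    {
        public void Execute(int index) { /* same body as the job above */ }
    }

    [BurstCompile(OptimizeFor = OptimizeFor.Performance)] // vectorized variant
    public struct VectorRemapJob : IJobParallelFor
    {
        public void Execute(int index) { /* same body as the job above */ }
    }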

    Both parallel jobs still took about the same time, with the scalar version seeming to slightly outperform the vectorized version by a few ms (with the totals being around 500ms for this test).

    Running a single-threaded IJob instead of IJobParallelFor, the vectorized version is about 1.75x as fast as the scalar version. Making the job more complex (performing additional rounding and lerps) gives a better edge to the vectorized version, which becomes 3-4x as fast as the scalar when doing 5x the work shown in my code above. These measurements are still done with a Stopwatch.

    I do hope these results are anomalous, or it would mean that, for simple jobs, vectorization has little benefit. My guess remains that writing to memory is just much slower than the simple calculations I'm doing. I have another job where the data is calculated on the fly (no input array, only output array) and I am getting similar results there.

    I'd feel better about it though if someone confirmed this is the case for them also (either with my job above or any of your own).

    Some resources that put me at ease (in the sense that this is expected of a simple job and I'm not doing something wrong):
    https://stackoverflow.com/questions...ructions-be-executed-without-cache-memory-acc
    https://qr.ae/pvevp1 (link to a response on Quora)

    The linked Quora answer by Dr. Victor Eijkhout seems to also provide an answer for why the multithreaded vectorized job has no performance edge over the scalar job:

    Edit:

    Here is another observation...
    modifying the body of both threaded jobs to:
    Code (CSharp):
    output[index] = (byte)input[index];
    remained at 500ms for both jobs (no performance increase)
    whereas modifying to:
    Code (CSharp):
    // using index itself, not the input value at index (so no reading from memory)
    output[index] = (byte)math.round(math.lerp(0f, 255f, math.clamp(math.unlerp(inputMinMax.x, inputMinMax.y, index), 0f, 1f)));
    dropped the time down to ~100ms for both jobs. (5x performance increase)

    This seems to confirm that the job is memory-bound, as removing those additional computations didn't speed things up whereas removing the memory read did. I expected the memory write to be more of a factor than the memory read, but the input type is larger than the output type here... so that must be taken into account.

    These tests were done with an array length of one billion.
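
    As a rough sanity check on those numbers (one billion elements, ~500 ms with the input read, ~100 ms without it), here is a back-of-the-envelope bandwidth estimate; it ignores write-allocate traffic and any read/write overlap, so treat it as an approximation only:
    Code (CSharp):
    using System;

    public static class BandwidthEstimate
    {
        public static void Main()
        {
            const long elements = 1_000_000_000;
            double readGB = elements * 4 / 1e9;  // float input  -> ~4 GB
            double writeGB = elements * 1 / 1e9; // byte output  -> ~1 GB

            Console.WriteLine($"Full job:    {(readGB + writeGB) / 0.5:F1} GB/s"); // ~10 GB/s over ~500 ms
            Console.WriteLine($"Output only: {writeGB / 0.1:F1} GB/s");            // ~10 GB/s over ~100 ms
        }
    }
    Both cases land at roughly the same effective throughput, which fits the memory-bound interpretation.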
     
    Last edited: Oct 9, 2022
  18. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Could you optimise your core code further?

    Code (CSharp):
    output[index] = (byte)math.round(math.lerp(0f, 255f, math.clamp(math.unlerp(inputMinMax.x, inputMinMax.y, input[index]), 0f, 1f)));
    math.<functions> tend to be slow and could be replaced with a few lines of code; for example, see below.

    Code (CSharp):
    //unlerp

    float v = (value - min) / (max-min);

    // as max min will probably not change you could invert it

    float invRange = 1f / (max-min); // then pass it into the function

    // then

    float v = (value - min) * invRange;

    // Note: multiplication tends to be faster than division on x86
    There will probably be more low-level smart optimisations you can apply when you unroll your math.<function> code e.g. duplicate sub-calculations or inverted division opportunities.
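
    As a sketch of folding that back into the job above (the struct name is made up, the usings are the same as for the original job, and the reciprocal of the range is computed once in the constructor):
    Code (CSharp):
    // Sketch only; same usings as the job earlier in the thread.
    [Unity.Burst.BurstCompile(FloatPrecision.Standard, FloatMode.Fast)]
    public struct RemapFloatsToByteFastJob : IJobParallelFor
    {
        [ReadOnly] private NativeArray<float> input;
        [WriteOnly] private NativeArray<byte> output;
        private readonly float min;
        private readonly float scale; // precomputed 255 / (max - min)

        public RemapFloatsToByteFastJob(NativeArray<float> input, NativeArray<byte> output, float2 inputMinMax)
        {
            this.input = input;
            this.output = output;
            min = inputMinMax.x;
            scale = 255f / (inputMinMax.y - inputMinMax.x);
        }

        public void Execute(int index)
        {
            // Subtract, multiply, clamp, round - the division has been hoisted out of the per-element work.
            float v = math.clamp((input[index] - min) * scale, 0f, 255f);
            output[index] = (byte)math.round(v);
        }
    }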
     
    Last edited: Oct 4, 2022
    chadfranklin47 likes this.
  19. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    @Arowx Thanks for the tip, might have to use that everywhere now...

    There was a significant performance increase in both single-threaded jobs with the scalar version benefitting more. The vectorized version is now about 1.35x as fast as the scalar (previously 1.75x). As for the multithreaded jobs, there was no measurable performance difference. I edited my previous post with some findings that help explain why.
     
    Last edited: Oct 7, 2022
  20. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    OK had a play and for 1,000,000 array elements got these results:

    A: 938,857 ticks.
    B: 37,051 ticks

    Method A
    Code (CSharp):
    outputA[i] = (byte)math.round(math.lerp(0f, 255f, math.clamp(math.unlerp(inputMinMax.x, inputMinMax.y, input[i]), 0f, 1f)));
    Method B
    Code (CSharp):
    double maxFactor = ((1 / ((double)inputMinMax.y - (double)inputMinMax.x)) * 255);
    float min = inputMinMax.x;

    for (int i = 0; i < input.Length; i++)
    {
        double f = (input[i] - min);
        f *= maxFactor;
        outputB[i] = (byte)(f + 0.499999999999);
    }
    I then checked the results and the faster method does generate about 10 off-by-one errors.

    So that's a ~25.3x boost in performance.

    Note: running in Editor.
     
    Last edited: Oct 4, 2022
    chadfranklin47 likes this.
  21. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    @Arowx How are those tests performed?

    For 1,000,000 elements I get these results from managed code:
    A: 554,477 ticks
    B: 150,880 ticks
    when I run it in an IJob I get:
    A: 3,461 ticks
    B: 2,772 ticks
     
  22. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Code (CSharp):
    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    using Unity.Mathematics;
    using System.Diagnostics;
    using TMPro;

    public class testFloat2Byte : MonoBehaviour
    {
        const int sampleSize = 1000000;
        float[] input = new float[sampleSize];
        byte[] outputA = new byte[sampleSize];
        byte[] outputB = new byte[sampleSize];

        public TextMeshProUGUI text;

        void Start()
        {
            for (int i = 0; i < input.Length; i++)
            {
                input[i] = UnityEngine.Random.Range(float.MinValue, float.MaxValue);
            }

            float2 inputMinMax = new float2(float.MaxValue, float.MinValue);

            for (int i = 0; i < input.Length; i++)
            {
                if (input[i] < inputMinMax.x) inputMinMax.x = input[i];
                if (input[i] > inputMinMax.y) inputMinMax.y = input[i];
            }

            Stopwatch timer = new Stopwatch();

            timer.Start();

            for (int i = 0; i < input.Length; i++)
            {
                outputA[i] = (byte)math.round(math.lerp(0f, 255f, math.clamp(math.unlerp(inputMinMax.x, inputMinMax.y, input[i]), 0f, 1f)));
            }

            timer.Stop();

            long t1 = timer.ElapsedTicks;

            UnityEngine.Debug.Log("A ticks:" + t1 + " " + timer.ElapsedMilliseconds + "ms");

            timer.Restart();

            double maxFactor = ((1 / ((double)inputMinMax.y - (double)inputMinMax.x)) * 255);
            float min = inputMinMax.x;

            for (int i = 0; i < input.Length; i++)
            {
                double f = (input[i] - min);
                f *= maxFactor;
                outputB[i] = (byte)(f + 0.499999999999);
            }

            timer.Stop();

            long t2 = timer.ElapsedTicks;

            UnityEngine.Debug.Log("B ticks:" + t2 + " " + timer.ElapsedMilliseconds + "ms");

            int countErrors = 0;

            for (int i = 0; i < input.Length; i++)
            {
                if (outputA[i] != outputB[i])
                {
                    //UnityEngine.Debug.Log("A:" + outputA[i] + " != B:" + outputB[i] + " f:" + input[i] + " min:" + inputMinMax.x + " max:" + inputMinMax.y);
                    countErrors++;
                }
            }

            UnityEngine.Debug.Log("Errors :" + countErrors);

            text.text = "A: " + t1.ToString("N0") + " ticks\nB: " + t2.ToString("N0") + " ticks\nErrors:" + countErrors;
        }
    }
    Using basic arrays and a MonoBehaviour - not using Burst, Jobs or DOTS.

    It will depend on Unity and C# version as well as PC hardware.

    It is odd that you're getting such a slow benchmark on B: 150,880 ticks?
     
    chadfranklin47 likes this.
  23. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    @Arowx Ok I see, I was still using NativeArrays in the Editor. Using your script I get:
    A: 580,765
    B: 20,128
    for a 28.85x improvement. Not bad at all! Thanks for taking the time & effort.
     
    Last edited: Oct 7, 2022
    Arowx likes this.
  24. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,574
    I wonder if wrapping any of these functions in aggressive inlining makes any difference.
    Or even splitting
    Code (CSharp):
    (byte)math.round(math.lerp(0f, 255f, math.clamp(math.unlerp(inputMinMax.x, inputMinMax.y, input[i]), 0f, 1f)));
    into

    Code (CSharp):
    var unlerp = math.unlerp(inputMinMax.x, inputMinMax.y, input[i]);
    var clamped = math.clamp(unlerp, 0f, 1f);
    var lerped = math.lerp(0f, 255f, clamped);
    byte result = (byte)math.round(lerped);
    And then wrapping it in aggressive inlining.
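    Something like this is what I have in mind (hypothetical helper; assumes using System.Runtime.CompilerServices; and using Unity.Mathematics;):
    Code (CSharp):
    // Hypothetical helper, only to show the split steps plus the inlining hint.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static byte RemapToByte(float2 minMax, float value)
    {
        var unlerp = math.unlerp(minMax.x, minMax.y, value);
        var clamped = math.clamp(unlerp, 0f, 1f);
        var lerped = math.lerp(0f, 255f, clamped);
        return (byte)math.round(lerped);
    }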
     
  25. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    909
    Superluminal is not fit for this.
    Try AMD uProf. Fantastic tool. IMO much better than VTune.

    And you guys should run Performance Tests, not just a script with Start(). It's a nice library and straightforward to use, giving much more accurate measurements.
     
    chadfranklin47 likes this.
  26. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    You must know something I don't, because I have no idea how you do any meaningful micro-architectural diagnosis with uProf. Care to explain your workflow?
     
  27. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    909
    These guys here are clearly looking for microarchitecture exploration and optimizing their L1-3 bounds and branch mispredictions, right?

    If you are truly asking this genuinely and not as a knee-jerk reaction: uProf has no feature parity with VTune, but you probably know this. The "Investigate instruction access" analysis has a lot of events you can add; those aptly named L1-3 cache hit/miss events should give you enough information about any bounds happening. I was able to accurately find, from the code and assembly view, the sources of my memory bounds and cache misses.

    Having now used all three tools (VTune, uProf and Superluminal) on the same project, I got much more concise information with uProf (to my surprise). VTune was for the most part not able to deal with the source, the assembly, Burst, or all of them together. The best tool is worth nothing if it can't do the basics consistently, and accurately mapping code to assembly is worth a lot, especially with aggressive compilers like Burst, which tend to inline, move or merge parts of the code.

    As long as VTune doesn't run on AMD CPUs, there are not many options anyway when it comes to instruction-based sampling. uProf is at least superior to sampling-based measurement tools like Superluminal, especially with these kinds of tests. Not to downplay the strengths of Superluminal - I own it and really like it.
     
  28. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,653
    They are aggressively inlined
     
    chadfranklin47 and Antypodish like this.
  29. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    I'm asking because I'm due for a system upgrade and it is the one thing holding me back from switching to AMD. I'm skeptical of Windows 11 so staying with Intel is not appealing, but I really do need a powerful tool that can help me analyze where I'm losing throughput in tricky highly cache-coherent situations.

    You don't see many guides on the internet of people using the micro-architecture analysis feature of VTune. Same for uProf. So any knowledge you can share on using uProf for analysis would be extremely helpful! I plan to share my VTune write-up with my next framework release. I don't want to share it now because it is 40 pages worth of text and images better suited for static site generation.
     
    chadfranklin47 and Occuros like this.
  30. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Look at what it does: is the clamp on the second line even needed if the min/max are based on in-sample values?
    Why lerp between 0f and 255f when you could just multiply your 0-1 value by 255 and round it to a byte?

    Also, as I mentioned above (#20), if you unroll the math.<functions> there are some under-the-hood optimisations you can get from simple pre-loop calculations that you probably don't get from aggressive inlining.
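
    In code, the simplification I mean looks roughly like this (keeping the clamp via math.saturate for safety):
    Code (CSharp):
    // Sketch: saturate replaces the clamp to 0..1 and the lerp collapses to a multiply by 255.
    output[index] = (byte)math.round(math.saturate(math.unlerp(inputMinMax.x, inputMinMax.y, input[index])) * 255f);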
     
    Last edited: Oct 7, 2022
    chadfranklin47 and Antypodish like this.
  31. Enzi

    Enzi

    Joined:
    Jan 28, 2013
    Posts:
    909
    I think someone with your proficiency will manage just fine. It lacks the nice visual summaries but the relevant data is all there in raw numeric form, still in a digestible manner.
    Probably best to take a look at the user guide:
    https://developer.amd.com/wordpress...ources/57368_User_Guide_AMD_uProf_v3.6_GA.pdf
    I can't sum this up even if I tried. :)

    Keep your old PC around in case you still need VTune. :) Honestly I'd be pretty interested in what you'd be missing though.
     
    chadfranklin47 likes this.
  32. elliotc-unity

    elliotc-unity

    Unity Technologies

    Joined:
    Nov 5, 2015
    Posts:
    228
    Note that if you're using Burst, it has intrinsics for a lot of the Unity.Mathematics ops that ignore the source code and just replace it with its own special thing. So don't trust the source of the math ops unless Burst is off.
     
    chadfranklin47 and Occuros like this.
  33. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    A lot of people use these profilers for identifying hotspots and basic issues. But honestly, Unity's tools work well enough for me for those sorts of things. I break out VTune when I have an ALU-heavy workload and need to squeeze out all the throughput I can get. Usually that's a balancing act of reducing instructions without creating what I like to call "port bubbles". With VTune, I can get a good sense of both when port occupancy is bad and why. Sure I might have 5 times more branch mispredictions than cache misses, but that might be because I snuck in a bunch of branchy code after requesting an unavoidable random access load and the load is still the limiting factor. Or sometimes I have a tzcnt in a hot loop with immediate dependencies on it (this affects Intel worse than AMD) and despite having fewer instructions overall, I suddenly have a bunch of cycles with unused ports. Counters by themselves are an illusion of usefulness. It is only when you can compare them relative to each other and the overall throughput that you gain real actionable insight, unless you already know what the problem is.
     
  34. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    378
    @DreamingImLatios How easy is it to use VTune with your code/Unity? Does it plug into VS somehow?
     
  35. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    It is super easy. I haven't encountered the issues Enzi was describing.

    1. Start VTune with elevated privileges
    2. Make a new project
    3. Start a micro-architecture analysis
    4. Point it at Unity's process (on Windows this is just Unity.exe)
      1. Also worth noting this is much easier if you only have one instance of the editor open
    5. Set the sampling frequency depending on what I am measuring
      1. Performance Tests - 0.1 ms
      2. Play mode with checks disabled and 5 to 30 ms per frame - 1 ms
      3. Play mode with checks disabled on stress test with 60-200 ms per frame - 5 ms
    6. Start the capture paused
    7. Go to Unity and start whatever it is I am capturing
    8. Unpause capture in VTune
    9. Let it capture for 1 to 2 minutes, or for performance tests, capture for the duration the test executes (for short tests, bump up the repeat count to make it run for at least 20 seconds for the unoptimized version)
    10. Stop capture, wait for VTune to do its analysis. You'll see a lot of "Could not locate/find/load" errors. Ignore them. It will find Unity's which is all you care about.
    11. When it is done, you'll get a summary view, but usually you want to switch over to Bottom-Up view and look at the top few functions
    12. In general, performance is usually instructions retired divided by the percentage throughput shown in the pipe graphic (lower is better). For example, if you lose 30% of your throughput to bad speculation, but making your code branchless doubles the instruction count, it probably isn't worth it.
    13. You can double click on the function's source file column to go to a C# source/assembly view. However, aside from seeing which loop or branch is hotter, I don't find it to be that useful.
    Note: VTune can be a little weird sometimes in how it breaks up functions and inlines, so be wary and surf through the function list to watch out for biases. Occasionally you might spot a situation where function call overhead is problematic just by surfing through. But in general this is a tool you want to apply to heavy Burst jobs. Generic jobs work too and unlike in the Timeline view in Unity (though I haven't verified if this is still the case in 22.2), VTune will give you the concrete definitions which is nice.
     
    Last edited: Oct 5, 2022
    Trindenberg likes this.
  36. Trindenberg

    Trindenberg

    Joined:
    Dec 3, 2017
    Posts:
    378
    @DreamingImLatios Thanks for the details, much appreciated. :) The one thing I wish I knew is how to link chains of methods with little loss in performance, e.g. if I had some simple Add/Subtract/Multiply methods to run in a sort of command buffer, is delegate*<> the one to use? Inlining surely won't work with a runtime command buffer.
     
  37. DreamingImLatios

    DreamingImLatios

    Joined:
    Jun 3, 2017
    Posts:
    3,983
    Depending on the quantity and complexity of your methods, you either want a jump table with the methods inlined, or you want an array of method pointers.

    The former can usually be accomplished with a switch case, and works especially well if your entire switch case loop can fit in a couple kilobytes of instructions. The latter requires all your functions have the same signature. I would try writing the code as a switch case of function calls and see if Burst automatically generates a static array of function pointers. I've seen it do it before. But otherwise you will have to rely on compiling Burst function pointers, which can be tedious to write.

    In any case, you'll likely be bottlenecked in the frontend fetching and decoding instructions from memory. There is a branch target predictor (which is different from the branch predictor) which will help you out.
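
    A minimal sketch of the switch-based jump table idea, with a made-up command set (assumes using Unity.Burst; using Unity.Collections; using Unity.Jobs;):
    Code (CSharp):
    // Illustrative only: a tiny command set dispatched from a switch inside one Burst job.
    public enum MathOp : byte { Add, Subtract, Multiply }

    public struct MathCommand
    {
        public MathOp Op;
        public float Operand;
    }

    [BurstCompile]
    public struct ExecuteCommandsJob : IJob
    {
        [ReadOnly] public NativeArray<MathCommand> Commands;
        public NativeArray<float> Result; // length 1

        public void Execute()
        {
            float v = Result[0];
            for (int i = 0; i < Commands.Length; i++)
            {
                var cmd = Commands[i];
                // The whole switch body stays small, so each case can be inlined and the instruction
                // cache plus branch target predictor handle the dispatch.
                switch (cmd.Op)
                {
                    case MathOp.Add:      v += cmd.Operand; break;
                    case MathOp.Subtract: v -= cmd.Operand; break;
                    case MathOp.Multiply: v *= cmd.Operand; break;
                }
            }
            Result[0] = v;
        }
    }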
     
    Trindenberg likes this.
  38. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    In this case, I have a noise generator that generates values within a theoretical range. For example, Mathf.Perlin can generate values slightly below 0 or slightly above 1. This was my solution to that when generating textures based on that noise value. And you're right, the lerp isn't necessary here. I have a method named "RemapClamped" which takes an input range and remaps and clamps it to an output range which I was using for convenience.
     
    Last edited: Oct 7, 2022
  39. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    There are SIMD comparison operations.

    So the question is: is math.clamp() faster than
    Code (CSharp):
    if (value > max) value = max;
    else if (value < min) value = min;
    Some testing needed...?
     
  40. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    With SIMD, clamp (i.e. max(a, min(x, b))) should be faster, as the branching logic requires two select statements. Looking at the assembly in the Burst Inspector shows clamp() with 2 instructions vs. the select statements with 4.
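
    For anyone wanting to reproduce that comparison, the two variants can be written out like this and inspected side by side in the Burst Inspector (helper names are made up; they need to live inside a Burst-compiled job):
    Code (CSharp):
    static float ClampVersion(float value, float min, float max)
    {
        return math.clamp(value, min, max); // lowers to a min + max pair
    }

    static float BranchVersion(float value, float min, float max)
    {
        // Burst turns these branches into compare + select instructions
        if (value > max) value = max;
        else if (value < min) value = min;
        return value;
    }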
     
    Last edited: Oct 7, 2022
    Arowx likes this.
  41. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Sounds like the Burst compiler is missing a trick here.

    But what are their benchmark stats (not all instructions take the same time)?
     
    Last edited: Oct 8, 2022
  42. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    Ofc, I've been using this as a reference: https://www.agner.org/optimize/instruction_tables.pdf.
    After some testing, I get roughly the same timings for both at ~1800ms for 1 billion elements.
     
    Elapotp likes this.
  43. chadfranklin47

    chadfranklin47

    Joined:
    Aug 11, 2015
    Posts:
    217
    I wanted to ask if there is a reason for AVX intrinsics not being used for certain math operations such as float4x2 or float4x4 component-wise multiplication? I haven't been able to find much on the forum except some info that may be outdated such as here. Additionally, are there plans for AVX intrinsics for these ops?
     
    Last edited: Oct 9, 2022