
Structs are slow

Discussion in 'Scripting' started by CPU_USAGE, Jan 16, 2023.

  1. CPU_USAGE

    CPU_USAGE

    Joined:
    Jan 15, 2023
    Posts:
    20
A structure of data is slow, did you know?


This is to say: if I had X structs in a list, each containing 4 pieces of data, I would organise them that way so I could write less code, and so I could obtain a group of data in a single line instead of four individual lines. Take Vector4 as an example: the structure is convenient for the programmer, but when it comes down to optimal performance, storing and using 4 lists of indexed data is much faster to compute than storing and using a structure of data.

    I will go on.
    Storing 1 list containing 4 data ranges of equivalent size is the absolute fastest method for anything: meshes, images, anything at all really.

    If a Vector3 position and Euler rotation were merely 3 lists of floats, which you would access more cumbersomely via code, you would notice it is actually 5 to 10 times faster than obtaining that data from a structure. So obtaining that data in huge quantities in a single request allows for approximately 10 or more times the capacity to be returned, compared to having the data in a coding structure.
     
  2. DragonCoder

    DragonCoder

    Joined:
    Jul 3, 2015
    Posts:
    1,677
    Well, to a degree one has to say: hence why Unity pushes towards DOTS ;)

    That said, be careful about premature optimization.
    In the end it always matters how much data is requested and how contiguously it is stored. The latter is the reason why your list of Vectors is faster than a list of structs which contain more data.
    However, that is only performance-relevant for operations on simple types and assignments of references. Meshes and images are references to massive amounts of data and are thus more likely to be the bottleneck if you manipulate that data.

    So unless you are assigning thousands of those, don't complicate your code for such a minimal performance improvement!
     
    Terraya likes this.
  3. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,617
    It depends heavily on the usage pattern. But yes, in general the way C# does things often results in sub-optimal access. On one hand constructors are often called when they probably don't need to be, and on the other there are many cases where to change one value an entire struct needs to be copied... twice.
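
    A familiar sketch of that double copy (the usual Transform.position dance, shown here purely for illustration):
    Code (csharp):
    // 'position' is a property returning a Vector3 struct, so you can't write
    // transform.position.x directly. Changing one component means copying the
    // whole struct out, modifying it, and copying the whole struct back in.
    Vector3 p = transform.position; // copy #1: the getter returns a copy
    p.x += 1f;                      // modify the local copy
    transform.position = p;         // copy #2: the whole struct goes back through the setter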

    While I find this grating, it's important to realise that most of the time it makes zero practical difference to performance. But the difference to code readability, maintainability, etc. is quite significant, and worth it most of the time.

    And, as you point out, in the rare cases where you really do need performance there are faster options, just so long as you're familiar with them.

    (And while I assume anyone talking about this stuff would already know this, make sure any attempted optimisations are tested in a build on your target device, because the Editor eats performance, and IL2CPP can change performance characteristics of C# code anyway.)
     
  4. PraetorBlue

    PraetorBlue

    Joined:
    Dec 13, 2012
    Posts:
    7,893
    You're starting down the long path of thought that Jonathan Blow set off on probably 10-20 years ago.
     
  5. lordofduct

    lordofduct

    Joined:
    Oct 3, 2011
    Posts:
    8,514
    I would like to see the tests you've run to come to this conclusion, as I am suspicious of your findings. And I have a sneaking suspicion that the results you're seeing are more related to the use of List rather than the use of struct.

    What I'm getting at is sort of related to structs, in that large structs can be slower to copy, which is why Microsoft suggests keeping your structs small:
    https://learn.microsoft.com/en-us/d...-between-class-and-struct?redirectedfrom=MSDN

    This size concern has to do with copying (for example passing a struct as a parameter without a 'ref' prefix). The more bytes that make up your struct, the more bytes that have to be copied.
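
    A minimal sketch of what that copying means in practice (hypothetical helper methods, just for illustration):
    Code (csharp):
    // Passing the struct by value copies all 16 bytes of the Vector4 per call.
    static float SumByValue(Vector4 v)
    {
        return v.x + v.y + v.z + v.w;
    }

    // Passing it with 'ref' passes an address instead, so the struct itself is not copied.
    static float SumByRef(ref Vector4 v)
    {
        return v.x + v.y + v.z + v.w;
    }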

    And this is where the List comes in. Accessing members of a List requires copying the value from inside the List to outside the List, because the List's indexer is just that... a property. As a result it's like the return value of a method and therefore returns a copy. (There are newer features in C# that complicate this conversation, but I'm going to avoid them for brevity's sake... in the case of List it's a copy.)

    You can see this in the fact that a List of Vector3 used like this fails:
    Code (csharp):
    1. var lst = new List<Vector3>();
    2. lst.Add(new Vector3(1f, 1f, 1f));
    3. lst[0].x += 1f;
    4. Debug.Log(lst[0]); // would still print (1, 1, 1) even if line 3 compiled
    This doesn't just fail silently; the compiler actually shows an error: "CS1612 Cannot modify the return value of 'List<Vector3>.this[int]' because it is not a variable"
    https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/compiler-messages/cs1612?f1url=?appId=roslyn&k=k(CS1612)

    This copying slows down the List. And since a single float is a much smaller amount of data... it'll end up being faster for individual access. If you only need to access the y value of the Vector3, then with the List<Vector3> you end up copying the entire Vector3 just to read the y. Whereas with the 3 List<float>s you only copy the single y, as you're directly accessing the only value you need.

    Thing is, if you replace the List<T> with a T[] array... this is no longer an issue, as you're not copying the values anymore. Arrays allow for direct access to the elements in the array. An array will be significantly faster than a List.
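
    A quick sketch of that difference (same Vector3 example as above, but with an array):
    Code (csharp):
    var arr = new Vector3[1];
    arr[0] = new Vector3(1f, 1f, 1f);
    arr[0].x += 1f;            // compiles: the array element is modified in place
    Debug.Log(arr[0]);         // x is now 2

    var lst = new List<Vector3> { new Vector3(1f, 1f, 1f) };
    // lst[0].x += 1f;         // CS1612: the indexer returns a copy, so this won't compile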

    ...

    But I'm still suspicious of your results and would like to see them. Because I just slapped together this simple test:
    Code (csharp):
    1.     public class zTest02 : MonoBehaviour
    2.     {
    3.  
    4.         public const int MAX_CNT = 1000000;
    5.  
    6.         private TestStructsList _testStructsList = new TestStructsList();
    7.         private TestStructsArray _testStructsArray = new TestStructsArray();
    8.         private TestSegmentedList _testSegmentedList = new TestSegmentedList();
    9.         private TestSegmentedArray _testSegmentedArray = new TestSegmentedArray();
    10.         private TestSegmentedListOnlyY _testSegmentedListOnlyY = new TestSegmentedListOnlyY();
    11.  
    12.         private System.Diagnostics.Stopwatch _stopwatch = new System.Diagnostics.Stopwatch();
    13.  
    14.         private void Start()
    15.         {
    16.             _testStructsList.Initialize();
    17.             _testStructsArray.Initialize();
    18.             _testSegmentedList.Initialize();
    19.             _testSegmentedArray.Initialize();
    20.             _testSegmentedListOnlyY.Initialize();
    21.         }
    22.  
    23.         private void Update()
    24.         {
    25.             if (!Input.GetKeyDown(KeyCode.Space)) return;
    26.  
    27.             _stopwatch.Restart();
    28.             _testStructsList.TestSpeed();
    29.             _stopwatch.Stop();
    30.             var ts_structlist = _stopwatch.Elapsed;
    31.  
    32.             _stopwatch.Restart();
    33.             _testStructsArray.TestSpeed();
    34.             _stopwatch.Stop();
    35.             var ts_structarr= _stopwatch.Elapsed;
    36.  
    37.             _stopwatch.Restart();
    38.             _testSegmentedList.TestSpeed();
    39.             _stopwatch.Stop();
    40.             var ts_seglist = _stopwatch.Elapsed;
    41.  
    42.             _stopwatch.Restart();
    43.             _testSegmentedArray.TestSpeed();
    44.             _stopwatch.Stop();
    45.             var ts_segarr = _stopwatch.Elapsed;
    46.  
    47.             _stopwatch.Restart();
    48.             _testSegmentedListOnlyY.TestSpeed();
    49.             _stopwatch.Stop();
    50.             var ts_onlyy = _stopwatch.Elapsed;
    51.  
    52.  
    53.             Debug.Log($"{ts_structlist.TotalMilliseconds:0.000}, {ts_structarr.TotalMilliseconds:0.000}, {ts_seglist.TotalMilliseconds:0.000}, {ts_segarr.TotalMilliseconds:0.000}, {ts_onlyy.TotalMilliseconds:0.000}");
    54.  
    55.         }
    56.  
    57.  
    58.     }
    59.  
    60.  
    61.     public class TestStructsList
    62.     {
    63.  
    64.         private List<Vector4> lst = new List<Vector4>(zTest02.MAX_CNT);
    65.  
    66.         public void Initialize()
    67.         {
    68.             lst.Clear();
    69.             for (int i = 0; i < zTest02.MAX_CNT; i++)
    70.             {
    71.                 lst.Add(new Vector4(i, i, i, i));
    72.             }
    73.         }
    74.  
    75.         public void TestSpeed()
    76.         {
    77.             for (int i = 0; i < lst.Count; i++)
    78.             {
    79.                 var v = lst[i];
    80.                 v.x += 1f;
    81.                 v.y += 1f;
    82.                 v.z += 1f;
    83.                 v.w += 1f;
    84.                 lst[i] = v;
    85.             }
    86.         }
    87.  
    88.     }
    89.  
    90.     public class TestSegmentedList
    91.     {
    92.  
    93.         private List<float> lstx = new List<float>(zTest02.MAX_CNT);
    94.         private List<float> lsty = new List<float>(zTest02.MAX_CNT);
    95.         private List<float> lstz = new List<float>(zTest02.MAX_CNT);
    96.         private List<float> lstw = new List<float>(zTest02.MAX_CNT);
    97.  
    98.         public void Initialize()
    99.         {
    100.             lstx.Clear();
    101.             lsty.Clear();
    102.             lstz.Clear();
    103.             lstw.Clear();
    104.             for (int i = 0; i < zTest02.MAX_CNT; i++)
    105.             {
    106.                 lstx.Add(i);
    107.                 lsty.Add(i);
    108.                 lstz.Add(i);
    109.                 lstw.Add(i);
    110.             }
    111.         }
    112.  
    113.         public void TestSpeed()
    114.         {
    115.             for (int i = 0; i < lstx.Count; i++)
    116.             {
    117.                 lstx[i] += 1f;
    118.                 lsty[i] += 1f;
    119.                 lstz[i] += 1f;
    120.                 lstw[i] += 1f;
    121.             }
    122.         }
    123.  
    124.     }
    125.  
    126.     public class TestStructsArray
    127.     {
    128.  
    129.         private Vector4[] lst = new Vector4[zTest02.MAX_CNT];
    130.  
    131.  
    132.         public void Initialize()
    133.         {
    134.             for (int i = 0; i < zTest02.MAX_CNT; i++)
    135.             {
    136.                 lst[i] = new Vector4(i, i, i, i);
    137.             }
    138.         }
    139.  
    140.         public void TestSpeed()
    141.         {
    142.             for (int i = 0; i < lst.Length; i++)
    143.             {
    144.                 lst[i].x += 1f;
    145.                 lst[i].y += 1f;
    146.                 lst[i].z += 1f;
    147.                 lst[i].w += 1f;
    148.             }
    149.         }
    150.  
    151.     }
    152.  
    153.     public class TestSegmentedArray
    154.     {
    155.  
    156.         private float[] lstx = new float[zTest02.MAX_CNT];
    157.         private float[] lsty = new float[zTest02.MAX_CNT];
    158.         private float[] lstz = new float[zTest02.MAX_CNT];
    159.         private float[] lstw = new float[zTest02.MAX_CNT];
    160.  
    161.  
    162.         public void Initialize()
    163.         {
    164.             for (int i = 0; i < zTest02.MAX_CNT; i++)
    165.             {
    166.                 lstx[i] = i;
    167.                 lsty[i] = i;
    168.                 lstz[i] = i;
    169.                 lstw[i] = i;
    170.             }
    171.         }
    172.  
    173.         public void TestSpeed()
    174.         {
    175.             for (int i = 0; i < lstx.Length; i++)
    176.             {
    177.                 lstx[i] += 1f;
    178.                 lsty[i] += 1f;
    179.                 lstz[i] += 1f;
    180.                 lstw[i] += 1f;
    181.             }
    182.         }
    183.  
    184.     }
    185.  
    186.     public class TestSegmentedListOnlyY
    187.     {
    188.  
    189.         private List<float> lstx = new List<float>(zTest02.MAX_CNT);
    190.         private List<float> lsty = new List<float>(zTest02.MAX_CNT);
    191.         private List<float> lstz = new List<float>(zTest02.MAX_CNT);
    192.         private List<float> lstw = new List<float>(zTest02.MAX_CNT);
    193.  
    194.         public void Initialize()
    195.         {
    196.             lstx.Clear();
    197.             lsty.Clear();
    198.             lstz.Clear();
    199.             lstw.Clear();
    200.             for (int i = 0; i < zTest02.MAX_CNT; i++)
    201.             {
    202.                 lstx.Add(i);
    203.                 lsty.Add(i);
    204.                 lstz.Add(i);
    205.                 lstw.Add(i);
    206.             }
    207.         }
    208.  
    209.         public void TestSpeed()
    210.         {
    211.             for (int i = 0; i < lstx.Count; i++)
    212.             {
    213.                 lsty[i] += 1f;
    214.             }
    215.         }
    216.  
    217.     }
    And I get these results:
    [screenshot: timing results in milliseconds for each test]
    (structlist, structarray, segmentedlist, segmentedarray, only_y)

    For starters we can see the arrays are WAY faster than the Lists. Like I said before, arrays allow for direct access.

    But we can also see that doing work on the List<Vector4> is faster than on the segmented set of 4 List<float>s.

    The exception is doing work on ONLY the y member of the segmented 4 List<float>s which, as I predicted, is faster because copying that 1 float is cheaper than copying an entire Vector4. But it's only about 20% faster, and a lot of that is because I'm doing less work (1 += vs 4 +=). Nowhere near the 5 to 10 times faster that you're referring to.

    ...

    Now what is telling is that the array of Vector4 is actually slower than the 4 arrays of floats.

    Nowhere near the scale of 5 to 10 times you referred to, but it's 8-9% faster. Why is that?

    Well, it's because of how we access the memory when it comes to an array. An array takes up a single swath of memory that allows it to fit all its elements (plus a little bit of header and buffer data). But in the end it's just basically a single swath of memory. So a byte array of [3,4,5,6] might look like this in memory:
    0x3|0x4|0x5|0x6

    The pointer address of this array will be the position of the head of the array. And any element of it will be found at address:
    &array_head + (index * element_size)

    Note that accessing a struct is sort of similar, in that a member will be accessed at the address of the head of the struct plus the offset of that specific member:
    &struct + member_offset

    So a Vector4 might be shaped:
    float|float|float|float
    (where each float is 4 bytes)

    And the addresses of x, y, z, w are:
    &struct + 0
    &struct + 4
    &struct + 8
    &struct + 12

    respectively

    So this means accessing, say, the y value of the Nth Vector4 in the array becomes:
    &array_head + (index * element_size) + 4

    There's more work in figuring out the address, so it's ever so slightly slower.
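
    You can sanity check those sizes and offsets yourself (a sketch using Marshal; it assumes Vector4 really is laid out as four packed floats, which it is in practice):
    Code (csharp):
    using System.Runtime.InteropServices;

    int size = Marshal.SizeOf(typeof(Vector4));                  // 16 bytes
    int yOffset = (int)Marshal.OffsetOf(typeof(Vector4), "y");   // 4 bytes into the struct

    int index = 10;
    int byteOffsetOfY = index * size + yOffset;                  // index * 16 + 4
    Debug.Log($"{size}, {yOffset}, {byteOffsetOfY}");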

    But it's only slower because of HOW we're using the data! What if we wanted to treat this thing as a Vector4 and perform the methods associated with vectors, like Dot, Magnitude, Min, Max, Lerp, Normalize, MoveTowards, etc.?

    Well, for starters we'd either have to turn our segmented arrays into the appropriate Vector4 and then perform the method on it, costing us the price of creating the struct (and potentially a cache miss... which we'll get to in a bit), meaning we're now copying floats out of the arrays and into a new struct.

    OR

    We do the arithmetic longhand, ad hoc, right there in place. Which is... cumbersome and annoying to write as a programmer. All to save a teeny tiny bit of performance.

    A cost that could likely also be mitigated by aggressive inlining of the previously mentioned methods on the struct (basically telling the compiler to compile the contents of the Dot, Magnitude, etc. methods directly into the scope of the calling code, avoiding the overhead of a method call at the expense of duplicated code in the compiled result).
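
    A sketch of what that hint looks like in C# (a hypothetical Dot helper; AggressiveInlining is a request to the compiler, not a guarantee):
    Code (csharp):
    using System.Runtime.CompilerServices;

    public static class VectorMathSketch
    {
        // Hypothetical helper: asks the JIT (or IL2CPP) to paste the body
        // into the call site instead of emitting a method call.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static float Dot(Vector4 a, Vector4 b)
        {
            return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
        }
    }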

    ...

    Which brings me to DOTS.

    So DOTS is short for "Data-Oriented Technology Stack". Basically, Unity has created a set of libraries to facilitate what is called "Data-Oriented Design" (DoD). DoD is all about organizing your program so that the necessary data is laid out in a way that makes it really fast for the CPU to operate on.

    What I mean is this. Say you have thousands of Vector3s in memory and you want to adjust all of them (move them forward?). In a traditional OO design those position vectors may be scattered all over RAM. And the way a CPU works, it needs to copy those positions into its local cache, perform the operation, and copy them back out to RAM. So the CPU starts asking RAM for scattered, random addresses and copying them into cache individually. This is slow. It's like telling your assistant to go into the garage and pick up nails scattered around the floor one by one, bring them to you, and place them in the nail box at your side.

    This scattering, and therefore the indeterminate amount of time for each entry to land in cache, means the CPU may reach into the cache to grab the next vector to operate on and find it's not there. This is called a "cache miss", and the CPU is left waiting for the value to show up, which can cause stutter in your calculations.

    So instead, what if in RAM you put ALL of those vectors side by side in memory... like you might find in an array! Then when the CPU needs them... it can tell RAM to just start copying the data at the start of that swath of memory and just keep copying it for N elements.

    This act is way more predictable. It's like setting up a conveyor belt from the garage to my box of nails at my side. I have a constant supply of nails (vectors) to operate on, and as long as I'm not outpacing the conveyor belt, I'll never run out, until I'm done!

    And this is what DOTS is designed to facilitate. Say you want thousands of mobs on screen that all flock around by a reproducible algorithm (each vector just has the same reproducible arithmetic applied to it). We organize the memory to facilitate this job, then stream it through the CPU as fast as possible, thus optimizing the whole act.

    BUT MIND YOU... the way we organize the data in memory is particular to the job at hand.

    And this is why I mentioned earlier:
    If we're just going to treat them as a clump of individual floats independent of one another, keeping them as segmented floats may be better. Or hell... why not just one array of floats that is 4 * the number of vectors in length? Their relationship to one another is inconsequential to the algorithm we're applying, and we've optimized further by putting all of them in 1 large swath of memory rather than 4 swaths.
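
    A sketch of that flat layout, assuming the algorithm really does treat every component identically (reusing MAX_CNT from the test above):
    Code (csharp):
    // One contiguous block of 4 * N floats. The x/y/z/w grouping is irrelevant
    // to this loop, so we just stream straight through the whole swath.
    float[] data = new float[4 * zTest02.MAX_CNT];

    for (int i = 0; i < data.Length; i++)
    {
        data[i] += 1f;
    }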

    But really, in the end, the data-oriented layout got us 99% of our efficiency (again, look back at my numbers... the List was wayyyy slower than the Array). Why convolute the algorithm and make it horribly unreadable for that tiny bit of extra oomph?

    Sure, if it turns out that extra 1% is the difference between making or breaking your game, go for it.

    But really... it's minor.

    And at that point... if you're convoluting your code for those diminishing returns, why even write in C# at all? Why not C++? Why not pure C? Why not... assembly??? If you want performance at the expense of readability, assembly is king!

    And this brings me back to the beginning.

    I'm suspicious of the idea that your results are because of structs. The speed difference is low for reasonably sized structs. Instead, the performance discrepancy may come from WHAT you're doing with those structs vs the segmented lists: the copying of data, the fact you may only be accessing a small subset of a larger set of data (the y value out of x,y,z,w), things like that.

    ...

    And I'd like to add that there is an expense to organizing things like this. Handling a dynamic set of mobs becomes difficult.

    Think of why you might use a List over an array. Well... you can ADD entries to a List; you can't to an array. (Really, a List just resizes its internal array as necessary... so technically a List is just an array with helper methods that save you writing the resize logic over and over.)
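
    Roughly what a List does internally, as a simplified sketch (not the actual List<T> source):
    Code (csharp):
    // Simplified growable array, like the one List<T> maintains for you.
    public class GrowableArray<T>
    {
        private T[] _items = new T[4];
        private int _count;

        public int Count => _count;

        public T this[int index]
        {
            get => _items[index];
            set => _items[index] = value;
        }

        public void Add(T item)
        {
            if (_count == _items.Length)
            {
                // Out of room: allocate a bigger array and copy everything across.
                System.Array.Resize(ref _items, _items.Length * 2);
            }
            _items[_count++] = item;
        }
    }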

    But the same goes for DoD. If I set up 1000 mobs to be animated in a flock around the screen, and I want to add 1 more at random, I either have to rebuild the entire data structure OR preemptively set up the data structure with, say, a 2000 mob limit ahead of time to accommodate it.

    This is actually how old games used to do things on the low-powered systems of the time. Why was there a sprite limit on screen? Because the memory layout and draw routine were optimized to target a specific set of capabilities and came with that limit as a result.

    Data-oriented design is powerful, but it creates new limitations in other ways. As a programmer/engineer, your job is to determine whether those limitations are worth it. Unreadable optimized code to hit 60fps on last-gen hardware??? Worth it.

    But what if you're already getting 300fps? Well readable and maintainable code becomes king.

    You know your game only ever allows 100 Pikmin at any given time? DoD it up, my friend; now the AI/pathfinding loop is way faster on that low-powered GameCube!

    You want to support as many mobs as memory allows as the game grows? Well... stuff's going to get real complicated real fast.
     
    Last edited: Jan 16, 2023
    ADNCG, arkano22, Nad_B and 2 others like this.
  6. AngryProgrammer

    AngryProgrammer

    Joined:
    Jun 4, 2019
    Posts:
    490
    The fastest and most memory-saving solution you can have is to handle integers instead of structs.

    Modern programmers forget about things like bitwise operators and storing tons of data in a single integer. Excess resources have spoiled us.

    There is no faster operation than a bitwise operation or an index operation.

    Example: a single integer can store x and y coordinates, the type of tile, who is standing on it, etc.
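
    A sketch of that kind of packing (the field widths here are made up purely for illustration):
    Code (csharp):
    // Hypothetical layout: 10 bits x, 10 bits y, 6 bits tile type, 5 bits occupant, 1 bit spare.
    public static class TilePacking
    {
        const int X_BITS = 10, Y_BITS = 10, TYPE_BITS = 6;

        public static int Pack(int x, int y, int tileType, int occupant)
        {
            return x
                 | (y << X_BITS)
                 | (tileType << (X_BITS + Y_BITS))
                 | (occupant << (X_BITS + Y_BITS + TYPE_BITS));
        }

        public static int UnpackX(int packed)    => packed & ((1 << X_BITS) - 1);
        public static int UnpackY(int packed)    => (packed >> X_BITS) & ((1 << Y_BITS) - 1);
        public static int UnpackType(int packed) => (packed >> (X_BITS + Y_BITS)) & ((1 << TYPE_BITS) - 1);
    }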
     
    Last edited: Jan 17, 2023
  7. dlorre

    dlorre

    Joined:
    Apr 12, 2020
    Posts:
    699
    The purpose of using a list is to insert and remove elements frequently. If you do that, 1 list of structs is faster than 3 lists. Now, if you access data more frequently than you insert and remove, you shouldn't use lists but arrays. It's a balancing act.
     
    DragonCoder and Bunny83 like this.
  8. arkano22

    arkano22

    Joined:
    Sep 20, 2012
    Posts:
    1,891
    You just described AoS (Array of Structs) vs SoA (Struct of Arrays). This is well-known stuff.

    Which one is faster depends on your data access pattern. This is closely related to cache performance, hot/cold splitting and, to some extent, SIMD.

    On a side note, this is also what Unity's DOTS is all about.
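
    For anyone unfamiliar with the terms, a quick sketch of the two layouts (hypothetical particle data, just for illustration):
    Code (csharp):
    // AoS (Array of Structs): one array whose elements carry all fields together.
    public struct ParticleAoS
    {
        public Vector3 position;
        public Vector3 velocity;
        public float life;
    }
    // usage: ParticleAoS[] particles = new ParticleAoS[1000];

    // SoA (Struct of Arrays): one container whose fields are parallel arrays.
    public class ParticlesSoA
    {
        public Vector3[] positions = new Vector3[1000];
        public Vector3[] velocities = new Vector3[1000];
        public float[] lives = new float[1000];
    }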
     
    Bunny83 likes this.
  9. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,617
    Note that what C# calls a List is very different from the "linked list" described in many data structures resources and courses, where exactly this (cheap insertion and removal) is given as its main strength. In C#, a "List" is an automatically resizing array, conceptually similar to a std::vector in C++.
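
    You can see the backing array at work through the Capacity property (the exact numbers below are typical of the current implementation, not guaranteed):
    Code (csharp):
    var list = new List<int>();
    Debug.Log(list.Capacity);   // typically 0: no backing array allocated yet
    for (int i = 0; i < 5; i++) list.Add(i);
    Debug.Log(list.Capacity);   // typically 8: the internal array grew 0 -> 4 -> 8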
     
    dlorre likes this.
  10. Yoreki

    Yoreki

    Joined:
    Apr 10, 2019
    Posts:
    2,605
    Technically speaking, due to the limited amount of very fast CPU cache we have, loading less unused data is faster. This is why, generally speaking, data-oriented programming approaches are indeed more performant than object-oriented ones. However, generally speaking, object-oriented programming is easier to work with than data-oriented programming.

    For the overwhelming majority of games and other applications, you won't notice, or rather need, this performance difference, however. What matters most for modern software is speed of development and scalability, both of which are why we arrived at object-oriented programming schemes in the first place. Readability and maintainability scale well with object-oriented structures.
     
    lordofduct likes this.
  11. arkano22

    arkano22

    Joined:
    Sep 20, 2012
    Posts:
    1,891
    Any system that needs to scale in size needs to be designed with a data-oriented approach, and this includes a lot of mid-sized/large games. How you lay your data out in memory and how you access it tends to have a much larger impact on performance than arithmetic operations do.

    What I've found over time is that the "hotness" of any code path (how often it will be executed) and how much data it must be able to potentially handle are the best indicators of which approach to take: if something will not be executed very often or doesn't need to scale to really big amounts of data, an object-oriented approach is fine.

    Things like traffic systems, units in RTS games, or particle effects are good examples of stuff that should pretty much always be data-oriented.
     
    Yoreki and angrypenguin like this.