
Do arrays and dictionaries drop performance when they get too large?

Discussion in 'Scripting' started by brigas, Feb 26, 2020.

  1. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    For O(1) operations, are arrays and dictionaries supposed to drop in performance when they get too large?

    For example, I'm having trouble accessing dictionaries after 2 million entries, and arrays after 200 million entries.

    By the way, this is only for direct access (O(1)); I don't loop through these.

    Should I just make a new array and split it in half, 100 million values in one and 100 million in the other?
    Same question for dictionaries, thanks!
     
  2. Antistone

    Antistone

    Joined:
    Feb 22, 2014
    Posts:
    2,833
    200 million entries is really large. If each element is a 64-bit handle (8 bytes), that's about 1.5 GB for the array (not even counting the objects being referenced!). I could imagine that needing to keep a contiguous block of memory that large could be causing various inefficiencies in the memory manager.

    How much RAM does your computer have, and how much is your game currently using when it runs?

    You might want to think about whether there's any way you could divide your game into pieces and keep only certain pieces in RAM at any given moment, swapping them in and out as necessary.


    Dictionaries use more memory per element, but not a factor of 100 more, so no reason springs to mind for a 2 million-entry Dictionary to crap out. But Dictionaries are implemented as hash tables, whose performance characteristics are complex, and the performance "guarantees" are only probabilistic.

    If your Dictionary is using a hash function that isn't efficient for the actual range of keys you're inserting, then you could be getting performance problems due to collisions.
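    As a minimal illustration of that worst case (the BadKey type below is made up purely for demonstration), giving every key the same hash code forces every insert and lookup to scan a single bucket, so the Dictionary degrades toward O(n) per operation:

    Code (CSharp):
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    // Hypothetical worst case: every key reports the same hash code,
    // so all entries land in one bucket and operations degrade toward O(n).
    struct BadKey : IEquatable<BadKey>
    {
        public int Value;
        public BadKey(int value) { Value = value; }
        public bool Equals(BadKey other) => Value == other.Value;
        public override bool Equals(object obj) => obj is BadKey other && Equals(other);
        public override int GetHashCode() => 1; // deliberately terrible
    }

    static class CollisionDemo
    {
        static void Main()
        {
            const int n = 100_000; // even a modest count shows the difference
            var good = new Dictionary<int, byte>();
            var bad = new Dictionary<BadKey, byte>();

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++) good[i] = 1;
            Console.WriteLine($"well-distributed keys: {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < n; i++) bad[new BadKey(i)] = 1;
            Console.WriteLine($"colliding keys: {sw.ElapsedMilliseconds} ms");
        }
    }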
     
    Madgvox and Yoreki like this.
  3. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    For the arrays, they are arrays of byte. I have been paying close attention to how my RAM behaves, and the 200 million array takes 200 MB of RAM.

    For the dictionaries, I have a very efficient hash algorithm.

    My PC has 32 GB of RAM, but usage never goes above 800 MB while in-game (I have other elements), except when I test the array with 500 million entries, which of course takes 300 MB extra.

    But the 500 million version is extremely slow; just waiting for it to fill takes 6 minutes, while the 200 million takes 30 seconds. It starts off very fast but gets slow towards the end.

    That's why I was wondering if splitting the arrays in two would improve performance?
    Same for the dictionaries, considering I have good hashing?
     
  4. TurboNuke

    TurboNuke

    Joined:
    Dec 20, 2014
    Posts:
    69
    Could we see your code?
     
  5. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    It's just a jagged array[x][y][z],

    with a coroutine to fill it with bytes = 1:

    Code (CSharp):
    array = new byte[800][][];
    for (int x = 0; x < 800; x++) {
        array[x] = new byte[800][];
        for (int y = 0; y < 800; y++) {
            array[x][y] = new byte[800];
            for (int z = 0; z < 800; z++) {
                array[x][y][z] = 1;

                timer++;
                if (timer > 100000) {
                    timer = 0;
                    yield return null; // spread the fill across frames
                }
            }
        }
    }
     
    Last edited: Feb 27, 2020
  6. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    what is an efficient hash algorithm?
    did you pick one based on its speed or its reliability?
    hash algorithms are like Pokémon, there is no perfect one in all respects.

    either your hashing algorithm is slow, or it's producing too many collisions, choking your Dictionary.
    what you described sounds like the latter: the hash buckets are getting too deep and have to be inspected in O(n) time every time you add something to them.

    hash codes are 32-bit, meaning they can be used for a hash space of 2^32 states.
    your algorithm needs to distribute well over this range, and with 200M entries you're at about 5% of it, which doesn't sound too bad, so in theory it should work quite nicely (unless you're producing a lot of garbage in the process).

    how exactly did you implement your hashing algorithm? this is really important.
    when you're nearing such volumes you really have to be mindful of every little detail.
    because any inefficiency will snowball dramatically, ultimately leading to chokepoints of death.

    also what are your key and value data types?
     
  7. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    here is my hashing struct:
    Code (CSharp):
    public struct KeyHashableVector3Int : IEquatable<KeyHashableVector3Int>
    {
        public int X;
        public int Y;
        public int Z;

        public KeyHashableVector3Int(int x, int y, int z)
        {
            X = x;
            Y = y;
            Z = z;
        }

        public override string ToString() => nameof(KeyHashableVector3Int);

        public override bool Equals(object obj) => obj is KeyHashableVector3Int other && Equals(other);

        public bool Equals(KeyHashableVector3Int other)
        {
            return X == other.X &&
                   Y == other.Y &&
                   Z == other.Z;
        }

        public override int GetHashCode()
        {
            unchecked
            {
                int hash = 17;
                hash = hash * 23 + X.GetHashCode();
                hash = hash * 23 + Y.GetHashCode();
                hash = hash * 23 + Z.GetHashCode();
                return hash;
            }
        }
    }
    This struct is the key; the value is a byte.

    I'm more concerned with the arrays though.
     
  8. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    that's a schoolbook example of a bad, very bad hash algorithm.
    in fact, that's not really a hashing algorithm at all.

    it's a very simple hash generator that I've seen around, typically used to store points and such.
    it serves well when the system is not under stress, but it has poor distribution behavior and should never be used for such demanding hashing.

    look up xxHash and implement that instead (take your time), and then come back.
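    just to illustrate what better distribution looks like (this is not xxHash itself, and not code from any particular project; it's a sketch using the well-known MurmurHash3 finalizer), something in this direction spreads neighboring coordinates over the whole 32-bit space:

    Code (CSharp):
    // Sketch only: combine three ints, then run them through a
    // MurmurHash3-style finalizer so nearby coordinates don't cluster.
    public static class HashMix
    {
        public static int Combine(int x, int y, int z)
        {
            unchecked
            {
                uint h = (uint)x;
                h = Fmix32(h + 0x9E3779B9u + (uint)y);
                h = Fmix32(h + 0x9E3779B9u + (uint)z);
                return (int)h;
            }
        }

        // MurmurHash3 fmix32 finalizer
        static uint Fmix32(uint h)
        {
            h ^= h >> 16;
            h *= 0x85EBCA6Bu;
            h ^= h >> 13;
            h *= 0xC2B2AE35u;
            h ^= h >> 16;
            return h;
        }
    }

    the struct's GetHashCode() could then simply return HashMix.Combine(X, Y, Z).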
     
  9. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    as for the arrays, I think you should look up how to configure garbage collection properly.
    you have to consider that you're approaching non-standard C# territory.

    the runtime has a flag for allowing very large objects in memory (gcAllowVeryLargeObjects), so they can be treated specially.
    I'd advise using a solution that splits the array into many smaller arrays behind the curtain; I've seen a couple of them, BigArray and such.

    in fact, I'd advise against such extensive sizes in the first place, I can't see why anyone would need data this big in a game engine, but hey. good luck with the customer support and maintenance I guess.
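    a minimal sketch of the "split it behind the curtain" idea (not any particular BigArray library, just the chunking concept):

    Code (CSharp):
    // Sketch only: one logical byte array stored as many smaller blocks,
    // so the runtime never needs a single huge contiguous allocation.
    public sealed class ChunkedByteArray
    {
        const int ChunkBits = 20;             // 1 MB chunks (2^20 bytes)
        const int ChunkSize = 1 << ChunkBits;
        const int ChunkMask = ChunkSize - 1;

        readonly byte[][] _chunks;

        public ChunkedByteArray(long length)
        {
            Length = length;
            long chunkCount = (length + ChunkSize - 1) / ChunkSize;
            _chunks = new byte[chunkCount][];
            for (long i = 0; i < chunkCount; i++)
                _chunks[i] = new byte[ChunkSize];
        }

        public long Length { get; }

        public byte this[long index]
        {
            get { return _chunks[index >> ChunkBits][index & ChunkMask]; }
            set { _chunks[index >> ChunkBits][index & ChunkMask] = value; }
        }
    }

    usage would be something like: var data = new ChunkedByteArray(500_000_000L); data[123_456_789] = 1;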
     
  10. TurboNuke

    TurboNuke

    Joined:
    Dec 20, 2014
    Posts:
    69
    You're using jagged arrays, which I'm guessing will be giving you serious cache issues, jumping around all over the place. Just as a test, could you try linearly filling a 1D array of size 800x800x800 and compare the speeds?
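    A quick sketch of that test, assuming z varies fastest in the flattened layout:

    Code (CSharp):
    // Sketch of the suggested comparison: one flat byte[] of 800*800*800
    // filled linearly, with the 3D index recovered by arithmetic.
    const int Size = 800;
    byte[] flat = new byte[Size * Size * Size];   // 512,000,000 bytes (~512 MB)

    // Linear fill: walks memory front to back, which is cache friendly.
    for (int i = 0; i < flat.Length; i++)
        flat[i] = 1;

    // Random access later: map (x, y, z) to the flat index, z varying fastest.
    int x = 10, y = 20, z = 30;
    flat[(x * Size + y) * Size + z] = 1;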
     
    orionsyndrome likes this.
  11. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    alright I'll test that later
    could you show me the implementation of that good one? How would you edit this struct?
     
  12. TurboNuke

    TurboNuke

    Joined:
    Dec 20, 2014
    Posts:
    69
    I just tested filling a single array vs your method. I get 2.5 s (jagged) vs 1.8 s (linear). This will depend on CPU, cache sizes, etc. But I realise having it in a coroutine with those tests will also make a big difference.

    In your test code you'll be yielding 5120 times (512,000,000 entries / 100,000 per yield), which at 60 fps adds about 85 seconds of wait time.
     
  13. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    https://github.com/noricube/xxHashSharp as an example

    this is just a snippet, and it's a class, not a struct, but other than ReferenceEquals there are no significant differences. _xxhash is a static reference that is seeded and prepared before its usage here. a hash code of 0 is not valid in this code, and there are also some other corner cases which I check in other places, to guarantee that my hash codes absolutely never produce a collision, while staying outside of reserved ranges.

    the external check is slow-ish, and implements a walking algorithm to try and hit an unused hash in a distributed manner, but I do this once per instantiation (extremely important), and I never swarm the system.

    Code (csharp):
    static public bool operator ==(Face a, Face b) {
        if (ReferenceEquals(null, a)) return ReferenceEquals(null, b);
        return a.Equals(b);
    }

    static public bool operator !=(Face a, Face b) { return !(a == b); }

    public override bool Equals(object obj) => Equals(obj as Face);

    public bool Equals(Face other) {
        return !ReferenceEquals(null, other) &&
               other.BelongsTo(Mesh) &&
               other.ContainsInOrder(_ids);
    }

    public override int GetHashCode() {
        if (_hashCode == 0) _hashCode = CalcHashCode(Mesh, _ids);
        return _hashCode;
    }

    static public int CalcHashCode(Mesh mesh, in Id3 ids) {
        int index = ids.IndexOfMin;

        var hashCode = mesh.GetHashCode();
        hashCode = (hashCode << 16) | (hashCode >> 16);
        hashCode ^= (int)_xxhash.GetHash(ids[index], ids[index.NextOf(NODES)], ids[index.PrevOf(NODES)]);

        // reserved range
        if (hashCode >= 0 && hashCode < GROUPS) hashCode = -hashCode ^ 0x10101010;

        return hashCode;
    }

    the walking algorithm (not as important, but if you're curious)
    Code (csharp):
    1.     [MethodImpl(MethodImplOptions.AggressiveInlining)]
    2.     void hashCodeWalk(ref int hashCode) {
    3.       unchecked {
    4.         if(!hashCode.IsEven()) {
    5.           hashCode = ~(hashCode - 2);
    6.         } else if(hashCode < 0) {
    7.           hashCode--;
    8.         } else {
    9.           hashCode++;
    10.         }
    11.       }
    12.       if(hashCode == 0) hashCode = -1; // 0 is invalid
    13.     }

    I have tested this with tens of millions of objects (which is an extensive test), and it works like a charm.
     
  14. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    Thanks for testing that! So what is slowing me down seems to be something else. What would the speed be for filling the array from a large list the same size as the array? Like this:

    Code (CSharp):
    for (x = 0; x < 800; x++)
    {
        for (y = 0; y < 800; y++)
        {
            for (z = 0; z < 800; z++)
            {
                array[x][y][z] = list[timer];
                timer++;
            }
        }
    }
    so the list would be 500 million bytes as well
     
  15. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    thanks!
     
  16. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    in fact, this was the reason I implemented xxHash in the first place.
    with the first version, I had huge issues that bit me from behind.
    it didn't manifest gracefully, but thankfully my intuition saved me in time.
    and it's just 3 ints that I had to serialize into a hash -> turns out it's not that simple.

    as with anything in life really, with greater quantities comes greater responsibility.
     
  17. Antistone

    Antistone

    Joined:
    Feb 22, 2014
    Posts:
    2,833
    Note that int.GetHashCode() is implemented as "return this".

    I notice that with your current hash function,

    Hash(x, 0, 0) == Hash(0, 23*x, 0) == Hash(0, 0, 23*23*x)

    If X/Y/Z are spatial coordinates in a rectangular area, that's probably not ideal from a collision perspective. If you know in advance that each of your three coordinates will always be in the range [0, 800) then I suggest the hash function

    hash = x ^ (y << 10) ^ (z << 20);

    This gives a unique hash for every coordinate in the range you are using (and is probably faster than your current hash function, to boot).


    Also... this is probably why you tried a Dictionary in the first place, but just in case you haven't thought of it:
    If most of that giant array contains the same value most of the time, you should consider whether there's any way you can store only the unusual values and not explicitly represent the default value.
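    A minimal sketch of that idea, assuming 0 is the default value (the SparseVoxels name is just for illustration):

    Code (CSharp):
    using System.Collections.Generic;

    // Sketch only: stores just the non-default cells; every other cell reads
    // back as the default value without ever being stored.
    public sealed class SparseVoxels
    {
        const byte DefaultValue = 0; // assumed default; use whatever value dominates
        readonly Dictionary<int, byte> _cells = new Dictionary<int, byte>();

        // Same packing as suggested above: coordinates in [0, 800) fit in 10 bits each.
        static int Key(int x, int y, int z) => x ^ (y << 10) ^ (z << 20);

        public byte this[int x, int y, int z]
        {
            get => _cells.TryGetValue(Key(x, y, z), out var v) ? v : DefaultValue;
            set
            {
                if (value == DefaultValue) _cells.Remove(Key(x, y, z)); // don't store defaults
                else _cells[Key(x, y, z)] = value;
            }
        }
    }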
     
    Last edited: Feb 27, 2020
  18. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    Thanks for the tip! Do I fix it like this?
    Code (CSharp):
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = X ^ (Y << 10) ^ (Z << 20);
            return hash;
        }
    }
     
  19. Antistone

    Antistone

    Joined:
    Feb 22, 2014
    Posts:
    2,833
    Looks fine to me.
     
  20. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    edited my mistake
     
  21. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    exactly, and with xor'd integers it's pretty easy to determine the hash space.
    for k integers each with a range less than 2^n, the hash is guaranteed to be unique within (2^n)^k states,
    or in other words it requires exactly n*k bits.

    10/10/10 is just a schoolbook example of partitioning :)

    also, floats are an issue, they can't be predicted as easily, and their actual bit pattern is normally inaccessible, though from what I've seen, internally they're mapped as-is to an Int32.

    I've seen the hash gen he used in sensitive algorithms that attempt to detect Vector3 point overlaps in O(1). for some reason, people think it's enough to multiply all the elements with some small prime numbers, BUT, to their credit, this actually works to some extent with floats, unlike with ints, mostly because floats have a pretty random mantissa anyway, so that's that.

    doing this with integers is at least an order of magnitude worse than its intended application.

    you can make integers up to 1023 with that hash gen (in fact you can do Z up to 4095).

    if you make your x/y/z space 2048x1024x2048, you can do
    int hash = X ^ (Y << 11) ^ (Z << 21);
    :p
     
  22. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    Thanks for the reply. By the way, if I have a struct with more fields, do I just keep increasing the shift by 10? Example:
    Code (CSharp):
    int hash = X ^ (Y << 11) ^ (Z << 21) ^ (A << 31) ^ (B << 41) ^ (C << 51);
    Is there an upper limit to the number of ints in a struct that can be hashed like this?
     
  23. Antistone

    Antistone

    Joined:
    Feb 22, 2014
    Posts:
    2,833
    There's a limit due to the fact that an int only has 32 bits.

    Numbers from 0 to 1023 only take 10 bits to represent. So we can use 10 bits for X, a completely separate 10 bits for Y, and another completely separate 10 bits for Z, and still fit them all into a 32-bit integer.

    If you only needed coordinates from 0 to 31, then you'd only need 5 bits per number, and you could store 6 numbers in an int without overlapping. On the other hand, if your coordinates go up to 4095, then you'd need 12 bits per number, and you could only fit 2 of them.

    If the total number of bits you're actually using across all numbers is more than 32, then you can't fit them into a single int without overlapping. In that case, you're forced to make some combinations hash to the same result. You want to try to make it so that combinations you are likely to use at the same time hash to different results.

    When two different keys hash to the same result, that's called a "collision". A Dictionary will still work, but it becomes less efficient the more collisions you have.

    A "perfect hash" where every result you care about maps to a different result isn't usually practical. But you happen to be lucky in that you get to have one in this particular case.
     
  24. brigas

    brigas

    Joined:
    Oct 4, 2014
    Posts:
    522
    Got it, thanks!
     
  25. orionsyndrome

    orionsyndrome

    Joined:
    May 4, 2014
    Posts:
    3,043
    this is why
    Code (csharp):
    ZZZZZZZZ ZZZYYYYY YYYYYXXX XXXXXXXX
    33222222 22221111 111111
    10987654 32109876 54321098 76543210
               ^          ^           ^
    I've partitioned 32 bits of int into 11 bits of X, 10 bits of Y, and again 11 bits of Z
    that means that Y cannot represent more than 2^10 states or 1024 [0 .. 1023]
    likewise, X and Z cannot represent more than 2^11 states or 2048 [0 .. 2047]

    there are ways to make sure that the bits can't leak to other values' spaces.
    but this wouldn't yield anything of use to you in this case because it's a hash gen.
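    for reference, a sketch of that 11/10/11 partition packed and unpacked (names are just for illustration):

    Code (csharp):
    // 11 bits X (0..2047), 10 bits Y (0..1023), 11 bits Z (0..2047)
    static int Pack(int x, int y, int z) => x ^ (y << 11) ^ (z << 21);

    // unpacking only works because each input stays inside its own bit range,
    // so the xor never mixes fields
    static (int x, int y, int z) Unpack(int hash) =>
        (hash & 0x7FF, (hash >> 11) & 0x3FF, (hash >> 21) & 0x7FF);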