
[ShowCase] Tweening (progress and performance optimizations)

Discussion in 'Entity Component System' started by tertle, Jul 17, 2019.

  1. tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Table of Contents
    Problem 1: Memory Allocations (this post)

    Problem 2: Performance concerns

    Foreword

    As I work on this project, I thought it'd be interesting to write about some of the performance optimizations I make, as someone may find them useful in the future. I've only worked on this for 2 days, but I'm extremely excited by it and felt the need to share.

    This first post is really just going to discuss memory allocation, however I intend to update this thread in the future with progress and other tips.

    Background
    So a tweening library isn't new to ECS; there have been a few nice ones posted on here over the past year, however they all follow a very similar approach: each type of tween has its own system and component which is attached to an Entity.

    One example I found was PlasticTween: https://github.com/PlasticApps/PlasticTween

    Great library, but I love performance and really love pushing DOTS to its limits, and as many of you know, constantly adding and removing components has performance limitations. There are other limitations too: you can't have multiple of the same tween on an entity, and if you're adding your tweens with an ECB you run into issues trying to add 2 of the same component in the same frame.

    I had this idea come to me suddenly the other day while I was lying down, and I've basically been coding it ever since: a better and extremely fast tweening library.

    Requirements
    Let's start with some requirements.

    - No components that have to be added or removed from entities
    - No sync points
    - Full thread optimization
    - Multiple of the same tween attached to an entity
    - Stateless

    Solution
    The solution I came up with was actually DynamicBuffers. Yep. How? A fake inheritance solution (somewhat inspired by Unity.Physics but implemented differently).

    Instead of attaching a component to an entity, a simple Tween component is added to a buffer. This tween component is a bit special though and holds some specific data.

    It currently looks something like this

    Code (CSharp):
    public struct TweenElement : ISystemStateBufferElementData
    {
        public Tween Value;
    }

    public unsafe struct Tween
    {
        private Header header;
    }

    public unsafe struct Header
    {
        public TweenType TweenType;
        public byte Type;
        public byte* Ptr;
    }
    However, this is not something you'd want to be adding yourself to an Entity. You would not even know where to start and would likely constantly make mistakes. Instead you use a custom TweenBuffer to add tweens to entities. Externally I designed it to work very similar to EntityCommandBuffer (internally it's quite different).

    Code (CSharp):
    [BurstCompile]
    private struct AddTweenJob : IJobForEachWithEntity<TestComponent>
    {
        public TweenBuffer TweenBuffer;

        public void Execute(Entity entity, int index, [ReadOnly] ref TestComponent c)
        {
            for (var i = 0; i < TweensPerEntity; i++)
            {
                this.TweenBuffer.MoveTo(entity, new float3(1));
            }
        }
    }
    This will eventually (in a later system) convert the Tween data, pin it, and add it to the buffer on the Entity. The user never has to deal with any of it. The only sync point that ever exists is the first time the buffer is added to an Entity; after that, no sync points are required for the entire system.
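    To make the flow concrete, here is a rough sketch of what that later playback step could look like. This is my illustration only, not the library's actual code; the names (pendingTweens, CreatePinned) are hypothetical and the real internals are certainly different.

    Code (CSharp):
    // Hypothetical playback sketch: drain the queued tween commands and
    // append each one to the entity's DynamicBuffer<TweenElement>.
    foreach (var command in this.pendingTweens)
    {
        // The one and only sync point: adding the buffer the first time.
        if (!this.EntityManager.HasComponent<TweenElement>(command.Entity))
        {
            this.EntityManager.AddBuffer<TweenElement>(command.Entity);
        }

        // Pin the tween data into stable memory and store the handle.
        var tween = Tween.CreatePinned(command);
        this.EntityManager.GetBuffer<TweenElement>(command.Entity)
            .Add(new TweenElement { Value = tween });
    }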

    So the first hurdle in building this solution, and the point of this first post is the memory allocation of pinning the Tween.

    Problem 1. Memory Allocation
    Malloc
    For the first iteration I was inspired by BlocAllocator, and Malloc was simply used like this:

    Code (CSharp):
    internal static Tween Create(void* ptr, int length)
    {
        byte* buffer = (byte*)UnsafeUtility.Malloc(length, 4, Allocator.Persistent);
        UnsafeUtility.MemCpy(buffer, ptr, length);

        var tween = default(Tween);
        tween.header.Ptr = buffer;
        return tween;
    }
    This was good and worked well. However, once I hit about 10k tweens being added per frame I started running into performance issues: it was taking about 4-6ms to allocate these components. Now many would probably find this to be fine, and truthfully, for the majority of tasks you're not going to be adding 10k tweens every frame. However, I was aiming for more.

    Pooling
    Instead of allocating every time I needed to create a new Tween, what if we could somehow pool our memory? Classic object pooling, where you'd keep a collection of each type you wanted to pool, is seriously difficult here; it's a lot harder with structs, Burst and threads.

    So instead, a large chunk of memory was allocated for the pool, broken into fixed-size, indexed chunks, and this is used to create Tweens.

    It looks a little like this. Just a huge chunk of allocated memory.

    Code (CSharp):
    public static void AllocatePool(int length, int size, out MemoryPoolData* outBuf)
    {
        var data = (MemoryPoolData*)UnsafeUtility.Malloc(
            sizeof(MemoryPoolData),
            UnsafeUtility.AlignOf<MemoryPoolData>(),
            Allocator.Persistent);

        data->Buffer = (byte*)UnsafeUtility.Malloc(size * length, 64, Allocator.Persistent);
        // ...
        outBuf = data;
    }
    You can kind of consider this Pool to be a custom native container that somewhat works like a non-generic list (where every element takes the same memory, even if it's not all required).

    This has a few nice advantages. It's easy to dispose, as we can just free the entire pool when shutting down; it also works in Burst, and a simple Interlocked increment

    Code (CSharp):
    var indexOfIndex = Interlocked.Increment(ref this.data->Index) - 1;
    lets us use this in threaded jobs.
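    Putting those pieces together, the pool accessors could be sketched roughly like this (my own illustration based on the snippets in this thread; wrap-around and pool-exhaustion handling omitted):

    Code (CSharp):
    // Hand out the next free slot index; Interlocked makes this safe
    // to call from multiple job threads at once.
    internal int GetNextIndex()
    {
        return Interlocked.Increment(ref this.data->Index) - 1;
    }

    // Every slot is Size bytes, so a slot's address is simple arithmetic
    // from the start of the one big buffer.
    internal byte* GetData(int index)
    {
        return (byte*)this.data->Buffer + index * this.data->Size;
    }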

    After implementing this we can use it like this.

    Code (CSharp):
    int index = this.Pool.GetNextIndex();
    byte* ptr = this.Pool.GetData(index);

    // ..

    // just copy our tween to the address the same as before
    internal static Tween Create(void* tweenPtr, byte* buffer, int length)
    {
        UnsafeUtility.MemCpy(buffer, tweenPtr, length);
        var tween = default(Tween);
        tween.header.Ptr = buffer;
        return tween;
    }
    The index is returned separately from the memory for future use, to allow efficient removal of Tweens from entities and returning memory to the pool. (Typing this, though, I've realized I can just recover the index from the buffer address.)
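    For the record, recovering the index from the address is just the inverse of the slot arithmetic, assuming fixed-size slots in one contiguous buffer (again, a sketch rather than actual library code):

    Code (CSharp):
    // ptr was handed out by the pool, so it lies inside the contiguous buffer;
    // the index is simply the byte offset divided by the slot size.
    internal int IndexOf(byte* ptr)
    {
        return (int)(ptr - (byte*)this.data->Buffer) / this.data->Size;
    }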

    Anyway, it's time for some results.

    Benchmarks
    [Image: upload_2019-7-17_19-17-55.png (Malloc vs. pooled allocation benchmark)]

    This is benchmarked in the Editor, and you'd expect a significant boost in an actual build, but already you can see the huge difference a persistent block of memory makes: a 20-30x performance increase over having to allocate every time. This was benchmarked on a 6+ year old 4-core 3570K; the raw numbers are less important than the % difference between methods.

    So the results are pretty staggering. Using this solution I can now add 100,000 tweens to entities in milliseconds, about 10x faster than allocating the tween on every creation.

    For comparison, it takes 193ms to add and remove 10,000 component tweens using entity command buffers (and that is on the main thread). This solution effectively adds 10x as many in a 65th of the time.

    Where Next (Problem 2?)
    You might be thinking: great, you're adding some data to entities, but you aren't actually tweening anything. True, I haven't demonstrated anything today, but the system already works for tweening. I'll be discussing the actual tweening jobs and optimizations next, and how (fake) inheritance lets me efficiently execute well-threaded work on all the tweens. Stay tuned.
     
    Last edited: Jul 19, 2019
    UniqueCode, curbol, Igualop and 7 others like this.
  2. Deleted User (Guest)

    Such a good post. Liked the part about Malloc (TLSF), which is dramatically slow. Ever considered using smmalloc? https://github.com/nxrighthere/Smmalloc-CSharp
     
    Last edited by a moderator: Jul 21, 2019
  3. eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,685
    Just a side question: AlignOf<T> in UnsafeUtility is hardcoded to 4, so isn't misaligned memory access possible?
    For example, if your MemoryPoolData is something like:
    Code (CSharp):
    struct MemoryPoolData
    {
        public int A;
        public long B;
    }
    In this case AlignOf should be 8, but UnsafeUtility returns 4, shouldn't it?
     
    Last edited: Jul 17, 2019
  4. tertle
    I am aware AlignOf always returns 4 but I'm not really sure.

    My pool data looks like this

    Code (CSharp):
    internal unsafe struct MemoryPoolData : IDisposable
    {
        public void* Buffer;
        public void* Free;
        public int Length;
        public int Size;
        public int Index;
        // ...
    }
    I have not tried it, though at first glance it does not appear to be particularly job- and Burst-friendly?
     
  5. Deleted User (Guest)

    It is; you can just call a static extern method with an IntPtr of smallocAllocator. The results are pretty impressive: somewhere around Temp-speed allocation. As for alignment, it will return 4 regardless of T's size. Here is an approach that I have taken from @fholm:

    Code (CSharp):
    public static int GetAlignmentForArrayElement(int elementSize)
    {
        switch (elementSize)
        {
            case 8: return 8;
            case 16: return 16;
            case 32: return 32;
            case 64: return 64;
            default: return 4;
        }
    }
    Code (CSharp):
    var alignment = GetAlignmentForArrayElement(sizeof(T));
     
    Last edited by a moderator: Jul 18, 2019
  6. tertle
    I'd love to test it, but I cannot for the life of me get smmalloc (native) to compile to x64.

    -edit-

    Solved: closed Visual Studio, loaded Rider, compiled, success...

    -edit2-

    Got it working, but it starts to crash Unity once I'm doing around 5-10k mallocs/frame. Going to put it aside for now and will probably look at it again in the future when I can be bothered figuring out what's up.
     
    Last edited: Jul 18, 2019
  7. tertle
    Problem 2. Performance concerns
    Another day, and here are some of the performance areas I've been concerned about, and why they have not turned out to be major issues!

    BufferFromEntity.Exists(entity) is extremely fast.
    I was concerned that my idea of only adding the buffer when required would need too many Exists checks, so much so that in my first iteration I tried to work around it by grouping. However, after realizing my solution wouldn't scale how I wanted (with multiple buffer types, it would add all buffers when one tween was added, even if they were never used), I decided to go back to checking and benchmark it.

    While again unscientific, iterating over 10,000 entities did not even add 0.01ms in the profiler.

    Yeah I'm going to stick with that.

    Code (CSharp):
    [NativeDisableParallelForRestriction]
    public BufferFromEntity<TB> Buffer;

    public NativeHashMap<Entity, Empty>.Concurrent EntitiesWithoutBuffer;

    /// <inheritdoc />
    public void ExecuteNext(Entity entity, T tween)
    {
        if (!this.Buffer.Exists(entity))
        {
            this.EntitiesWithoutBuffer.TryAdd(entity, default);
            return;
        }
        // ...
    }
    Polling early out is ok!
    The next issue with the design is that once an entity has had a tween, the buffer always exists.

    To start with, let's have a sneak peek at the actual performance of the tweening operations. Here are 100,000 entities doing various translation activities; I'll discuss more about how the Translation job works in the next post.

    [Image: upload_2019-7-18_15-51-45.png (translation tween benchmark, 100,000 entities)]
    Again benchmarked in the Editor with safety checks; expect speed-ups in a build.

    So the issue here is that, as we are storing the tweens in a buffer instead of using components, if these 100,000 entities are given a tween at some point they will forever have the buffer. This means they will still show up in a query for this job.

    A common suggestion on these forums is, instead of adding/removing components, to simply poll and exit early. So if I do something like this:

    Code (CSharp):
    [RequireComponentTag(typeof(TweenTranslationElement))]
    [BurstCompile]
    internal struct TranslationJob : IJobForEachWithEntity<Translation> // IJobForEach_B still broken
    {
        [NativeDisableParallelForRestriction]
        public BufferFromEntity<TweenTranslationElement> Buffer;

        /// <inheritdoc />
        public void Execute(Entity entity, int index, ref Translation translation)
        {
            var tweens = this.Buffer[entity];
            if (tweens.Length == 0)
            {
                return;
            }
            // ...
        }
    }
    how costly is this really going to be?

    [Image: upload_2019-7-18_16-6-11.png (early-out polling cost, 100,000 entities)]

    A bit more than I'd like, but not too bad.
    Again, this is 100,000 entities, and it goes to show how fast performing the actual translations is. For the record, I did quickly test this in a build without safety checks and it was taking around 0.95 to 1.2ms, so slightly faster but not hugely.

    So the performance here is a little worse than I'd like for situations where you tween a lot of entities and they then sit idle for a while; then again, this is 100k entities, which is a lot. In the future, to push the system even further, I might look at some type of solution that slowly removes the buffer from groups of entities over time.
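    Purely as speculation on my part (this is not something shown in the thread), one shape that could take is a low-frequency cleanup pass in a ComponentSystem that strips the buffer from entities whose tween list is empty, amortizing the sync point:

    Code (CSharp):
    // Speculative sketch: every CleanupInterval frames, queue removal of the
    // (now empty) tween buffer so idle entities drop out of the tween queries.
    if (++this.frameCounter % CleanupInterval == 0)
    {
        var ecb = this.barrier.CreateCommandBuffer();

        this.Entities.ForEach((Entity entity, DynamicBuffer<TweenTranslationElement> tweens) =>
        {
            if (tweens.Length == 0)
            {
                ecb.RemoveComponent<TweenTranslationElement>(entity);
            }
        });
    }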
     
    psuong, Deleted User and Greenwar like this.
  8. Greenwar

    Joined:
    Oct 11, 2014
    Posts:
    54
    This is so good. I'm curious how the performance on a small number of entities and tweens (20-50) compares to a larger amount, considering that unsafe code sometimes only pays dividends on larger iterations.
     
  9. tertle
    Oh, the cost at a small entity count is so small it's not even really worth benching.

    But here I go: 50 entities, 50 tweens, all benched in the Editor with safety checks on.

    Cost of adding 50 tweens: a combined time of about 0.05ms

    [Image: upload_2019-7-18_21-7-14.png]

    Cost of executing 50 translation tweens: 0.01ms

    [Image: upload_2019-7-18_21-2-52.png]

    Cost of removing 50 expired tweens: 0.01ms

    [Image: upload_2019-7-18_21-7-56.png]
     
    Last edited: Jul 18, 2019
    Seb-1814 likes this.
  10. Greenwar
    Yeah, that's negligible. Nice.
    I was thinking: is there any room for using stackalloc here? You mentioned earlier that you were malloc'ing per frame, which allocates on the heap if I'm not mistaken, unless it's used in a Burst job. I don't know, just thinking of ways to cram as much out of it as possible.
     
  11. tertle
    I no longer malloc every frame; that was the first post! My first attempts did malloc every frame, but I found it way too slow, so now I just allocate one chunk of memory and reuse it for all tweens.

    The mention of mallocing every frame was in regards to smmalloc crashing. As far as I can tell, smmalloc appears to do something very similar: allocate a chunk of memory and reuse it. Getting a chunk of this memory is still called malloc in the library.