
[ShowCase] Tweening (progress and performance optimizations)

Discussion in 'Entity Component System' started by tertle, Jul 17, 2019.

  1. tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
    Table of Contents
    Problem 1: Memory Allocations (this post)

    Problem 2: Performance concerns

    Foreword

    As I work on this project, I thought it'd be interesting to write about some of the performance optimizations I make, as someone may find them useful in the future. I've only worked on this for 2 days, but I'm extremely excited by it and felt the need to share.

    This first post is really just going to discuss memory allocation, however I intend to update this thread in the future with progress and other tips.

    Background
    So a tweening library isn't new to ECS; there have been a few nice ones posted on here over the past year, however they all follow a very similar approach: each type of tween has its own system and component which is attached to an Entity.

    One example I found was PlasticTween: https://github.com/PlasticApps/PlasticTween

    Great library, but I love performance and really love pushing DOTS to its limits, and as many of you know, constantly adding and removing components has performance limitations. There are other limitations too: you can't have multiple of the same tween on an entity, and if you're adding your tweens with an ECB you run into issues trying to add 2 of the same component in the same frame.

    I had this idea come to me suddenly the other day while I was lying down, and I've basically been coding it ever since: a better and extremely fast tweening library.

    Requirements
    Let's start with some requirements.

    - No components that have to be added or removed from entities
    - No sync points
    - Full thread optimization
    - Multiple of the same tween attached to an entity
    - Stateless

    Solution
    The solution I came up with was actually DynamicBuffers. Yep. How? A fake inheritance solution (somewhat inspired by Unity.Physics but implemented differently).

    Instead of attaching a component to an entity, a simple Tween component is added to a buffer. This tween component is a bit special though and holds some specific data.

    It currently looks something like this

    Code (CSharp):
    public struct TweenElement : ISystemStateBufferElementData
    {
        public Tween Value;
    }

    public unsafe struct Tween
    {
        private Header header;
    }

    public unsafe struct Header
    {
        public TweenType TweenType;
        public byte Type;
        public byte* Ptr;
    }
    However, this is not something you'd want to be adding yourself to an Entity. You would not even know where to start and would likely constantly make mistakes. Instead you use a custom TweenBuffer to add tweens to entities. Externally I designed it to work very similar to EntityCommandBuffer (internally it's quite different).

    Code (CSharp):
    [BurstCompile]
    private struct AddTweenJob : IJobForEachWithEntity<TestComponent>
    {
        public TweenBuffer TweenBuffer;

        public void Execute(Entity entity, int index, [ReadOnly] ref TestComponent c)
        {
            for (var i = 0; i < TweensPerEntity; i++)
            {
                this.TweenBuffer.MoveTo(entity, new float3(1));
            }
        }
    }
    This will eventually (in a later system) convert the Tween data, pin it, and add it to the buffer on the Entity. The user never has to deal with any of it. The only sync point that ever exists is the first time the buffer is added to an Entity; after that, no sync points are required for the entire system.
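    To make the flow concrete, here is a rough sketch of what that later playback step could look like. This is my illustration only, not the library's actual code; the names (pendingTweens, CreatePinned) are hypothetical and the real internals are certainly different.

    Code (CSharp):
    // Hypothetical playback sketch: drain the queued tween commands and
    // append each one to the entity's DynamicBuffer<TweenElement>.
    foreach (var command in this.pendingTweens)
    {
        // The one and only sync point: adding the buffer the first time.
        if (!this.EntityManager.HasComponent<TweenElement>(command.Entity))
        {
            this.EntityManager.AddBuffer<TweenElement>(command.Entity);
        }

        // Pin the tween data into stable memory and store the handle.
        var tween = Tween.CreatePinned(command);
        this.EntityManager.GetBuffer<TweenElement>(command.Entity)
            .Add(new TweenElement { Value = tween });
    }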

    So the first hurdle in building this solution, and the point of this first post is the memory allocation of pinning the Tween.

    Problem 1. Memory Allocation
    Malloc
    For the first iteration I was inspired by BlocAllocator, and Malloc was simply used like this:

    Code (CSharp):
    internal static Tween Create(void* ptr, int length)
    {
        byte* buffer = (byte*)UnsafeUtility.Malloc(length, 4, Allocator.Persistent);
        UnsafeUtility.MemCpy(buffer, ptr, length);

        var tween = default(Tween);
        tween.header.Ptr = buffer;
        return tween;
    }
    This was good and worked well. However, once I hit about 10k tweens being added per frame I started running into performance issues: it was taking about 4-6ms to allocate these components. Now many would probably find this to be fine, and truthfully, for the majority of tasks you're not going to be adding 10k tweens every frame. However, I was aiming for more.

    Pooling
    Instead of allocating every time I needed to create a new Tween, what if we could somehow pool our memory? Classic object pooling, where you'd keep a collection of each type you wanted to pool, is seriously difficult here; it's a lot harder with structs, Burst and threads.

    So instead, a large chunk of memory was allocated for the pool, broken into fixed-size, indexed chunks, and this is used to create Tweens.

    It looks a little like this. Just a huge chunk of allocated memory.

    Code (CSharp):
    public static void AllocatePool(int length, int size, out MemoryPoolData* outBuf)
    {
        var data = (MemoryPoolData*)UnsafeUtility.Malloc(
            sizeof(MemoryPoolData),
            UnsafeUtility.AlignOf<MemoryPoolData>(),
            Allocator.Persistent);

        data->Buffer = (byte*)UnsafeUtility.Malloc(size * length, 64, Allocator.Persistent);
        // ...
        outBuf = data;
    }
    You can kind of consider this Pool to be a custom native container that somewhat works like a non-generic list (where every element takes the same memory, even if it's not all required).

    This has a few nice advantages. It's easy to dispose, as we can just free the entire pool when shutting down; it also works in Burst, and a simple Interlocked increment

    Code (CSharp):
    var indexOfIndex = Interlocked.Increment(ref this.data->Index) - 1;
    lets us use this in threaded jobs.
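    Putting those pieces together, the pool accessors could be sketched roughly like this (my own illustration based on the snippets in this thread; wrap-around and pool-exhaustion handling omitted):

    Code (CSharp):
    // Hand out the next free slot index; Interlocked makes this safe
    // to call from multiple job threads at once.
    internal int GetNextIndex()
    {
        return Interlocked.Increment(ref this.data->Index) - 1;
    }

    // Every slot is Size bytes, so a slot's address is simple arithmetic
    // from the start of the one big buffer.
    internal byte* GetData(int index)
    {
        return (byte*)this.data->Buffer + index * this.data->Size;
    }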

    After implementing this we can use it like this.

    Code (CSharp):
    int index = this.Pool.GetNextIndex();
    byte* ptr = this.Pool.GetData(index);

    // ..

    // just copy our tween to the address the same as before
    internal static Tween Create(void* tweenPtr, byte* buffer, int length)
    {
        UnsafeUtility.MemCpy(buffer, tweenPtr, length);
        var tween = default(Tween);
        tween.header.Ptr = buffer;
        return tween;
    }
    The index is returned separately from the memory for future use, to allow efficient removal of Tweens from entities and returning memory to the pool. (Typing this, though, I've realized I can just recover the index from the buffer address.)
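    For the record, recovering the index from the address is just the inverse of the slot arithmetic, assuming fixed-size slots in one contiguous buffer (again, a sketch rather than actual library code):

    Code (CSharp):
    // ptr was handed out by the pool, so it lies inside the contiguous buffer;
    // the index is simply the byte offset divided by the slot size.
    internal int IndexOf(byte* ptr)
    {
        return (int)(ptr - (byte*)this.data->Buffer) / this.data->Size;
    }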

    Anyway, it's time for some results.

    Benchmarks
    [Image: upload_2019-7-17_19-17-55.png (Malloc vs. pooled allocation benchmark)]

    This is benchmarked in the Editor, and you'd expect a significant boost in an actual build, but already you can see the huge difference a persistent block of memory makes: a 20-30x performance increase over having to allocate every time. This was benchmarked on a 6+ year old 4-core 3570K; the raw numbers are less important than the % difference between methods.

    So the results are pretty staggering. Using this solution I can now add 100,000 tweens to entities in milliseconds, about 10x faster than allocating the tween on every creation.

    For comparison, it takes 193ms to add and remove 10,000 component tweens using entity command buffers (and that is on the main thread). This solution effectively adds 10x as many in a 65th of the time.

    Where Next (Problem 2?)
    You might be thinking: great, you're adding some data to entities, but you aren't actually tweening anything. True, I haven't demonstrated anything today, but the system already works for tweening. I'll be discussing the actual tweening jobs and optimizations next, and how (fake) inheritance lets me efficiently execute well-threaded work on all the tweens. Stay tuned.
     
    Last edited: Jul 19, 2019
    UniqueCode, curbol, Igualop and 7 others like this.
  2. Deleted User (Guest)

    Such a good post. Liked the part about Malloc (TLSF), which is dramatically slow. Ever considered using smmalloc? https://github.com/nxrighthere/Smmalloc-CSharp
     
    Last edited by a moderator: Jul 21, 2019
  3. eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    2,685
    Just a side question: AlignOf<T> in UnsafeUtility is hardcoded to 4, so isn't misaligned memory access possible?
    For example, if your MemoryPoolData is something like:
    Code (CSharp):
    struct MemoryPoolData
    {
        public int A;
        public long B;
    }
    In this case AlignOf should be 8, but UnsafeUtility returns 4, shouldn't it?
     
    Last edited: Jul 17, 2019
  4. tertle
    I am aware AlignOf always returns 4 but I'm not really sure.

    My pool data looks like this

    Code (CSharp):
    internal unsafe struct MemoryPoolData : IDisposable
    {
        public void* Buffer;
        public void* Free;
        public int Length;
        public int Size;
        public int Index;
        // ...
    }
    I have not tried it, though at first glance it does not appear to be particularly job- and Burst-friendly?
     
  5. Deleted User (Guest)

    It is; you can just call a static extern method with an IntPtr of smallocAllocator. The results are pretty impressive: somewhere around Temp-speed allocation. As for alignment, it will return 4 regardless of T's size. Here is an approach that I have taken from @fholm:

    Code (CSharp):
    public static int GetAlignmentForArrayElement(int elementSize)
    {
        switch (elementSize)
        {
            case 8: return 8;
            case 16: return 16;
            case 32: return 32;
            case 64: return 64;
            default: return 4;
        }
    }
    Code (CSharp):
    var alignment = GetAlignmentForArrayElement(sizeof(T));
     
    Last edited by a moderator: Jul 18, 2019
  6. tertle
    I'd love to test it, but I cannot for the life of me get smmalloc (native) to compile to x64.

    -edit-

    Solved: closed Visual Studio, loaded Rider, compiled, success...

    -edit2-

    Got it working, but it starts to crash Unity once I'm doing around 5-10k mallocs/frame. Going to put it aside for now and will probably look at it again in the future when I can be bothered figuring out what's up.
     
    Last edited: Jul 18, 2019
  7. tertle
    Problem 2. Performance concerns
    Another day, and here are some of the performance areas I've been concerned about, and why they have not turned out to be major issues!

    BufferFromEntity.Exists(entity) is extremely fast.
    I was concerned that my idea of only adding the buffer when required would need too many Exists checks, so much so that in my first iteration I tried to work around it by grouping. However, after realizing my solution wouldn't scale how I wanted (with multiple buffer types, it would add all buffers when one tween was added, even if they were never used), I decided to go back to checking and benchmark it.

    While again unscientific, iterating over 10,000 entities did not even add 0.01ms in the profiler.

    Yeah I'm going to stick with that.

    Code (CSharp):
    [NativeDisableParallelForRestriction]
    public BufferFromEntity<TB> Buffer;

    public NativeHashMap<Entity, Empty>.Concurrent EntitiesWithoutBuffer;

    /// <inheritdoc />
    public void ExecuteNext(Entity entity, T tween)
    {
        if (!this.Buffer.Exists(entity))
        {
            this.EntitiesWithoutBuffer.TryAdd(entity, default);
            return;
        }
        // ...
    }
    Polling early out is ok!
    The next issue with the design is that once an entity has had a tween, the buffer always exists.

    To start with, let's have a sneak peek at the actual performance of the tweening operations. Here are 100,000 entities doing various translation activities; I'll discuss more about how the Translation job works in the next post.

    [Image: upload_2019-7-18_15-51-45.png (translation tween benchmark, 100,000 entities)]
    Again benchmarked in the Editor with safety checks; expect speed-ups in a build.

    So the issue here is that, as we are storing the tweens in a buffer instead of using components, if these 100,000 entities are given a tween at some point they will forever have the buffer. This means they will still show up in a query for this job.

    A common suggestion on these forums is, instead of adding/removing components, to simply poll and exit early. So if I do something like this:

    Code (CSharp):
    [RequireComponentTag(typeof(TweenTranslationElement))]
    [BurstCompile]
    internal struct TranslationJob : IJobForEachWithEntity<Translation> // IJobForEach_B still broken
    {
        [NativeDisableParallelForRestriction]
        public BufferFromEntity<TweenTranslationElement> Buffer;

        /// <inheritdoc />
        public void Execute(Entity entity, int index, ref Translation translation)
        {
            var tweens = this.Buffer[entity];
            if (tweens.Length == 0)
            {
                return;
            }
            // ...
        }
    }
    how costly is this really going to be?

    [Image: upload_2019-7-18_16-6-11.png (early-out polling cost, 100,000 entities)]

    A bit more than I'd like, but not too bad.
    Again, this is 100,000 entities, and it goes to show how fast performing the actual translations is. For the record, I did quickly test this in a build without safety checks and it was taking around 0.95 to 1.2ms, so slightly faster but not hugely.

    So the performance here is a little worse than I'd like for situations where you tween a lot of entities and they then sit idle for a while; then again, this is 100k entities, which is a lot. In the future, to push the system even further, I might look at some type of solution that slowly removes the buffer from groups of entities over time.
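    Purely as speculation on my part (this is not something shown in the thread), one shape that could take is a low-frequency cleanup pass in a ComponentSystem that strips the buffer from entities whose tween list is empty, amortizing the sync point:

    Code (CSharp):
    // Speculative sketch: every CleanupInterval frames, queue removal of the
    // (now empty) tween buffer so idle entities drop out of the tween queries.
    if (++this.frameCounter % CleanupInterval == 0)
    {
        var ecb = this.barrier.CreateCommandBuffer();

        this.Entities.ForEach((Entity entity, DynamicBuffer<TweenTranslationElement> tweens) =>
        {
            if (tweens.Length == 0)
            {
                ecb.RemoveComponent<TweenTranslationElement>(entity);
            }
        });
    }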
     
    psuong, Deleted User and Greenwar like this.
  8. Greenwar

    Joined:
    Oct 11, 2014
    Posts:
    54
    This is so good. I'm curious how the performance on a small number of entities and tweens (20-50) compares to a larger amount, considering that unsafe code sometimes only pays dividends on larger iterations.
     
  9. tertle
    Oh, the cost at a small entity count is so small it's not even really worth benching.

    But here I go: 50 entities, 50 tweens, all benched in the Editor with safety checks on.

    Cost of adding 50 tweens: a combined time of about 0.05ms

    [Image: upload_2019-7-18_21-7-14.png]

    Cost of executing 50 translation tweens: 0.01ms

    [Image: upload_2019-7-18_21-2-52.png]

    Cost of removing 50 expired tweens: 0.01ms

    [Image: upload_2019-7-18_21-7-56.png]
     
    Last edited: Jul 18, 2019
    Seb-1814 likes this.
  10. Greenwar
    Yeah, that's negligible. Nice.
    I was thinking: is there any room for using stackalloc here? You mentioned earlier that you were malloc'ing per frame, which allocates on the heap if I'm not mistaken, unless it's used in a Burst job. I don't know, just thinking of ways to cram as much out of it as possible.
     
  11. tertle
    I no longer malloc every frame; that was the first post! My first attempts did malloc every frame, but I found it way too slow, so now I just allocate one chunk of memory and reuse it for all tweens.

    The mention of mallocing every frame was in regards to smmalloc crashing. As far as I can tell, smmalloc appears to do something very similar: allocate a chunk of memory and reuse it. Getting a chunk of this memory is still called malloc in the library.