ECS and Voxel engine

Discussion in 'Entity Component System' started by RemiArturowicz, Feb 8, 2019.

  1. illinar

    illinar

    Joined:
    Apr 6, 2011
    Posts:
    863
    VERY interesting results here...

    No chunks, straight up NativeArray<int>, converting 3D index to 1D and reading an int 1 million times: 1.7ms.

    HashMap with 1,000,000 capacity and 100,000 int64 keys stored (the keys also converted from 3D positions), 1 million reads: 2ms

    In a job.

    Almost no difference. Now imagine this with chunks: if you have chunks, the HashMap comes out faster. So I'm going for the hashmap as the easiest, simplest, and potentially fastest option.
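
    For illustration, here is a minimal sketch of the kind of read test described above, assuming the indices are converted on the fly (names and sizes are my assumptions, not the actual benchmark code):

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [BurstCompile]
    struct ReadVoxelsJob : IJob
    {
        [ReadOnly] public NativeArray<int> Voxels;      // flat voxel data, Size.x * Size.y * Size.z entries
        [ReadOnly] public NativeArray<int3> Positions;  // 1,000,000 positions to read
        public int3 Size;                               // world size in voxels
        public NativeArray<int> Result;                 // keeps the reads from being optimized away

        public void Execute()
        {
            int sum = 0;
            for (int i = 0; i < Positions.Length; i++)
            {
                int3 p = Positions[i];
                int index = p.x + Size.x * (p.y + Size.y * p.z); // 3D -> 1D
                sum += Voxels[index];
            }
            Result[0] = sum;
        }
    }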

    EDIT: Actually, HashMap performance will vary a lot based on its size: the bigger it is, the slower it gets, and some other factors are in play too. I can't reproduce the 2ms result in a new test; the best I get is 4-8ms.
     
    Last edited: Dec 28, 2019
    NotaNaN likes this.
  2. illinar

    illinar

    Joined:
    Apr 6, 2011
    Posts:
    863
    To put it into perspective though (!), accessing a managed array 1 million times takes 31ms on my budget PC. So it's pretty impressive what HashMap and NativeArray "can do" in a job.

    So with Burst even this monstrosity is as good as your standard "array[ i ]" :
    VoxelBuffers[ChunkBuffers[ChunkBufferEntity][GetChunkIndex(chunkPosition)].ChunkEntity][GetVoxelIndex(voxelPosition)].Value;


    Would you like a ridiculous number? Without Burst the 2ms on the NativeArray becomes 130ms, so all the credit goes to Burst.

    EDIT: I can't make the NativeHashMap that fast again; NativeArray is the fastest now. The idea is to have just one array with a world offset and to move data within it when new chunks are loaded. That should be easy in my case: I will iterate over objects and move their voxels, so I don't need to read and move all the "air" voxels. This can be split across many frames, so it should be smooth.
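
    A minimal sketch of that "one array plus a world offset" idea, with all names being my assumptions rather than the actual implementation:

    Code (CSharp):
    using Unity.Collections;
    using Unity.Mathematics;

    public struct ScrollingVoxelVolume
    {
        public NativeArray<int> Voxels; // one flat array covering the currently loaded area
        public int3 Size;               // size of the loaded area in voxels
        public int3 WorldOrigin;        // world position of the array's (0,0,0) corner

        // Convert an absolute world voxel position to a flat index into Voxels.
        public int ToIndex(int3 worldPos)
        {
            int3 p = worldPos - WorldOrigin;
            return p.x + Size.x * (p.y + Size.y * p.z);
        }

        public int Read(int3 worldPos) => Voxels[ToIndex(worldPos)];
        public void Write(int3 worldPos, int value) => Voxels[ToIndex(worldPos)] = value;
    }

    When new chunks are loaded, WorldOrigin would be shifted and only the voxels of tracked objects rewritten at their new indices, which is the part that can be spread over several frames.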
     
    Last edited: Dec 28, 2019
    andywatts, Sarkahn and NotaNaN like this.
  3. Sarkahn

    Sarkahn

    Joined:
    Jan 9, 2013
    Posts:
    440
    My idea turned out to be a bust, more or less. I created a small test just to see whether it would work and how fast it would be. The idea was to take a massive list of block operations at arbitrary world positions and divide them up by chunk index.

    I started with a million of these:

    Code (CSharp):
    [System.Serializable]
    public struct ChunkOperation
    {
        public int3 Pos;
        public int Value;
    }

    They were initialized to completely random world positions within a range of 10x10x10 chunks. I passed them into a processing job:

    Code (CSharp):
        var workWriter = _workMap.AsParallelWriter();

        Entities
            .WithoutBurst()
            .ForEach((in DynamicBuffer<ChunkWork> workBuffer) =>
            {
                inputDeps = new DivideWork
                {
                    ChunkSize = ChunkSize,
                    Operations = workBuffer.Reinterpret<ChunkOperation>().AsNativeArray(),
                    WorkWriter = workWriter
                }.Schedule(workBuffer.Length, 256, inputDeps);
            }).Run();
    The job calculates each operation's chunk index and stuffs it into a NativeMultiHashMap:

    Code (CSharp):
    [BurstCompile]
    struct DivideWork : IJobParallelFor
    {
        [WriteOnly]
        public NativeMultiHashMap<int3, ChunkOperation>.ParallelWriter WorkWriter;

        [ReadOnly]
        public NativeArray<ChunkOperation> Operations;

        public int3 ChunkSize;

        public void Execute(int index)
        {
            var work = Operations[index];
            var worldPos = work.Pos;

            var chunkIndex = (int3)math.floor(worldPos / ChunkSize);
            // Convert world position to local chunk position
            work.Pos = (worldPos % ChunkSize + ChunkSize) % ChunkSize;

            WorkWriter.Add(chunkIndex, work);
        }
    }
    The idea being that from there I could easily write the sorted operations to their respective chunks in parallel... but just writing to the map already takes ~12ms. To make sure it was actually the parallel write to the NMHM slowing things down, I tried doing the same thing but pushing the results to a NativeArray instead, and it drops to ~1.5ms.

    Kind of a bummer that the NMHM is so slow for this purpose, but it seems the idea of taking arbitrary, limitless 3D positions and sorting them nicely into buckets is not as simple as I was hoping - who'd have thought! Unfortunately the math required to avoid using the NMHM is beyond me, so I'll have to rethink it... or just say screw it and settle for always writing to chunks in sequence.
     
    Last edited: Dec 30, 2019
    Antypodish, NotaNaN and illinar like this.
  4. illinar

    illinar

    Joined:
    Apr 6, 2011
    Posts:
    863
    You can also try a 1D long key instead of int3. Getting the hash of an int3 should be the bottleneck when it's used as the key in the NMHM.
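
    For example, a minimal sketch of packing an int3 chunk index into a single long key (the 21-bit split is an assumption; any packing that covers your coordinate range works):

    Code (CSharp):
    using Unity.Mathematics;

    public static class ChunkKey
    {
        // Pack a signed chunk index into one 64-bit key, 21 bits per axis (~+/- 1,000,000 range per axis).
        public static long Pack(int3 chunkIndex)
        {
            const long mask = (1L << 21) - 1;
            return ((long)chunkIndex.x & mask)
                 | (((long)chunkIndex.y & mask) << 21)
                 | (((long)chunkIndex.z & mask) << 42);
        }
    }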

    It is an interesting idea though. Intuitively I doubt that sorting overhead is smaller than random access or cache miss overhead, but my intuition is wrong about DOTS all the time.

    So if the NMHM is still slow with long keys (or local int keys), you can still convert all your operations to 1D indexes, and sorting them in a native array becomes very easy in theory. A native list, rather. The bummer is that you can't have a native array of native lists.

    Otherwise you'd have a bunch of lists:
    NativeArray[NativeList[0-1000], NativeList[1000-2000], etc];
    Then you'd convert the voxel position to its 1D index, divide it by 1000, and you'd know which list to put it into:

    buckets[voxelOperation.voxelIndex / 1000].Add(voxelOperation);


    Then you'd go through the buckets in a parallel job. I'm pretty sure, though, that all these extra write operations will be a much bigger overhead than just reading and writing the voxel data directly in random order, cache misses and all.
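
    A rough single-threaded sketch of that bucketing, done with plain NativeArrays since a NativeArray of NativeLists isn't possible (all names and the per-bucket layout are my assumptions):

    Code (CSharp):
    using Unity.Collections;

    public static class OperationBuckets
    {
        // Scatter operations (given as 1D voxel indices) into contiguous per-bucket
        // ranges of `sorted`. `bucketStarts` must be zero-initialized, one entry per bucket.
        public static void Sort(NativeArray<int> voxelIndices, int bucketSize,
                                NativeArray<int> bucketStarts, NativeArray<int> sorted)
        {
            // Pass 1: count how many operations fall into each bucket.
            for (int i = 0; i < voxelIndices.Length; i++)
                bucketStarts[voxelIndices[i] / bucketSize]++;

            // Pass 2: exclusive prefix sum turns counts into start offsets.
            int running = 0;
            for (int b = 0; b < bucketStarts.Length; b++)
            {
                int count = bucketStarts[b];
                bucketStarts[b] = running;
                running += count;
            }

            // Pass 3: scatter every index into its bucket's range.
            var cursor = new NativeArray<int>(bucketStarts.Length, Allocator.Temp);
            cursor.CopyFrom(bucketStarts);
            for (int i = 0; i < voxelIndices.Length; i++)
            {
                int b = voxelIndices[i] / bucketSize;
                sorted[cursor[b]++] = voxelIndices[i];
            }
            cursor.Dispose();
        }
    }

    Each bucket then occupies the range [bucketStarts[b], bucketStarts[b + 1]) of `sorted` (the last bucket ends at the array's length), so a parallel job could process one bucket per index. Whether that actually beats random writes is exactly the kind of base premise worth benchmarking first.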
     
    Last edited: Dec 30, 2019
    NotaNaN and Sarkahn like this.
  5. Sarkahn

    Sarkahn

    Joined:
    Jan 9, 2013
    Posts:
    440
    Well, for the sake of completeness I wrote the second half of the "DivideWork" approach and a linear version for comparison. Both cases operate in an ideal scenario where all the chunks have already been initialized, so they can just do their work.

    Linear version:

    Code (CSharp):
    [BurstCompile]
    struct PerformWork : IJob
    {
        [ReadOnly]
        public NativeArray<ChunkOperation> Operations;

        [ReadOnly]
        public NativeHashMap<int3, Entity> ChunkMap;

        [WriteOnly]
        public BufferFromEntity<ChunkData> ChunkDataFromEntity;

        public int3 ChunkSize;

        public void Execute()
        {
            for( int i = 0; i < Operations.Length; ++i )
            {
                var work = Operations[i];
                var worldPos = work.Pos;

                var chunkIndex = (int3)math.floor(worldPos / ChunkSize);
                // Convert world position to local chunk position
                int3 p = (worldPos % ChunkSize + ChunkSize) % ChunkSize;
                int localIndex = p.x + ChunkSize.x * (p.y + ChunkSize.y * p.z);

                var chunkData = ChunkDataFromEntity[ChunkMap[chunkIndex]];
                chunkData[localIndex] = work.Value;
            }
        }
    }
    Parallel version:

    Code (CSharp):
    [BurstCompile]
    struct DivideWork : IJobParallelFor
    {
        [WriteOnly]
        public NativeMultiHashMap<int3, ChunkOperation>.ParallelWriter WorkWriter;

        [ReadOnly]
        public NativeArray<ChunkOperation> Operations;

        public int3 ChunkSize;

        public void Execute(int index)
        {
            var work = Operations[index];
            var worldPos = work.Pos;

            var chunkIndex = (int3)math.floor(worldPos / ChunkSize);
            // Convert world position to local chunk position
            work.Pos = (worldPos % ChunkSize + ChunkSize) % ChunkSize;

            WorkWriter.Add(chunkIndex, work);
        }
    }

    [BurstCompile]
    struct PerformWorkParallel : IJobNativeMultiHashMapVisitKeyValue<int3, ChunkOperation>
    {
        [NativeDisableParallelForRestriction]
        public BufferFromEntity<ChunkData> ChunkDataFromEntity;

        [ReadOnly]
        public NativeHashMap<int3, Entity> ChunkMap;

        public int3 ChunkSize;

        public void ExecuteNext(int3 key, ChunkOperation work)
        {
            var p = work.Pos;

            int localIndex = p.x + ChunkSize.x * (p.y + ChunkSize.y * p.z);
            var chunkData = ChunkDataFromEntity[ChunkMap[key]];
            chunkData[localIndex] = work.Value;
        }
    }
    Complete source of my tests:
    Code (CSharp):
    using System.Collections;
    using System.Collections.Generic;
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using Unity.Entities;
    using Unity.Jobs;
    using Unity.Mathematics;
    using UnityEngine;
    using UnityEngine.Profiling;
    using Random = UnityEngine.Random;


    [System.Serializable]
    public struct ChunkData : IBufferElementData
    {
        public int Value;
        public static implicit operator int(ChunkData c) => c.Value;
        public static implicit operator ChunkData(int v) => new ChunkData { Value = v };
    }

    [System.Serializable]
    public struct ChunkWork : IBufferElementData
    {
        public ChunkOperation Value;
        public static implicit operator ChunkOperation(ChunkWork c) => c.Value;
        public static implicit operator ChunkWork(ChunkOperation v) => new ChunkWork { Value = v };
    }

    [System.Serializable]
    public struct ChunkOperation
    {
        public int3 Pos;
        public int Value;
    }

    public class ProcessHugeBuffersTest : JobComponentSystem
    {
        const int ChunkSizeX = 16;
        const int ChunkSizeY = 16;
        const int ChunkSizeZ = 16;
        static int3 ChunkSize => new int3(ChunkSizeX, ChunkSizeY, ChunkSizeZ);

        const int ChunkCountX = 10;
        const int ChunkCountY = 10;
        const int ChunkCountZ = 10;


        NativeHashMap<int3, Entity> _chunkMap;
        NativeMultiHashMap<int3, ChunkOperation> _workMap;

        void CreateWork()
        {
            _workMap = new NativeMultiHashMap<int3, ChunkOperation>(WorkCount, Allocator.Persistent);

            int maxX = ChunkSizeX * ChunkCountX;
            int maxY = ChunkSizeY * ChunkCountY;
            int maxZ = ChunkSizeZ * ChunkCountZ;

            var workEntity = EntityManager.CreateEntity(typeof(ChunkWork));
            //EntityManager.SetName(workEntity, "Chunk Work");
            var workBuffer = EntityManager.GetBuffer<ChunkWork>(workEntity);
            workBuffer.ResizeUninitialized(WorkCount);
            for (int i = 0; i < WorkCount; ++i)
            {
                var operation = new ChunkOperation
                {
                    Value = 5,
                    Pos = new int3(Random.Range(0, maxX - 1), Random.Range(0, maxY - 1), Random.Range(0, maxZ - 1)),
                };

                workBuffer[i] = operation;
            }
        }

        void CreateChunks()
        {
            _chunkMap = new NativeHashMap<int3, Entity>(ChunkCountX * ChunkCountY * ChunkCountZ, Allocator.Persistent);

            for (int x = 0; x < ChunkCountX; ++x)
                for (int y = 0; y < ChunkCountY; ++y)
                    for (int z = 0; z < ChunkCountZ; ++z)
                    {
                        int3 index = new int3(x, y, z);
                        var e = EntityManager.CreateEntity(typeof(ChunkData));
                        //EntityManager.SetName(e, $"Chunk {index.ToString()}");
                        var data = EntityManager.GetBuffer<ChunkData>(e);
                        data.ResizeUninitialized(ChunkSizeX * ChunkSizeY * ChunkSizeZ);
                        for (int i = 0; i < data.Length; ++i)
                            data[i] = 0;

                        _chunkMap[index] = e;
                    }
        }

        protected override void OnCreate()
        {
            CreateChunks();
            CreateWork();
        }

        protected override void OnDestroy()
        {
            _chunkMap.Dispose();
            _workMap.Dispose();
        }

        [BurstCompile]
        struct DivideWork : IJobParallelFor
        {
            [WriteOnly]
            public NativeMultiHashMap<int3, ChunkOperation>.ParallelWriter WorkWriter;

            [ReadOnly]
            public NativeArray<ChunkOperation> Operations;

            public int3 ChunkSize;

            public void Execute(int index)
            {
                var work = Operations[index];
                var worldPos = work.Pos;

                var chunkIndex = (int3)math.floor(worldPos / ChunkSize);
                // Convert world position to local chunk position
                work.Pos = (worldPos % ChunkSize + ChunkSize) % ChunkSize;

                WorkWriter.Add(chunkIndex, work);
            }
        }

        [BurstCompile]
        struct PerformWorkParallel : IJobNativeMultiHashMapVisitKeyValue<int3, ChunkOperation>
        {
            [NativeDisableParallelForRestriction]
            public BufferFromEntity<ChunkData> ChunkDataFromEntity;

            [ReadOnly]
            public NativeHashMap<int3, Entity> ChunkMap;

            public int3 ChunkSize;

            public void ExecuteNext(int3 key, ChunkOperation work)
            {
                var p = work.Pos;

                int localIndex = p.x + ChunkSize.x * (p.y + ChunkSize.y * p.z);
                var chunkData = ChunkDataFromEntity[ChunkMap[key]];
                chunkData[localIndex] = work.Value;
            }
        }

        [BurstCompile]
        struct PerformWork : IJob
        {
            [ReadOnly]
            public NativeArray<ChunkOperation> Operations;

            [ReadOnly]
            public NativeHashMap<int3, Entity> ChunkMap;

            [WriteOnly]
            public BufferFromEntity<ChunkData> ChunkDataFromEntity;

            public int3 ChunkSize;

            public void Execute()
            {
                for( int i = 0; i < Operations.Length; ++i )
                {
                    var work = Operations[i];
                    var worldPos = work.Pos;

                    var chunkIndex = (int3)math.floor(worldPos / ChunkSize);
                    // Convert world position to local chunk position
                    int3 p = (worldPos % ChunkSize + ChunkSize) % ChunkSize;
                    int localIndex = p.x + ChunkSize.x * (p.y + ChunkSize.y * p.z);

                    var chunkData = ChunkDataFromEntity[ChunkMap[chunkIndex]];
                    chunkData[localIndex] = work.Value;
                }
            }
        }

        const int WorkCount = 10000000;

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            inputDeps = DoWorkParallel(inputDeps);

            //inputDeps = DoWorkLinear(inputDeps);

            return inputDeps;
        }

        JobHandle DoWorkParallel(JobHandle inputDeps)
        {
            var workMap = _workMap;
            var workWriter = workMap.AsParallelWriter();
            var chunkMap = _chunkMap;

            Entities
                .WithoutBurst()
                .ForEach((in DynamicBuffer<ChunkWork> workBuffer) =>
                {
                    inputDeps = new DivideWork
                    {
                        ChunkSize = ChunkSize,
                        Operations = workBuffer.Reinterpret<ChunkOperation>().AsNativeArray(),
                        WorkWriter = workWriter
                    }.Schedule(workBuffer.Length, 1024, inputDeps);
                }).Run();

            inputDeps = new PerformWorkParallel
            {
                ChunkDataFromEntity = GetBufferFromEntity<ChunkData>(false),
                ChunkMap = chunkMap,
                ChunkSize = ChunkSize
            }.Schedule(_workMap, 512, inputDeps);

            inputDeps = Job.WithCode(() =>
            {
                workMap.Clear();
            }).Schedule(inputDeps);

            return inputDeps;
        }

        JobHandle DoWorkLinear(JobHandle inputDeps)
        {
            var chunkMap = _chunkMap;
            Entities
                .WithoutBurst()
                .WithReadOnly(_chunkMap)
                .ForEach((in DynamicBuffer<ChunkWork> workBuffer) =>
                {
                    inputDeps = new PerformWork
                    {
                        ChunkSize = ChunkSize,
                        Operations = workBuffer.Reinterpret<ChunkOperation>().AsNativeArray(),
                        ChunkDataFromEntity = GetBufferFromEntity<ChunkData>(false),
                        ChunkMap = chunkMap,
                    }.Schedule(inputDeps);
                }).Run();

            return inputDeps;
        }

    }

    The results are from the profiler attached to a standalone build.

    10000 (Ten thousand operations):
    Linear: ~2.3ms
    Parallel: Divide ~0.1ms, Work ~0.1ms, ~0.2ms total

    100000 (One hundred thousand Operations):
    Linear: ~4.2ms
    Parallel: Divide ~1.15ms, Work ~1.15ms, ~3ms total

    1000000 (One million operations):
    Linear: ~41ms
    Parallel: Divide ~12.5ms, Work ~40ms, ~52.5ms total

    10000000 (Ten million operations):
    Linear: ~408ms
    Parallel: Divide ~116ms, Work ~443ms, ~559ms total

    ...doh. Literally the opposite of what I was going for: the parallel version performs better at low operation counts and gets worse as it scales up.

    But how can I use a 1D key in my NHM? The chunk index is meant to be "infinite" on the x and z axes, so converting that to a 1D key (without collisions) in the hash map is exactly the problem I need the hash map itself to solve, right?
     
    Last edited: Dec 30, 2019
    NotaNaN likes this.
  6. illinar

    illinar

    Joined:
    Apr 6, 2011
    Posts:
    863
    A 1D index can work even with an infinite world if you take an area inside that world. Your conversion method will have to be aware of that area's size and location. Something like this:

    Code (CSharp):
    public static int GetVoxelIndex(int3 pos)
    {
        pos += areaOffset; // Offset from the world's 0
        return pos.x + areaSizeX * (pos.y + areaSizeY * pos.z);
    }

    If the player gets close to the edge of that area you would need to move it and reindex the voxels, or re-add them to the hashmaps, but you have plenty of time for that, so there's no real performance impact, just some added complexity.


    Could someone just benchmark this:

    Case #1: 1,000,000 sequential reads/writes from 1,000,000 item array. Indices taken from an array of precomputed sequential indices.
    Case #2: 1,000,000 random reads/writes from 1,000,000 item array. Indices taken from an array of precomputed random indices.

    From my benchmarks I learned that one should benchmark the base premise first: for example, that the HashMap really is much slower than the NativeArray, or that cache misses really do hurt when reading randomly from a big array in this case.
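
    In case someone wants to try it, a minimal sketch of that comparison (names are my assumptions); the only difference between the two cases is how the Indices array was filled:

    Code (CSharp):
    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;

    [BurstCompile]
    struct IndexedReadWriteJob : IJob
    {
        public NativeArray<int> Values;              // the 1,000,000 item array being read and written
        [ReadOnly] public NativeArray<int> Indices;  // precomputed indices: sequential for case #1, shuffled for case #2

        public void Execute()
        {
            for (int i = 0; i < Indices.Length; i++)
            {
                int index = Indices[i];
                Values[index] = Values[index] + 1; // one read and one write per element
            }
        }
    }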
     
    Last edited: Dec 31, 2019
    Sarkahn and NotaNaN like this.