
How to share large amounts of memory in parallel for job?

Discussion in 'Entity Component System' started by pk1234dva, Sep 20, 2021.

  1. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    I've got a parallel job that will run across potentially thousands of indices.
    However, in each Execute(int index) I need a large amount of scratch memory - somewhere around 16 kB. Some of my thoughts on how to do this:

    1. The simplest solution would be to use stackalloc... but allocating that much memory, possibly in multiple threads at once, doesn't seem like a great idea.

    2. Temp allocation could maybe work, though I doubt it's meant for something like 16 kB across multiple threads (from some reading of Jackson Dunstan's posts, that seems like a bit too much, and it starts falling back to TempJob automatically after the first few iterations of the job), and it just doesn't seem like a good idea to allocate a new NativeArray in every Execute.

    3. [This seems like a valid option] Allocate 16 kB * "max number of threads", give each job access to this array, and make sure each Execute only touches the slice that corresponds to its own thread index. A bit messy, but I can imagine it working. Not too sure how to go about it, though - putting [NativeSetThreadIndex] on an int field in the parallel job gives me values like 1, 2, 3, 4 from what I've tried, so it looks like it's working.

    But I'm not completely sure how to use it, or what values an int with [NativeSetThreadIndex] can take. There's JobWorkerMaxCount, which gives me 3, but there's also MaxJobThreadCount, which is 128 - I don't know which one is the upper bound for the [NativeSetThreadIndex] field, and I don't really understand the difference between the two.

    Any help or links would be appreciated. I'm especially confused about how JobWorkerMaxCount differs from MaxJobThreadCount. I mean, what else is there besides worker threads as far as Unity goes? Shouldn't JobWorkerMaxCount be pretty much the same as MaxJobThreadCount?
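    The slicing in option 3 boils down to base-pointer arithmetic. A minimal C sketch (outside Unity; SLICE_BYTES, MAX_THREADS, and slice_for_thread are invented names, with MAX_THREADS standing in for whatever the real thread-count upper bound turns out to be):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define SLICE_BYTES (16 * 1024)  /* 16 kB of scratch per thread */
    #define MAX_THREADS 128          /* hypothetical upper bound on thread indices */

    /* Each thread carves its own slice out of one shared allocation
     * using its thread index, so slices never overlap. */
    static char *slice_for_thread(char *base, int threadIndex)
    {
        return base + (size_t)threadIndex * SLICE_BYTES;
    }

    int main(void)
    {
        char *shared = malloc((size_t)MAX_THREADS * SLICE_BYTES);

        /* Threads 0 and 1 get adjacent, non-overlapping 16 kB regions. */
        assert(slice_for_thread(shared, 1) - slice_for_thread(shared, 0) == SLICE_BYTES);
        /* The highest index still lands inside the allocation. */
        assert(slice_for_thread(shared, MAX_THREADS - 1) - shared
               == (MAX_THREADS - 1) * SLICE_BYTES);

        free(shared);
        return 0;
    }
    ```

    The whole question in the thread is what MAX_THREADS should be in Unity terms, which the later replies address.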
     
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,759
    You can allocate per thread instead of per index in a job. This works:

    Code (CSharp):
    [BurstCompile]
    public unsafe struct TestJob : IJobFor
    {
        // Required, because the safety system checks that containers are allocated when scheduling
        [NativeDisableContainerSafetyRestriction]
        private NativeArray<int> TestArray;

        public void Execute(int index)
        {
            if (!this.TestArray.IsCreated)
            {
                // This will only run once per thread
                this.TestArray = new NativeArray<int>(16 * 1024 * 1024, Allocator.Temp);
            }
            else
            {
                // Optional, if you need it cleared - or use a NativeList
                UnsafeUtility.MemClear(this.TestArray.GetUnsafePtr(), this.TestArray.Length * UnsafeUtility.SizeOf<int>());
            }

            // use array
        }
    }
    This is cleaner than option 3, imo.
     
  3. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Thanks for the quick reply.

    I'm a bit confused by the MemClear example. Is that something that's fine to do with Temp native arrays? As in, shouldn't you let Unity dispose of Temp allocations at the end of the frame, or whenever that happens?

    Besides that, yes, that looks a lot nicer, thanks. I don't know the internals of how parallel jobs work - from your example, it looks like each scheduled parallel job gets split into a certain number of identical jobs (a worker-count-related value?), so each has access to its own fields. Otherwise I have no idea how the above could work.

    So how exactly does that happen? Does each split simply get a part of the indices (say 0-100, when the forEachCount passed to Schedule is 400 and there are 4 workers)? Could you give me a reference for where I could look at how it happens?
     
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,759
    If the array has already been created, it means a previous index already filled it with data. The MemClear is there so the new index gets a clean array. If you don't need that, you can remove it for performance.

    Pretty much this. Pseudocode of a worker:

    Code (CSharp):
    var job = yourjob;

    while (Worker.RequestWorkRange(out var min, out var max))
    {
        for (var i = min; i < max; i++)
            job.Execute(i);
    }
    The actual code is in UnityEngine; decompiled with JetBrains, it looks like:

    Code (CSharp):
    [StructLayout(LayoutKind.Sequential, Size = 1)]
    internal struct ForJobStruct<T> where T : struct, IJobFor
    {
      public static readonly IntPtr jobReflectionData = JobsUtility.CreateJobReflectionData(typeof (T), (object) new IJobForExtensions.ForJobStruct<T>.ExecuteJobFunction(IJobForExtensions.ForJobStruct<T>.Execute));

      public static unsafe void Execute(
        ref T jobData,
        IntPtr additionalPtr,
        IntPtr bufferRangePatchData,
        ref JobRanges ranges,
        int jobIndex)
      {
    label_5:
        int beginIndex;
        int endIndex;
        if (!JobsUtility.GetWorkStealingRange(ref ranges, jobIndex, out beginIndex, out endIndex))
          return;
        JobsUtility.PatchBufferMinMaxRanges(bufferRangePatchData, UnsafeUtility.AddressOf<T>(ref jobData), beginIndex, endIndex - beginIndex);
        int num = endIndex;
        for (int index = beginIndex; index < num; ++index)
          jobData.Execute(index);
        goto label_5;
      }

      public delegate void ExecuteJobFunction(
        ref T data,
        IntPtr additionalPtr,
        IntPtr bufferRangePatchData,
        ref JobRanges ranges,
        int jobIndex)
        where T : struct, IJobFor;
    }
     
  5. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Thanks a ton for the explanations. I somehow thought MemClear actually frees memory, but of course that's not what it does.

    One thing I'm still wondering about: what should I do if I can't guarantee the job will finish in the same frame it starts? Is using [NativeSetThreadIndex] and accessing certain parts of an array still a valid approach? What are the possible values it can take?

    I still don't really understand the difference between JobWorkerMaxCount and MaxJobThreadCount.
     
  6. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    Isn't this exactly what stackalloc does? 16 KiB per thread is not a large amount of memory. If it were 16 MB per thread, tertle's solution would clearly be the way to go, but the simpler way is just...

    Code (CSharp):
    [BurstCompile]
    public unsafe struct TestJob : IJobParallelFor
    {
        public void Execute(int index)
        {
            byte* bytes = stackalloc byte[1024 * 16];
        }
    }
    stackalloc doesn't actually "allocate" anything; it just bumps the stack pointer and zeroes the memory. Once the function returns (e.g. you complete an index), it's freed implicitly.
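    The same pattern in plain C, where a fixed-size local array plays the role of stackalloc: the 16 KiB lives in the function's stack frame and disappears when the frame pops (a sketch, not Unity code; reverse_into is an invented example function):

    ```c
    #include <assert.h>
    #include <string.h>

    /* 16 KiB of scratch in the stack frame - the C analogue of stackalloc.
     * It is "freed" implicitly when the function returns. Note C does NOT
     * zero locals, unlike C# stackalloc. */
    static void reverse_into(const char *src, char *dst, int n)
    {
        char scratch[16 * 1024];
        assert(n <= (int)sizeof scratch);

        memcpy(scratch, src, (size_t)n);   /* stage the input in the scratch buffer */
        for (int i = 0; i < n; i++)
            dst[i] = scratch[n - 1 - i];   /* write it back out reversed */
    }

    int main(void)
    {
        char out[6] = {0};
        reverse_into("hello", out, 5);
        assert(memcmp(out, "olleh", 5) == 0);
        return 0;
    }
    ```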
     
  7. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,759
    Agreed, at 16 kB stackalloc is probably fine (I'm not aware of the current stack limitations, but I believe it was 1 MB for a long time on Windows?), but not everyone is comfortable using unsafe code.

    Ideally, once Span is supported (I don't think it is yet in 2020.3, right?), this becomes a lot more comfortable.
     
    Last edited: Sep 22, 2021
  8. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Thanks guys. Yeah, I guess stackalloc is fine; it just feels a bit off to me. I was under the impression that 16 kB is not that small for a stack allocation, but I might be going off various posts I've seen on the internet that are too old - technology has improved since then.
     
  9. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    If you don't actually need the memclear of stackalloc, you can decorate methods (and probably also jobs - haven't tested it yet) with Unity.Burst.CompilerServices.SkipLocalsInitAttribute since Burst 1.5, for a considerable performance gain - 128 loop iterations saved in this case, if compiling for AVX and if LLVM unrolls the loop 4 times.

    The modified version would be:
    Code (CSharp):
    [BurstCompile]
    public unsafe struct TestJob : IJobParallelFor
    {
        [SkipLocalsInit]
        public void Execute(int index)
        {
            byte* bytes = stackalloc byte[1024 * 16];
        }
    }
     
  10. Luxxuor

    Luxxuor

    Joined:
    Jul 18, 2019
    Posts:
    89
    While not zeroing locals can improve performance, it then becomes your responsibility to make sure no array element reads garbage, uninitialized memory - so do be careful.
    When in doubt, profile.
     
  11. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    I think this goes without saying.
    But if you fill that array on the stack with data in a method yourself, the zeroing-out still happens beforehand. Why? I don't know. But the Burst User Guide and disassemblies suggest that it does happen.

    There is no need for that, though. With MOVAPS memory, register having a throughput of 1 instruction per clock cycle and 16 kB having to be zeroed out, given a register size of 32 bytes (AVX), it will take about 512 clock cycles, or 512 / 3 ≈ 170 nanoseconds at 3 GHz. Basically unnoticeable, but still a waste of performance and code size, which might add up.
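    Those numbers check out. A quick sketch of the arithmetic, under the assumptions stated above (one 32-byte AVX store retired per cycle, 3 GHz clock):

    ```c
    #include <assert.h>

    int main(void)
    {
        int bytes = 16 * 1024;       /* 16 kB to zero out */
        int store_width = 32;        /* one 256-bit AVX register per store */

        int stores = bytes / store_width;  /* = cycles, at 1 store per cycle */
        double ns = stores / 3.0;          /* 3 GHz means 3 cycles per nanosecond */

        assert(stores == 512);             /* 512 stores / cycles */
        assert(ns > 170.0 && ns < 171.0);  /* ~170 ns, as stated in the thread */
        return 0;
    }
    ```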
     
  12. Luxxuor

    Luxxuor

    Joined:
    Jul 18, 2019
    Posts:
    89
    I think they have to memclear it to stay compliant with the language rules of C#.

    You're absolutely right; I just wanted to make clear that the potential gain might not be worth the potential trouble of reading uninitialized memory (especially when coming back after a few weeks/months and changing the method's logic) - I've been burned by that before myself in C/C++ land :).

    BTW: Do I understand correctly from your numbers that memclear internally uses single scalar instructions?
     
  13. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    Yeah, probably. Even though first setting anything to 0 and then to 1 makes the zeroing a dead operation. Those usually get compiled away, but that's not the case with stackalloc.

    I feel ya. Luckily, [SkipLocalsInit] sits at the very top of a method, so it's easy to spot. Pros and cons ;)

    Nope. MOVAPS is an x86 SIMD instruction ("move aligned packed singles" - move a vector of floats). I also mentioned a register size of 32 bytes, which is 256 bits and thus 4x the size of a 64-bit general-purpose register; that requires support for the AVX ("Advanced Vector eXtensions") instruction set from 2011. This is the optimal case for Burst-generated code - AVX-512 (512-bit vectors) is currently not supported by Burst. If you compile for ARM NEON or x86 SSE ("Streaming SIMD Extensions"), you have 128-bit registers, which at least doubles execution time (~340 nanoseconds).
    And since LLVM unrolls loops aggressively, the memclear code looks something like this in pseudocode:

    Code (CSharp):
    v256 ZERO_REGISTER = zero out 256 bit register
    ulong ADDRESS_BASE
    ulong LAST_ADDRESS = ADDRESS_BASE + STACKALLOC_BYTES

    LOOP:
    copy ZERO_REGISTER to address (ADDRESS_BASE + 0)
    copy ZERO_REGISTER to address (ADDRESS_BASE + 32)
    copy ZERO_REGISTER to address (ADDRESS_BASE + 64)
    copy ZERO_REGISTER to address (ADDRESS_BASE + 96)

    ADDRESS_BASE += 128
    if ADDRESS_BASE != LAST_ADDRESS goto LOOP
    ... of course leaving out the code for residuals (when fewer than 128 bytes remain to be zeroed out after the loop, which happens in a scalar fashion).
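    A hedged C rendition of that unrolled loop, including the scalar residual tail (unrolled_clear is an invented name, and plain memset calls stand in for the 32-byte vector stores, so this shows the structure, not the actual instructions):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Clears `len` bytes in 128-byte unrolled chunks of four 32-byte stores,
     * then handles any remaining bytes in a scalar tail, mirroring the pseudocode. */
    static void unrolled_clear(unsigned char *p, size_t len)
    {
        unsigned char *end128 = p + (len & ~(size_t)127);  /* largest multiple of 128 */
        while (p != end128)
        {
            memset(p +  0, 0, 32);  /* stand-in for "copy ZERO_REGISTER to address" */
            memset(p + 32, 0, 32);
            memset(p + 64, 0, 32);
            memset(p + 96, 0, 32);
            p += 128;
        }
        for (size_t i = 0; i < (len & 127); i++)  /* scalar residuals */
            p[i] = 0;
    }

    int main(void)
    {
        unsigned char buf[16 * 1024 + 37];  /* odd size to exercise the residual tail */
        memset(buf, 0xAB, sizeof buf);
        unrolled_clear(buf, sizeof buf);
        for (size_t i = 0; i < sizeof buf; i++)
            assert(buf[i] == 0);
        return 0;
    }
    ```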
     
  14. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    Is there an instruction to mark some of the pages as zeroed in the MMU instead of zeroing them manually?
     
  15. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    I'm 99% sure there is not. It sounds more like an OS responsibility, if anything - the only memory-related hardware instructions I know of are a significant number of cache instructions.
     
  16. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845