
How to share large amounts of memory in parallel for job?

Discussion in 'Entity Component System' started by pk1234dva, Sep 20, 2021.

  1. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    I've got a parallel job that will run across potentially thousands of indices.
    However, in each Execute(int index) I need a large amount of scratch memory - somewhere around 16 kB. Some of my thoughts on how to do this:

    1. The simplest solution would be to use stackalloc... but allocating that much memory, possibly in multiple threads at once, doesn't seem like a great idea.

    2. Temp allocation could maybe work, though I doubt it's meant for something like 16 kB across multiple threads (from some reading of Jackson Dunstan's posts, that seems like a bit too much, and it starts falling back to TempJob automatically after the first few iterations of the job), and it just doesn't seem like a good idea to allocate a new NativeArray in every Execute.

    3. [This seems like a valid option] Allocate 16 kB * "max number of threads", give each job access to this array, and make sure each Execute only touches the slice that corresponds to its own thread index. A bit messy, but I can imagine it working. Not too sure how to go about it, though - putting [NativeSetThreadIndex] on an int field in the parallel job gives me values like 1, 2, 3, 4 from what I've tried, so it looks like it's working.

    But I'm not completely sure how to use it, or what values an int with [NativeSetThreadIndex] can take. There's JobWorkerMaxCount, which gives me 3, but there's also MaxJobThreadCount, which is 128 - I don't know which one is the upper bound for the [NativeSetThreadIndex] field, and I don't really understand the difference between the two.

    Any help or links would be appreciated. I'm especially confused about how JobWorkerMaxCount differs from MaxJobThreadCount. I mean, what else is there besides worker threads as far as Unity goes? Shouldn't JobWorkerMaxCount be pretty much the same as MaxJobThreadCount?
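    The slicing in option 3 boils down to base-pointer arithmetic. A minimal C sketch (outside Unity; SLICE_BYTES, MAX_THREADS, and slice_for_thread are invented names, with MAX_THREADS standing in for whatever the real thread-count upper bound turns out to be):

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define SLICE_BYTES (16 * 1024)  /* 16 kB of scratch per thread */
    #define MAX_THREADS 128          /* hypothetical upper bound on thread indices */

    /* Each thread carves its own slice out of one shared allocation
     * using its thread index, so slices never overlap. */
    static char *slice_for_thread(char *base, int threadIndex)
    {
        return base + (size_t)threadIndex * SLICE_BYTES;
    }

    int main(void)
    {
        char *shared = malloc((size_t)MAX_THREADS * SLICE_BYTES);

        /* Threads 0 and 1 get adjacent, non-overlapping 16 kB regions. */
        assert(slice_for_thread(shared, 1) - slice_for_thread(shared, 0) == SLICE_BYTES);
        /* The highest index still lands inside the allocation. */
        assert(slice_for_thread(shared, MAX_THREADS - 1) - shared
               == (MAX_THREADS - 1) * SLICE_BYTES);

        free(shared);
        return 0;
    }
    ```

    The whole question in the thread is what MAX_THREADS should be in Unity terms, which the later replies address.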
     
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,759
    You can allocate per thread instead of per index in a job. This works:

    Code (CSharp):
    [BurstCompile]
    public unsafe struct TestJob : IJobFor
    {
        // Required, because the safety system checks that containers are allocated when scheduling
        [NativeDisableContainerSafetyRestriction]
        private NativeArray<int> TestArray;

        public void Execute(int index)
        {
            if (!this.TestArray.IsCreated)
            {
                // This will only run once per thread
                this.TestArray = new NativeArray<int>(16 * 1024 * 1024, Allocator.Temp);
            }
            else
            {
                // Optional, if you need it cleared - or use a NativeList
                UnsafeUtility.MemClear(this.TestArray.GetUnsafePtr(), this.TestArray.Length * UnsafeUtility.SizeOf<int>());
            }

            // use array
        }
    }
    This is cleaner than option 3, imo.
     
  3. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Thanks for the quick reply.

    I'm a bit confused by the MemClear example. Is that something that's fine to do with Temp native arrays? As in, shouldn't you let Unity dispose of Temp allocations at the end of the frame, or whenever that happens?

    Besides that, yes, that looks a lot nicer, thanks. I don't know the internals of how parallel jobs work - from your example, it looks like each scheduled parallel job gets split into a certain number of identical jobs (a worker-count-related value?), so each has access to its own fields. Otherwise I have no idea how the above could work.

    So how exactly does that happen? Does each split simply get a part of the indices (say 0-100, when the forEachCount passed to Schedule is 400 and there are 4 workers)? Could you give me a reference for where I could look at how it happens?
     
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,759
    If the array has already been created, it means a previous index already filled it with data. The MemClear is there so the new index gets a clean array. If you don't need that, you can remove it for performance.

    Pretty much this. Pseudocode of a worker:

    Code (CSharp):
    var job = yourjob;

    while (Worker.RequestWorkRange(out var min, out var max))
    {
        for (var i = min; i < max; i++)
            job.Execute(i);
    }
    The actual code is in UnityEngine; decompiled with JetBrains, it looks like:

    Code (CSharp):
    [StructLayout(LayoutKind.Sequential, Size = 1)]
    internal struct ForJobStruct<T> where T : struct, IJobFor
    {
      public static readonly IntPtr jobReflectionData = JobsUtility.CreateJobReflectionData(typeof (T), (object) new IJobForExtensions.ForJobStruct<T>.ExecuteJobFunction(IJobForExtensions.ForJobStruct<T>.Execute));

      public static unsafe void Execute(
        ref T jobData,
        IntPtr additionalPtr,
        IntPtr bufferRangePatchData,
        ref JobRanges ranges,
        int jobIndex)
      {
    label_5:
        int beginIndex;
        int endIndex;
        if (!JobsUtility.GetWorkStealingRange(ref ranges, jobIndex, out beginIndex, out endIndex))
          return;
        JobsUtility.PatchBufferMinMaxRanges(bufferRangePatchData, UnsafeUtility.AddressOf<T>(ref jobData), beginIndex, endIndex - beginIndex);
        int num = endIndex;
        for (int index = beginIndex; index < num; ++index)
          jobData.Execute(index);
        goto label_5;
      }

      public delegate void ExecuteJobFunction(
        ref T data,
        IntPtr additionalPtr,
        IntPtr bufferRangePatchData,
        ref JobRanges ranges,
        int jobIndex)
        where T : struct, IJobFor;
    }
     
  5. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Thanks a ton for the explanations. I somehow thought MemClear actually frees memory, but of course that's not what it does.

    One thing I'm still wondering about: what should I do if I can't guarantee the job will finish in the same frame it starts? Is using [NativeSetThreadIndex] and accessing certain parts of an array still a valid approach? What are the possible values it can take?

    I still don't really understand the difference between JobWorkerMaxCount and MaxJobThreadCount.
     
  6. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    Isn't this exactly what stackalloc does? 16 KiB per thread is not a large amount of memory. If it were 16 MB per thread, tertle's solution would clearly be the way to go, but the simpler way is just...

    Code (CSharp):
    [BurstCompile]
    public unsafe struct TestJob : IJobParallelFor
    {
        public void Execute(int index)
        {
            byte* bytes = stackalloc byte[1024 * 16];
        }
    }
    stackalloc doesn't actually "allocate" anything; it just bumps the stack pointer and zeroes the memory. Once the function returns (e.g. you complete an index), it's freed implicitly.
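    The same pattern in plain C, where a fixed-size local array plays the role of stackalloc: the 16 KiB lives in the function's stack frame and disappears when the frame pops (a sketch, not Unity code; reverse_into is an invented example function):

    ```c
    #include <assert.h>
    #include <string.h>

    /* 16 KiB of scratch in the stack frame - the C analogue of stackalloc.
     * It is "freed" implicitly when the function returns. Note C does NOT
     * zero locals, unlike C# stackalloc. */
    static void reverse_into(const char *src, char *dst, int n)
    {
        char scratch[16 * 1024];
        assert(n <= (int)sizeof scratch);

        memcpy(scratch, src, (size_t)n);   /* stage the input in the scratch buffer */
        for (int i = 0; i < n; i++)
            dst[i] = scratch[n - 1 - i];   /* write it back out reversed */
    }

    int main(void)
    {
        char out[6] = {0};
        reverse_into("hello", out, 5);
        assert(memcmp(out, "olleh", 5) == 0);
        return 0;
    }
    ```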
     
  7. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,759
    Agreed, at 16 kB stackalloc is probably fine (I'm not aware of the current stack limitations, but I believe it was 1 MB for a long time on Windows?), but not everyone is comfortable using unsafe code.

    Ideally, once Span is supported (I don't think it is yet in 2020.3, right?), this becomes a lot more comfortable.
     
    Last edited: Sep 22, 2021
  8. pk1234dva

    pk1234dva

    Joined:
    May 27, 2019
    Posts:
    84
    Thanks guys. Yeah, I guess stackalloc is fine; it just feels a bit off to me. I was under the impression that 16 kB is not that small for a stack allocation, but I might be going off various posts I've seen on the internet that are too old - technology has improved since then.
     
  9. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    If you don't actually need the memclear of stackalloc, you can decorate methods (and probably also jobs - haven't tested it yet) with Unity.Burst.CompilerServices.SkipLocalsInitAttribute since Burst 1.5, for a considerable performance gain - 128 loop iterations saved in this case, if compiling for AVX and if LLVM unrolls the loop 4 times.

    The modified version would be:
    Code (CSharp):
    [BurstCompile]
    public unsafe struct TestJob : IJobParallelFor
    {
        [SkipLocalsInit]
        public void Execute(int index)
        {
            byte* bytes = stackalloc byte[1024 * 16];
        }
    }
     
  10. Luxxuor

    Luxxuor

    Joined:
    Jul 18, 2019
    Posts:
    89
    While not zeroing locals can improve performance, it then becomes your responsibility to make sure no array element reads garbage, uninitialized memory - so do be careful.
    When in doubt, profile.
     
  11. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    I think this goes without saying.
    But if you fill that array on the stack with data in a method yourself, the zeroing-out still happens beforehand. Why? I don't know. But the Burst User Guide and disassemblies suggest that it does happen.

    There is no need for that, though. With MOVAPS memory, register having a throughput of 1 instruction per clock cycle and 16 kB having to be zeroed out, given a register size of 32 bytes (AVX), it will take about 512 clock cycles, or 512 / 3 ≈ 170 nanoseconds at 3 GHz. Basically unnoticeable, but still a waste of performance and code size, which might add up.
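    Those numbers check out. A quick sketch of the arithmetic, under the assumptions stated above (one 32-byte AVX store retired per cycle, 3 GHz clock):

    ```c
    #include <assert.h>

    int main(void)
    {
        int bytes = 16 * 1024;       /* 16 kB to zero out */
        int store_width = 32;        /* one 256-bit AVX register per store */

        int stores = bytes / store_width;  /* = cycles, at 1 store per cycle */
        double ns = stores / 3.0;          /* 3 GHz means 3 cycles per nanosecond */

        assert(stores == 512);             /* 512 stores / cycles */
        assert(ns > 170.0 && ns < 171.0);  /* ~170 ns, as stated in the thread */
        return 0;
    }
    ```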
     
  12. Luxxuor

    Luxxuor

    Joined:
    Jul 18, 2019
    Posts:
    89
    I think they have to memclear it to stay compliant with the language rules of C#.

    You're absolutely right; I just wanted to make clear that the potential gain might not be worth the potential trouble of reading uninitialized memory (especially when coming back after a few weeks/months and changing the method's logic) - I've been burned by that before myself in C/C++ land :).

    BTW: Do I understand correctly from your numbers that memclear internally uses single scalar instructions?
     
  13. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    Yeah, probably. Even though first setting anything to 0 and then to 1 makes the zeroing a dead operation. Those usually get compiled away, but that's not the case with stackalloc.

    I feel ya. Luckily, [SkipLocalsInit] sits at the very top of a method, so it's easy to spot. Pros and cons ;)

    Nope. MOVAPS is an x86 SIMD instruction ("move aligned packed singles" - move a vector of floats). I also mentioned a register size of 32 bytes, which is 256 bits and thus 4x the size of a 64-bit general-purpose register; that requires support for the AVX ("Advanced Vector eXtensions") instruction set from 2011. This is the optimal case for Burst-generated code - AVX-512 (512-bit vectors) is currently not supported by Burst. If you compile for ARM NEON or x86 SSE ("Streaming SIMD Extensions"), you have 128-bit registers, which at least doubles execution time (~340 nanoseconds).
    And since LLVM unrolls loops aggressively, the memclear code looks something like this in pseudocode:

    Code (CSharp):
    v256 ZERO_REGISTER = zero out 256 bit register
    ulong ADDRESS_BASE
    ulong LAST_ADDRESS = ADDRESS_BASE + STACKALLOC_BYTES

    LOOP:
    copy ZERO_REGISTER to address (ADDRESS_BASE + 0)
    copy ZERO_REGISTER to address (ADDRESS_BASE + 32)
    copy ZERO_REGISTER to address (ADDRESS_BASE + 64)
    copy ZERO_REGISTER to address (ADDRESS_BASE + 96)

    ADDRESS_BASE += 128
    if ADDRESS_BASE != LAST_ADDRESS goto LOOP
    ... of course leaving out the code for residuals (when fewer than 128 bytes remain to be zeroed out after the loop, which happens in a scalar fashion).
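    A hedged C rendition of that unrolled loop, including the scalar residual tail (unrolled_clear is an invented name, and plain memset calls stand in for the 32-byte vector stores, so this shows the structure, not the actual instructions):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Clears `len` bytes in 128-byte unrolled chunks of four 32-byte stores,
     * then handles any remaining bytes in a scalar tail, mirroring the pseudocode. */
    static void unrolled_clear(unsigned char *p, size_t len)
    {
        unsigned char *end128 = p + (len & ~(size_t)127);  /* largest multiple of 128 */
        while (p != end128)
        {
            memset(p +  0, 0, 32);  /* stand-in for "copy ZERO_REGISTER to address" */
            memset(p + 32, 0, 32);
            memset(p + 64, 0, 32);
            memset(p + 96, 0, 32);
            p += 128;
        }
        for (size_t i = 0; i < (len & 127); i++)  /* scalar residuals */
            p[i] = 0;
    }

    int main(void)
    {
        unsigned char buf[16 * 1024 + 37];  /* odd size to exercise the residual tail */
        memset(buf, 0xAB, sizeof buf);
        unrolled_clear(buf, sizeof buf);
        for (size_t i = 0; i < sizeof buf; i++)
            assert(buf[i] == 0);
        return 0;
    }
    ```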
     
  14. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845
    Is there an instruction to mark some of the pages as zeroed in the MMU instead of zeroing them manually?
     
  15. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    I'm 99% sure there is not. It sounds more like an OS responsibility, if anything - the only memory-related hardware instructions I know of are a significant number of cache instructions.
     
  16. burningmime

    burningmime

    Joined:
    Jan 25, 2014
    Posts:
    845