[Question] Questions about multithreaded vs single-threaded performance with large memory access

Discussion in 'Entity Component System' started by funkyCoty, Jun 15, 2021.

  1. funkyCoty

    Joined: May 22, 2018 | Posts: 679

    Hey there.

    This isn't necessarily specific to the Unity Job System, but it's something I've noticed quite a bit while using it. Quite often I'll have a job that does something simple, such as clearing memory. I've found that a single thread performing one UnsafeUtility.MemClear() can actually beat multiple threads each running a chunked MemClear()! My first assumption is that each thread ends up making the other threads wait on data access. I'm not super knowledgeable about how this works under the hood, so can anyone point me to resources that explain this behavior?

    Here's an example screenshot: the same job, once multithreaded and once single-threaded. Only a 50% saving despite having 10 threads.

    single: [screenshot: Unity_2021-06-15_10-49-22.png]

    multi: [screenshot: Unity_2021-06-15_10-48-42.png]
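    For anyone wanting to reproduce the comparison outside Unity, here is a rough plain .NET sketch. It assumes .NET 6 or later with top-level statements, and `Span<byte>.Clear()` plus `Parallel.For` stand in for `UnsafeUtility.MemClear` and the job system, so it is an analogue rather than the job code from this thread. It clears one large buffer on a single thread, then again with one contiguous, non-overlapping chunk per logical core. It is a single unwarmed run, so the first `Parallel.For` call also pays thread-pool start-up cost; a real measurement should repeat each variant several times.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Assumed buffer size: 100 MiB, far larger than typical CPU caches.
const int SizeBytes = 100 * 1024 * 1024;
// One chunk per logical core (an arbitrary choice for this sketch).
int chunkCount = Environment.ProcessorCount;
byte[] buffer = new byte[SizeBytes];

// Single-threaded: one Clear over the whole buffer.
buffer.AsSpan().Fill(0xFF); // make the buffer non-zero so the clear does real work
var sw = Stopwatch.StartNew();
buffer.AsSpan().Clear();
sw.Stop();
Console.WriteLine($"single:   {sw.Elapsed.TotalMilliseconds:F2} ms");

// Multithreaded: chunk i covers [i * size / n, (i + 1) * size / n), so the
// chunks never overlap and together cover the whole buffer.
buffer.AsSpan().Fill(0xFF);
sw.Restart();
Parallel.For(0, chunkCount, i =>
{
    int start = (int)((long)SizeBytes * i / chunkCount);
    int end = (int)((long)SizeBytes * (i + 1) / chunkCount);
    buffer.AsSpan(start, end - start).Clear();
});
sw.Stop();
Console.WriteLine($"parallel: {sw.Elapsed.TotalMilliseconds:F2} ms ({chunkCount} chunks)");
```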
     
  2. mikaelK

    Joined: Oct 2, 2013 | Posts: 281

    It's pretty much down to how you lay out your jobs, I think.
    For example, sometimes a job doesn't need to wait for another job to finish.
    But yeah, you're right: you can write your code so that it's very inefficient, or just throw extra threads at something to make it faster.

    You should read Unity's Entities documentation (search for "Unity ECS").

    You can use command buffers to defer work that requires a sync point, for example to the end of the frame. That way you don't need sync points in the middle of the frame, which would require all the associated threads to complete before continuing.
     
  3. funkyCoty

    Joined: May 22, 2018 | Posts: 679

    In the example I posted, it's the same job. Just one job, with no waits anywhere outside of that job itself.
     
  4. DreamingImLatios

    Joined: Jun 3, 2017 | Posts: 3,983

    So multithreaded summed time should never be less than the single-threaded counterpart. And there is an overhead for parallel jobs that can make it not worthwhile for certain tasks (like clearing memory). However, when the parallel job is 4x slower in summed time and the minimum worker thread is spending over a millisecond on it, that makes me believe something else is wrong. Safety checks? Jobs Debugger? Random accesses (especially writes)? Atomic operations? Parallel container instead of a single-threaded container? Volume of work dependent on framerate?
     
  5. mikaelK

    Joined: Oct 2, 2013 | Posts: 281

    Well, in that case it's probably down to how the job is written. If you make your threads access every other element, then there needs to be some kind of locking mechanism to prevent other threads from reading and writing to those.

    If the job is laid out so that you know which thread is accessing which part, you can remove those safety checks.
    It's impossible to say why the parallel version is slower without seeing any code.
     
  6. funkyCoty

    Joined: May 22, 2018 | Posts: 679

    It's similar behavior with the UnsafeUtility.MemClear() example I had. I assumed that if the writes were far enough apart in memory, it should be num_threads times faster, but that hasn't seemed to be true at all in my tests.

    A simple job in pseudo-code:

    Code (CSharp):
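    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using Unity.Jobs;
    // unsafe structs require "Allow 'unsafe' Code" in the project/assembly settings

    // Single-threaded version. Timing: 5 ms
    unsafe struct ClearJob : IJob
    {
        public NativeArray<byte> myData;

        public void Execute()
        {
            // one MemClear over the whole array (length in bytes)
            UnsafeUtility.MemClear(myData.GetUnsafePtr(), myData.Length);
        }
    }

    // Parallel version, one index per worker thread. Timing: 4 ms per job, 20 ms total
    unsafe struct ClearJobParallel : IJobParallelFor
    {
        public NativeArray<byte> myData;
        public int jobThreadCount; // number of chunks the array is split into

        public void Execute(int index)
        {
            int chunk = myData.Length / jobThreadCount;
            // offset math as originally posted: always starts at the beginning of
            // the array and clears (index + 1) * chunk bytes
            UnsafeUtility.MemClear(myData.GetUnsafePtr(), chunk + index * chunk);
        }
    }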
     
  7. amarcolina

    Joined: Jun 19, 2014 | Posts: 65

    You might actually be running into RAM bandwidth limits! It would be worthwhile taking a look at what write speed your RAM is rated for, and seeing what your maximum expected throughput might be.

    I ran a test where I cleared 100 MB of memory. I was able to do so in ~7.5 ms single-threaded, but multithreaded it seems to cap out at ~3.5 ms. 100 MB in 3.5 ms gives a write speed of ~28,000 MB/s, which seems to line up with the average RAM speeds I'm finding online. I'd still love to know why a single thread can't execute fast enough to reach the memory bandwidth limit, but I'm having a hard time finding detailed resources on this subject.
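    The throughput figures above come from dividing the bytes written by the elapsed time; the single-threaded figure uses the same formula with the ~7.5 ms timing:

```latex
\text{write bandwidth} = \frac{\text{bytes written}}{\text{elapsed time}},
\qquad
\frac{100\ \text{MB}}{3.5\ \text{ms}} = \frac{100\ \text{MB}}{0.0035\ \text{s}} \approx 28{,}600\ \text{MB/s},
\qquad
\frac{100\ \text{MB}}{7.5\ \text{ms}} \approx 13{,}300\ \text{MB/s}
```

    So in this test the parallel clear reached only about 7.5 / 3.5 ≈ 2.1 times the single-threaded throughput.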
     
  8. mikaelK

    Joined: Oct 2, 2013 | Posts: 281

    If this is a write-only operation, then declare that on the array. [ReadOnly] and [WriteOnly] tell the compiler the purpose of the native array, so it can optimise for read or write access.

    It can speed things up a lot, since the compiler can skip checks related to reading or writing.
     
  9. mikaelK

    Joined: Oct 2, 2013 | Posts: 281

    Theoretically, you should be able to read and write as fast as your CPU and memory can handle. With more cores, you might also need faster memory.
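    One simplified way to put this: assume each thread can write at most b_1 bytes/s on its own and the memory system tops out at B bytes/s overall (both symbols are introduced here for illustration, and scheduling overhead is ignored). Clearing S bytes with n threads then takes at least:

```latex
T_n \;\ge\; \max\!\left(\frac{S}{n\,b_1},\; \frac{S}{B}\right),
\qquad
\text{so adding threads stops helping once } n \gtrsim \frac{B}{b_1}.
```

    Plugging in amarcolina's 100 MB test (~7.5 ms single-threaded, ~3.5 ms multithreaded) gives B / b_1 ≈ 7.5 / 3.5 ≈ 2.1, which under this model means the clear is already bandwidth-bound at two or three threads on that machine.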

    Also, you're missing [BurstCompile]!

    Try something like this:


    Code (CSharp):
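    using Unity.Burst;
    using Unity.Collections;
    using Unity.Collections.LowLevel.Unsafe;
    using Unity.Jobs;

    [BurstCompile]
    unsafe struct ClearJob : IJob
    {
        [WriteOnly] public NativeArray<byte> myData;

        public void Execute()
        {
            UnsafeUtility.MemClear(myData.GetUnsafePtr(), myData.Length); // length in bytes
        }
    }

    [BurstCompile]
    unsafe struct ClearJobParallel : IJobParallelFor
    {
        [WriteOnly] public NativeArray<byte> myData;
        public int jobThreadCount;

        public void Execute(int index)
        {
            int chunk = myData.Length / jobThreadCount;
            // offset math kept from the original post
            UnsafeUtility.MemClear(myData.GetUnsafePtr(), chunk + index * chunk);
        }
    }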
     
    Last edited: Jun 16, 2021
  10. calabi

    Joined: Oct 29, 2009 | Posts: 232

    This seems like pretty standard behaviour in my experience. The real (wall-clock) time of the multithreaded version is actually quicker, but the overall summed time is not. I was trying to figure out why this was the case a while ago; I'm guessing it's the overhead of starting and organising all the threads or something. It's still worth using multiple threads, I think, because it's quicker and leaves more room for other work.
     
  11. mikaelK

    Joined: Oct 2, 2013 | Posts: 281

    You can also try Deep Profile and the Hierarchy view in the Profiler. That will tell you exactly what is slowing down the MemClear.
    Usually there is some overhead from using multiple threads and waiting, but it can be minimized by structuring the jobs cleverly.
     
  12. DreamingImLatios

    Joined: Jun 3, 2017 | Posts: 3,983

    Alright. I have to ask: How big is the array you are trying to clear? @amarcolina may be right about this being a system memory bottleneck. What processor? What RAM? What channel configuration?
     
  13. M_R

    Joined: Apr 15, 2015 | Posts: 558

    Code (CSharp):
    MemClear(myData.UnsafePtr(), length / job_thread_count + index * (length / job_thread_count))
    is not doing what you think.
    index = 0 will clear its own segment (from ptr to ptr + length / jobs), but each subsequent job will clear the previous segments again, from the start of the array up to the end of its own segment (and index = job_thread_count - 1 clears the entire array on its own).
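    To quantify this: if the array holds S bytes split across n parallel indices (assuming S is divisible by n), the quoted expression makes index i clear (i + 1) S / n bytes starting at ptr, so the total number of bytes written is

```latex
\sum_{i=0}^{n-1} \frac{(i+1)\,S}{n} \;=\; \frac{S}{n}\cdot\frac{n(n+1)}{2} \;=\; \frac{(n+1)\,S}{2}
```

    With n = 10 that is 5.5 times the size of the array, which on its own would push the summed time of the parallel job well above the single-threaded one, if the real code behaved like the quoted version.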

    It should be something like
    (ptr + index * batch_size, batch_size)
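    A minimal plain C# sketch of that offset calculation, with one addition not in the post: the last index also takes the remainder when the length is not a multiple of the chunk count. `ChunkRange` is a hypothetical helper, not a Unity API; in a Unity job the resulting pair could be used as `UnsafeUtility.MemClear((byte*)myData.GetUnsafePtr() + offset, size)`. The loop at the end checks that every byte is covered exactly once.

```csharp
using System;

// Hypothetical helper (not a Unity API): the byte range that parallel index
// `index` should clear when `length` bytes are split into `chunkCount` pieces.
// Same layout as (ptr + index * batch_size, batch_size), except that the last
// chunk also absorbs the remainder when length % chunkCount != 0.
static (int Offset, int Size) ChunkRange(int length, int chunkCount, int index)
{
    int batchSize = length / chunkCount;
    int offset = index * batchSize;
    int size = index == chunkCount - 1 ? length - offset : batchSize;
    return (offset, size);
}

// Check: split 1000 bytes into 7 chunks and count how often each byte is hit.
int total = 1000, chunks = 7;
var hits = new int[total];
for (int i = 0; i < chunks; i++)
{
    var (offset, size) = ChunkRange(total, chunks, i);
    Console.WriteLine($"index {i}: offset {offset}, size {size}");
    for (int b = offset; b < offset + size; b++) hits[b]++;
}
Console.WriteLine(Array.TrueForAll(hits, h => h == 1)); // True
```

    Unlike the quoted expression, each index writes only its own slice, so the total work stays at exactly S bytes regardless of the number of indices.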
     
  14. funkyCoty

    Joined: May 22, 2018 | Posts: 679

    Sorry, yeah, that was just rough pseudocode I typed on my phone. What you described is closer to what I actually have.