Question Can someone explain why a job is MUCH faster when run in a performance test?

Discussion in 'Entity Component System' started by Enzi, Apr 28, 2023.

  1. Enzi (Joined: Jan 28, 2013; Posts: 910)
    I'm hard stuck on explaining this and the explanations I have, well, I don't know if they are true.
    Basically I want to hear them from someone else.

    Here's what's happening. I've been writing performance tests for jobs. To keep them as close as possible to the runtime, I set up a World, load subscenes, use the same data, etc.

    However, the job I'm testing runs so much faster in the performance test and the performance test itself has a weird characteristic.


    What you see here is such a performance test, captured in edit mode.
    Inside the measured block, System.Update is called and the job is scheduled in parallel. At first the job runs really slowly, around 4 ms. That's to be expected because memory is allocated.
    Then the job gets faster and faster, to the point of being ridiculously fast.
    It goes from 4 ms to 2.6, then 2, and reaches a minimum time of 1.2 ms. At runtime the same job runs constantly at around 3 ms.

    So, yeah, can someone explain what's going on here? How is it possible that a job that takes 3 ms at runtime can reach 1.2 ms after a while?
    Is the CPU just getting better at calculating the job with predictions, cache hits and what have you?
    If so, how could I utilize this at runtime, or is it out of the question because at runtime I obviously can't just run the same job over and over again?

    TBH, I've never seen such a performance characteristic. That's why I'm really confused about it.
    Also, I've checked my code a hundred times to make sure that nothing is simply being skipped, leading to better results over time.

    The test itself uses:
    .WarmupCount(100)
    .MeasurementCount(500)

    As these results are kind of useless, can I make my performance test better? Less warmup, more warmup? What would make sense?
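One way to judge whether the warmup is long enough is to compare the start and end of a measurement run: if the tail is still much faster than the head, the samples have not reached a steady state. The following is a minimal sketch in Python, not Unity code. The `summarize` helper and its `window` parameter are hypothetical, and the sample values are only the four timings quoted above, not a full measurement series.

```python
from statistics import median


def summarize(samples_ms, window=2):
    """Summarize per-iteration timings and expose head-vs-tail drift.

    `window` is the number of samples taken from each end of the run.
    """
    if len(samples_ms) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    head = median(samples_ms[:window])   # earliest measurements
    tail = median(samples_ms[-window:])  # latest measurements
    return {
        "min": min(samples_ms),
        "median": median(samples_ms),
        "head_median": head,
        "tail_median": tail,
        # A ratio well above 1 means the job was still speeding up.
        "speedup": head / tail,
    }


# Illustrative input: the four timings mentioned in the post, in ms.
samples = [4.0, 2.6, 2.0, 1.2]
stats = summarize(samples)
print(stats)
```

A head-to-tail ratio this far from 1 suggests reporting the distribution (or the tail only) rather than a single average, since the average mixes cold and warm iterations.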

    Thanks for any answers! :)
     
  2. DreamingImLatios (Joined: Jun 3, 2017; Posts: 3,993)
    Without seeing code, there could be a whole universe of things causing this.
    • If your simulation is non-deterministic or you have inter-thread data, it might take a few iterations for the different cores to get in a rhythm.
    • Have you checked your clock speeds and thermals? Going from idle to heavy load might cause some power regulation circuitry to try and make sense of things and that might take a little bit before it gets into a good steady-state. You can check this by scheduling a heavy job or several right before your test and see if that impacts things.
    • It could be specific to your CPU's cache eviction pattern. The first few iterations it might be evicting the wrong cache lines.
    • It could be your OS scheduler reacting to the sudden change in load.
    • It could be a clock speed or thermal issue with your system memory.
    • It could be an issue you only see today and never again.
    • It could be an issue that disappears after an update.
    • It could be a bug in your code.
    • It could be something not mentioned on this list.
     
  3. elliotc-unity (Unity Technologies; Joined: Nov 5, 2015; Posts: 228)
    My best guess, looking at this plus your Discord posts on the same subject, is that all your data and code stay resident in cache when the performance test runs many iterations back to back. At runtime, other systems run in between and evict your data from the cache as they do their own thing, so your runtime performance is closer to the early iterations of your performance test.

    But basically, I would likely just take the runtime version as the more accurate estimate of the two.

    Also I of course agree with Latios that it would be easier to diagnose with the code.
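The eviction argument above can be illustrated with a toy model. This is a sketch under strong assumptions: a single fully associative LRU cache, made-up sizes (100 lines of capacity, 80 lines per workload), and hypothetical names (`LRUCache`, `run`, `job`, `others`). Real CPU caches are set-associative, multi-level and not strictly LRU, so only the qualitative behaviour carries over.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Touch one cache line and return True on a hit."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)  # mark as most recently used
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the LRU line
        return hit


def run(cache, lines):
    """Touch every line once and return the number of misses."""
    return sum(0 if cache.access(line) else 1 for line in lines)


job = range(0, 80)          # lines touched by the job under test
others = range(1000, 1080)  # lines touched by other systems

# Performance test: the same job runs back to back.
cache = LRUCache(100)
back_to_back = [run(cache, job) for _ in range(4)]

# Runtime: other systems run between two executions of the job.
cache = LRUCache(100)
interleaved = []
for _ in range(4):
    interleaved.append(run(cache, job))
    run(cache, others)

print(back_to_back)  # [80, 0, 0, 0]
print(interleaved)   # [80, 80, 80, 80]
```

In this model the back-to-back case only misses on the first iteration, while the interleaved case misses on every line every time, because the combined working set (160 lines) exceeds the capacity (100 lines).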
     
  4. Unifikation (Joined: Jan 4, 2023; Posts: 1,068)
    Do you think you could reason about why performance mode MIGHT be faster?

    Just in case the justifiable suspicions of the OP might have some merit...
     
  5. Enzi (Joined: Jan 28, 2013; Posts: 910)
    This is the same stress test at runtime with most systems turned off.
    [Image: upload_2023-4-28_22-35-0.png]


    And this is another stress test that runs SpellCastJob in Setup and then the same CalculateSpellJobFromParallelList.
    The same thing can be noticed here. It runs faster after some iterations.


    SpellCastJob goes down to 1.4ms in perf test from 2.75ms at runtime.
    CalculateSpellJobFromParallelList goes down to 2ms in perf test from 3.3ms at runtime.

    I can reproduce the same results over and over again so I can rule out background tasks.
    The HWiNFO tool showed me the same multipliers in the perf test and at runtime. The perf test is just a blip, but the CPU (Ryzen 9 5900X) still spikes in GHz.
    Another thing I noticed when looking at the main thread is that SpellCastJob is hardly worked on there, but the CalculateJob is. So 23 worker threads and 1 main thread are all fully running the CalculateJob.

    If it's cache eviction or some other CPU voodoo, that's stuff I have little knowledge about. Can you even do something about it? Finding out that I could save nearly 1 ms on those jobs is just too much for me to ignore. :D

    Going back to the runtime test where I disabled all other systems: what could possibly, and constantly, screw up the cache like this? The perf test is just one frame, but still a lot of jobs are scheduled (warmup 100, measure 500). I'll look into running it over more than one frame, but I don't know if the Unity Performance Framework even allows that.

    If you are asking whether the same thing happens in other perf tests: no, none that I've found. They run the jobs at a very constant speed and the deviation is minimal. Better said, they don't speed up at all.

    SpellCastJob runs primarily on chunk data, but it does a lookup to get a blob with spell data. I imagine that this blob is the most likely thing to stay in cache and speed the job up dramatically. (I'll test this out and eliminate the cache miss.)
    CalculateJob does the same thing with looking up the blob.
     
  6. Unifikation (Joined: Jan 4, 2023; Posts: 1,068)
    Did Samsung create the Performance Test?
     
  7. DreamingImLatios (Joined: Jun 3, 2017; Posts: 3,993)
    What is stopping you from learning? I suspect you would benefit a lot from having a deeper understanding of those sorts of things.

    Not directly, but often it will point you to inefficiencies in your data access.

    If it's L3, then anything that happens on any CPU core between jobs could be interfering. That's a fairly big time gap.
    Perhaps you may want to see if running the same job multiple times in a row the same frame at runtime exhibits the same behavior as the performance test.
     
  8. Unifikation (Joined: Jan 4, 2023; Posts: 1,068)

    "Hardcore hobbyist game developer, animator, writer, and musician"

    Jazz?
     
  9. Enzi (Joined: Jan 28, 2013; Posts: 910)
    Oh, nothing. I just find it hard to find good resources that cover all this, so I learn as I run into different things. If you happen to know some, please tell. :) Also, I think we talked about AMD uProf once? Now that I need a bit more low-level info on things, uProf seems really hard to read or get into. Did it get better for you, or are you still profiling on an Intel?

    Yeah, good idea.

    edit: Yes, same thing as in the perf test. At least some expectation was confirmed. :D
     
    Last edited: Apr 29, 2023
  10. DreamingImLatios (Joined: Jun 3, 2017; Posts: 3,993)
    A good "computer org and architecture" textbook can get you a fair ways. Also, I believe the Unity Learn for ECS had a link to a whitepaper from 2008 that went into deep detail about caches and hardware prefetching. For example, a lot of people explain cache prefetching like railroad tracks for a train, but if you actually read that paper, you'll learn that it is actually more like a tripline that only takes a couple of touches to set off and runs only for about 4 kB. And there are multiple of these for a single CPU core.

    Still rocking an X99 5820K.

    Yeah. That feels like L3 caching both data and instructions. Never underestimate the benefits of temporal locality in L3. If you are stuck with random accesses due to the nature of the algorithm, that's still typically something you can bias towards and get real speedups.

    More like hard rock and metal, but replace the singer with a CGI artist. :p
     
  11. Enzi (Joined: Jan 28, 2013; Posts: 910)
    I actually can't believe I have found an answer.
    I turned everything off: every system and the main camera (rendering), and I closed every background app.
    [Image: upload_2023-4-29_20-43-52.png]
    Perf test numbers. Yay!

    Now what happens if I just enable the camera again?
    [Image: upload_2023-4-29_20-44-29.png]

    Ok, camera off and just Unity.Physics enabled?
    [Image: upload_2023-4-29_20-46-44.png]

    So, there you have it. Render thread + Unity.Physics tank it.

    edit: I just want to make it clear that this is not the fault of the render thread or Unity.Physics. I can get the same result by enabling one of my own disabled systems. It's just something that happens once you push too much data. The chunk data of the spell casters in the stress test is 140 MB. I should have made that clear in the first post.
    Still, very interesting behaviour. I'm not sure how I can exploit this knowledge, but at least I have a better idea of what's going on.

    Oh, and another thing, because tertle asked me what happens when I reduce the amount to something like 50k: the chunks take up ~28 MB then, and runtime performance is the same as the min(!) perf test performance.
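The two working-set sizes can be put next to the last-level cache of the CPU named earlier in the thread. This is a rough capacity comparison in Python, not a model of cache behaviour. It assumes the published Ryzen 9 5900X figures of 64 MB of L3 in total, split into 32 MB per CCD, and treats "50k" as exactly 50,000 entities and MB as decimal megabytes; the `fits` helper and the dictionary keys are hypothetical names.

```python
# Assumed from the published Ryzen 9 5900X specification.
L3_TOTAL_MB = 64    # 2 CCDs x 32 MB
L3_PER_CCD_MB = 32  # L3 reachable from the cores of one CCD


def fits(working_set_mb, cache_mb):
    """Return True if the working set is no larger than the cache."""
    return working_set_mb <= cache_mb


# Chunk data sizes as stated in the thread.
cases = {"stress test": 140, "reduced to ~50k": 28}

report = {
    name: {
        "fits_total_l3": fits(size_mb, L3_TOTAL_MB),
        "fits_one_ccd": fits(size_mb, L3_PER_CCD_MB),
        "ratio_to_total_l3": size_mb / L3_TOTAL_MB,
    }
    for name, size_mb in cases.items()
}

# Approximate chunk bytes per entity, assuming exactly 50,000 entities.
bytes_per_entity = 28 * 1_000_000 / 50_000

for name, row in report.items():
    print(name, row)
print("approx. bytes per entity:", bytes_per_entity)
```

Under these assumptions the 140 MB stress test is more than twice the total L3, while the 28 MB case is below even a single CCD's share. That is consistent with the reduced test matching the best perf test numbers, although it does not prove that capacity is the only factor.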
     
    Last edited: Apr 29, 2023