Question: How many Unity instances to run to optimize training speeds

Discussion in 'ML-Agents' started by mcdenyer, Apr 20, 2021.

  1. mcdenyer

    Joined:
    Mar 4, 2014
    Posts:
    48
    So I am curious whether anyone has suggestions on how to determine how many agents per instance, and how many instances, are optimal for training speed. My PC build is kind of aged, so I am trying to maximize what I can get out of it for training.
     
  2. mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    For CPU training, I'd say try as many as you like and check the overall CPU load. If there's a single agent in an environment, I usually duplicate it so there are 9 agents in a scene. Then I train with 8 headless/no-graphics executables, provided the agents don't have camera sensors. So that's 72 agents total, and my CPU (8 cores, also a couple of years old) is busy somewhere between 50% and 100%, depending on environment specifics like physics simulation. I think the default training time-scale is 20, but it's worth checking whether the engine can actually handle that without any dropouts. If I do physics-heavy stuff, I'd rather decrease the time-scale a bit to be on the safe side.
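    For reference, in recent ML-Agents releases that kind of setup roughly maps to the env_settings and engine_settings sections of the trainer config. This is just a sketch: "MyBehavior", the env_path and all numbers are placeholders, and the exact schema can differ between releases.

        # Rough sketch of a trainer config for parallel headless training (placeholders throughout).
        behaviors:
          MyBehavior:
            trainer_type: ppo
        env_settings:
          env_path: Builds/MyEnv   # path to the built executable (placeholder)
          num_envs: 8              # number of environment executables run in parallel
        engine_settings:
          no_graphics: true        # headless; only works if agents have no camera sensors
          time_scale: 20           # the default; lower it for physics-heavy environments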
     
    mcdenyer likes this.
  3. mcdenyer

    Joined:
    Mar 4, 2014
    Posts:
    48
    I am doing fairly heavy physics stuff. What do you mean by dropouts? My training looks very choppy, but the end result still turns out nice. All my heavier physics updates are in FixedUpdate.
     
  4. mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Just making sure there are no physics glitches at higher time-scales, e.g. checking that fast collisions register reliably.
     
    mcdenyer likes this.
  5. unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    26
    I just ran some benchmarks on my environment a few days ago. I reduced max_steps in config.yaml to a small number so the training would "finish" after a few minutes. I could then look at the timings.json file to see how long the training took.
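    If it helps, that cap is just the per-behavior max_steps value in the trainer config; a minimal sketch, with "MyBehavior" and the number as placeholders:

        # Benchmark sketch: cap training at a small step count so the run
        # "finishes" quickly and timings.json can be compared between settings.
        behaviors:
          MyBehavior:
            trainer_type: ppo
            max_steps: 50000   # placeholder; restore the real value for actual training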

    Of course you can change both "agents per environment" and "number of environments", giving you many possible combinations.

    Luckily, from what I found, the two are mostly independent of each other from a "step throughput" perspective: if 6 environments perform best with 32 agents each, then 6 environments will also give me the fastest times with 64 or 16 agents (as opposed to needing to increase num_envs when I reduce the number of agents per environment).


    Maybe @mbaske can share if that aligns with what he has seen so far.

    I can then first find the best value for num_envs and then tune the number of agents per environment afterwards (or vice versa, I guess).
    But my CPU is quite powerful, so you may run into a different bottleneck.

    Also, having *too many* environments has one weird effect on performance for me: performance doesn't fall off steeply when you add one environment too many; it's just a subtle change in training times:
    1 env: slow
    2 envs: faster, maybe 3/4 of single-env time
    4 envs: faster, maybe 2/3 of single-env time
    8 envs: faster, maybe 1/2 of single-env time
    10 envs: slightly slower, maybe 52% of single-env time
    12 envs: 54% of single-env time


    But at these scales you start to run into a different problem anyway: you may be tempted to think "more is better for parallelization" and come up with situations like this:
    6 environments
    32 agents each
    time_horizon: 100
    buffer_size: 5000

    ML-Agents only adds an agent's trajectory to the buffer once it reaches time_horizon steps - or once the agent is "done".

    If the agents never terminate early, then for the numbers above:
    - the buffer is empty for some time
    - after 100 steps, every agent has reached time_horizon steps
    - 6 x 32 = 192 agents add their 100-step experiences to the buffer
    - the buffer now has 19200 entries
    - this is much more than the intended 5000 steps

    This will still work of course, but since every policy update uses many more steps than intended:
    - you don't update your policy as often as you may think during training
    - your benchmark results can't be compared fairly

    You can either do the math to see whether you have too many agents for your particular buffer_size, or you can check the "count" for policy updates in your timings.json file.
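    As a rough sketch of that math for the situation above ("MyBehavior" is a placeholder, and the exact schema depends on your ML-Agents version):

        # 6 envs x 32 agents x 100-step trajectories all arrive at roughly the same
        # time, so one "fill" of the buffer is 6 * 32 * 100 = 19200 steps,
        # well past the intended buffer_size of 5000.
        behaviors:
          MyBehavior:
            trainer_type: ppo
            time_horizon: 100
            hyperparameters:
              buffer_size: 5000   # each policy update here actually sees ~19200 steps
        env_settings:
          num_envs: 6             # with 32 agents per environment set up in the scene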

    A few more things I find useful for performance optimization (config sketch below):
    - Disable threading in your config file. There is a PR on GitHub saying it no longer has any benefit with PyTorch, and disabling it gives you cleaner numbers in timings.json.
    - Watch out for batch_size and your GPU RAM. Task Manager shows GPU RAM usage; if it comes within 200 MB of the maximum, reduce your batch_size. I feel like that gives better performance (although that may just be subjective).
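    For what it's worth, both of those live in the per-behavior section of the config, roughly like this ("MyBehavior" and the numbers are placeholders):

        # Sketch of the two tweaks above (placeholders throughout).
        behaviors:
          MyBehavior:
            trainer_type: ppo
            threaded: false      # cleaner numbers in timings.json; little benefit with PyTorch anyway
            hyperparameters:
              batch_size: 512    # reduce this if GPU RAM gets within ~200 MB of its limit
              buffer_size: 5000  # buffer_size is typically a multiple of batch_size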
     
    mcdenyer likes this.