Using multiple envs not speeding up training and not using more than 30% of CPU

Discussion in 'ML-Agents' started by Evercloud, May 3, 2020.

  1. Evercloud

    Joined: Apr 29, 2013
    Posts: 15
    Hello everybody,
    I am developing an arcade soccer agent and I have been training a model with Unity ML-Agents 1.0.0 and Unity 2019.3.12f1, using PPO with self-play and curiosity.
    Previously I trained in the Unity Editor, but recently I decided to try multiple instances to improve performance: I followed the documentation instructions to build the environment and launched training with 8-12-16-32-64 instances, but that produced only a small performance increase (mainly due to the headless env), and CPU usage stays at roughly 25-30% regardless of the values of --num-envs and --time-scale I set (on an AMD 2700X).
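    For reference, the command I'm launching looks roughly like this (the config file, build path and run id here are placeholders):

        mlagents-learn config/trainer_config.yaml --env=Builds/Soccer --num-envs=16 --no-graphics --run-id=soccer_selfplay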
    I tried launching without the "--no-graphics" option to take a look, and I noticed that each instance lagged more and more as I increased the number of instances/envs, giving the impression that the same CPU budget was being split among all the instances.
    As one would expect at this point, steps per minute also stay almost the same regardless of how many envs I use: about 40,000 steps every 100 seconds.
    Is there a way to improve CPU thread usage and performance?
    I'll post some TensorBoard reports or code snippets if needed.
    Please advise :)
     
    Last edited: May 4, 2020
    Ordpers likes this.
  2. christophergoy

    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    Hi @Evercloud,
    I think TensorBoard screenshots would be helpful. Could you also give us more information about your environment: the agent's action/observation space, the reward function, and so on? All of this will help us better answer your questions.
     
  3. rz_0lento

    Joined: Oct 8, 2013
    Posts: 2,361
    I'm struggling with this as well. I'm running a 3900X here (so 12c/24t) and I get at most 30% utilization even if I use 24 envs for training (and higher env counts just time out for me). I've only tested the Food Collector sample (the non-visual one). Is there any way to make this sample, for example, max out the CPU during training?
     
    Ordpers likes this.
  4. christophergoy

    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    The problem with spinning up so many instances of Unity is that each one allocates graphics memory up front, so you will get diminishing returns rather quickly. You could duplicate the training areas in the scene (if possible) to try to use more CPU without instantiating so many instances of Unity.
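    For example, something along these lines (a minimal sketch; the prefab reference and spacing are placeholders) spawns extra copies of a training area inside one Unity instance:

        using UnityEngine;

        // Sketch: spawn extra copies of a training-area prefab side by side
        // so more agents collect experience in a single Unity instance.
        public class TrainingAreaSpawner : MonoBehaviour
        {
            public GameObject areaPrefab;  // training-area prefab (placeholder)
            public int copies = 8;         // extra areas to spawn
            public float spacing = 50f;    // keep areas apart to avoid physics overlap

            void Awake()
            {
                for (int i = 1; i <= copies; i++)
                {
                    Instantiate(areaPrefab, new Vector3(i * spacing, 0f, 0f), Quaternion.identity);
                }
            }
        }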
     
  5. Evercloud

    Joined: Apr 29, 2013
    Posts: 15
    Hey christophergoy,

    thank you for your answer. I am preparing some TensorBoard 200k-step shots for you.
    In the meantime, here is the other information:

    Agent:
    - "very arcade", fun soccer with physics
    - 4 agents per env (1 field, 2 teams of 2)
    - 228 observations: 147 for the forward-facing rays, 63 for the backward-facing rays, plus 18 observations I add manually. Through the rays the agent perceives field borders, goal positions and other players' positions, while through the manual observations I pass the ball properties and some other aspects of the controls I want the agent to be aware of (e.g. whether it is grounded, and so on)
    - 4 continuous actions: 2 to move and 2 that I floor to int to get the other controls (jump, low kick, right kick, etc.); see the sketch below
    - I add roughly +1 reward when the agent scores a goal and -1 when it concedes one, plus some very small rewards and penalties to incentivize the behaviour I want (not kicking too much, keeping close to the ball, etc.)
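    In code, the action handling looks roughly like this (a simplified sketch with placeholder names, using the float[] API of ML-Agents 1.0):

        using UnityEngine;
        using Unity.MLAgents;

        // Simplified sketch of the action mapping described above:
        // two continuous actions for movement, two floored to ints
        // for the remaining controls.
        public class SoccerPlayerAgent : Agent
        {
            public override void OnActionReceived(float[] act)
            {
                float moveX = Mathf.Clamp(act[0], -1f, 1f);
                float moveZ = Mathf.Clamp(act[1], -1f, 1f);

                int jump = Mathf.FloorToInt(Mathf.Clamp(act[2], 0f, 1.99f)); // 0 or 1
                int kick = Mathf.FloorToInt(Mathf.Clamp(act[3], 0f, 2.99f)); // 0, 1 or 2

                // Apply movement and trigger jump/kick here (game-specific).
            }
        }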

    Hyperparameters:
    I looked at the parameters used in the soccer example and at the resulting (already trained) agent model. I didn't think the results of that example fit my project, so after many test runs I decided to push the hyperparameters to the highest values suggested in the documentation, and the results improved considerably, at the expense of training speed; in particular: 512 hidden units, 3 layers, 1e-4 learning rate.
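    In trainer-config terms (the flat YAML format of ML-Agents 1.0; every value other than the three above is a placeholder), that corresponds to something like:

        SoccerPlayer:              # behavior name (placeholder)
          trainer: ppo
          hidden_units: 512
          num_layers: 3
          learning_rate: 1.0e-4
          batch_size: 2048         # placeholder
          buffer_size: 20480       # placeholder
          time_horizon: 128        # placeholder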
    That's why I'm pushing on performance, as I study how to optimize in the meantime :)

    Please keep in mind that I am not looking for a super-strong AI: I had achieved an almost unbeatable AI during my first iterations, but it was ugly to watch (stuttery, kicking all the time, etc.), hard to play against, and absolutely not rewarding. I am looking for smooth, pleasant, human-like behaviour... a basic AI that is nice to watch and not too hard to beat, which can become the bots' brain in a very small multiplayer project.
     
    Last edited: May 5, 2020
  6. Evercloud

    Joined: Apr 29, 2013
    Posts: 15
    About the project and my machine configuration:
    Unity ML-Agents 1.0.0, Unity 2019.3.12f1, PPO with self-play and curiosity.
    My PC has an AMD 2700X, 16 GB of RAM and an RTX 2070.
    I installed CUDA and cuDNN, fixing the "not recognized" alerts on launch, and now the GPU is being used during headless training, although, like the CPU, it is poorly utilized, at around 12-14%.

    You said: "You could duplicate the areas in the scene that you are training (if possible) to try and use more CPU without instantiating so many instances of Unity." I tried that too, with the same results.
    Please advise ;)
     
  7. rz_0lento

    Joined: Oct 8, 2013
    Posts: 2,361
    The Food Collector sample I mentioned already does this, but it doesn't change anything for CPU utilization, regardless of how many clones of the training area I put there or what timescale I use. I can't make a single env consume more than about 17% of my CPU on average, and as I keep adding envs, even with more modest env counts, the CPU averages around 30% load (after the initial spikes when starting the training).

    I'm curious: can Unity show an example, for instance with that Food Collector, that is capable of maxing out CPU utilization on some modern CPU? Say, an 8c/16t machine.
     
    Last edited: May 5, 2020
    Evercloud likes this.
  8. Evercloud

    Joined: Apr 29, 2013
    Posts: 15
    I'm uploading five screenshots:
    - 2 envs: cpu 18-20% 200k steps 798s
    - 4 envs: cpu 26-27% 200k steps 429s
    - 8 envs: cpu 26-27% 200k steps 406s
    - 16 envs: cpu 27-28% 200k steps 415s
    - Tensorboard recap (probably not relevant for such a small number of steps)

    *edit: 200k steps with 1 env directly in the Unity Editor take about 1000s
     

    Attached Files:

    • 2.PNG
    • 4.PNG
    • 8.PNG
    • 16.PNG
    • 200Test.PNG
    Last edited: May 6, 2020
  9. Evercloud

    Joined: Apr 29, 2013
    Posts: 15
  10. christophergoy

    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    As a game engine, Unity tries its best not to saturate the CPU. It wants to use the CPU as efficiently as possible and will only saturate it if needed. The example environments are so lightweight that it's hard to reach 100% CPU utilization with them.

    There isn't a good answer (yet) on how to accomplish this. The best answer I can give you at the moment is to increase the number of training areas in your scene, if you can, until you are using more CPU, if upping the num-envs parameter isn't helping. I'm sorry it's a disappointing answer.
     
  11. ChrissCrass

    Joined: Mar 19, 2020
    Posts: 31
    What's strange is that this bottleneck only occurs during training.

    According to rz_0lento's reports, the 8-parallel-environment setup shows high CPU usage when the agents are not being trained.
     
  12. Evercloud

    Joined: Apr 29, 2013
    Posts: 15
    @christophergoy it's not a disappointing answer at all, no problem; I would just like to understand more about how all this works. :)

    I ran more tests, as I had some free time this morning, and I was able to increase CPU usage by pushing the env count way up:
    - 40% with 8 build instances x 24 envs each
    - 60% with 16 build instances x 48 (wow) envs each

    Weirdly enough, the best steps-per-minute boost I got was with 8x24 envs and not 16x48, so there is still a lot I don't understand. Feel free to clarify that.

    Following ChrissCrass's intuitions, I have been running some tests in inference, changing the Decision Requester settings to try to understand what keeps my CPU busy. Unfortunately, I could not get more than 30% usage in any way with inference. I don't know if this is obvious, but as you set the Decision Period below 10, the AcademyFixedUpdateStepper keeps the CPU busier and busier, although without increasing overall CPU usage: I am attaching a profiler screenshot of a laggy scene at 30% CPU, with 48 envs and Decision Period set to 5 (Take Actions Between Decisions off).
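    Roughly, those Decision Requester settings map to these fields in code (a minimal sketch, assuming the ML-Agents 1.0 component API; the script name is a placeholder):

        using UnityEngine;
        using Unity.MLAgents;

        // Sketch: set the DecisionRequester fields from code instead of
        // the Inspector, matching the test described above.
        public class DecisionSettingsForTest : MonoBehaviour
        {
            void Awake()
            {
                var requester = GetComponent<DecisionRequester>();
                requester.DecisionPeriod = 5;                  // decision every 5 FixedUpdate steps
                requester.TakeActionsBetweenDecisions = false; // no repeated actions in between
            }
        }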
     

    Attached Files:

    Last edited: May 9, 2020
  13. Roboserg

    Joined: Jun 3, 2018
    Posts: 83
    The issue we don't understand is that adding more training areas or environments (instances of Unity) does not make training any faster (i.e. the number of steps per minute does not increase) beyond a certain number of training areas/environments. There seems to be some bottleneck, or artificial brake, in ML-Agents/Unity; otherwise I don't understand how the CPU and GPU can be below 100% while the training speed still does not increase with additional training areas/environments.
     
  14. harpj

    Unity Technologies
    Joined: Jun 20, 2017
    Posts: 6
    Hi @shebotnov, it's not certain that increasing the number of environments will improve training speed. At a high level, ML-Agents runs inference to feed actions to each parallel environment worker and receives observations back from them. Series of actions and observations are combined into trajectories, which fill a buffer.

    In a hypothetical case where you have a buffer size of 1000, a trajectory length of 100, and 10 parallel environments (one agent each), you will fill the buffer after a single trajectory from each environment. In this case, increasing the number of environments is unlikely to help and may actually be harmful by increasing CPU contention.
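    Concretely, with those numbers: 10 envs × 100 steps per trajectory = 1,000 steps, which is exactly the buffer size, so each environment contributes one trajectory per model update. An eleventh environment would not fill the buffer any faster; it would only add scheduling overhead.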

    In this sort of situation you have a few potential ways to improve performance:
    • improve the environment's performance, so each step is collected faster
    • change the training configuration: using a larger buffer may make each update more effective
    • improve the model-update performance, since environments are idle while the model is updating
    It's hard to say what the most effective way to improve training performance is without benchmarking a specific environment and configuration and trying out alternatives. I hope this helps give you an idea of where the bottlenecks might lie in the ML-Agents training process.