Bug: Training freezing every 10,000 steps for 60-80 seconds

Discussion in 'ML-Agents' started by Thorce, Jun 17, 2023.

  1. Thorce

    Joined: Jul 3, 2019
    Posts: 41
    Hey guys!

    As mentioned in the title, Unity freezes periodically during training. It happens roughly every 10k steps, and each freeze lasts 60-80 seconds. This effectively doubles my training time and is extremely annoying.

    I really appreciate any help on this!

    Here is the training output, where I marked the points at which the freeze happens. Below that you can see my hyperparameters.

    [Screenshot: training console output with the freeze points marked]

    [Screenshot: trainer configuration (hyperparameters)]

    Unity: 2022.2.19
    MLAgents: 2.0.1
    MLAgents Py Package: 0.29.0 / 0.30.0
     
  2. Thorce

    Joined: Jul 3, 2019
    Posts: 41
    I figured out that this is most likely caused by the buffer_size being reached.
    Is there any way to optimize it so that my training time does not get doubled?
     
  3. Luke-Houlihan

    Joined: Jun 26, 2007
    Posts: 303
    That's your buffer hitting full size and the model weights being updated. When buffer_size is reached, the trainer runs a gradient descent update for every batch_size chunk in the buffer. If you watch the environment, the simulation likely freezes at this point too. This is expected, but you can probably speed it up by using a fast GPU instead of the CPU for the model updates. If your agent can handle the learning implications, you can also lower buffer_size to update more often, but that just spreads the same cost over smaller, more frequent pauses; the total cost of updating the model is always there.
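    To get a feel for the size of that pause, here's a rough back-of-the-envelope sketch (my own illustration with placeholder numbers, not ML-Agents source) of how many gradient steps one policy update costs:

        # Each time the buffer fills, PPO makes num_epoch passes over it in
        # batch_size chunks, so one policy update runs this many gradient steps.
        buffer_size = 10240  # placeholder; use the value from your config
        batch_size = 1024    # placeholder
        num_epoch = 3        # ML-Agents PPO default

        gradient_steps = (buffer_size // batch_size) * num_epoch
        print(gradient_steps)  # -> 30 mini-batch updates in one burst

    All of those run back to back, which is exactly the freeze you're seeing.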
     
  4. mcdenyer

    Joined: Mar 4, 2014
    Posts: 48
    @Luke-Houlihan It's been my understanding that CPU > GPU for PPO. Am I wrong?
     
  5. Thorce

    Joined: Jul 3, 2019
    Posts: 41
    I solved my problem by using my GPU for training, as @Luke-Houlihan suggested. Using a GPU does not necessarily speed up the simulation itself, but it speeds up the model updates by quite a bit.
    Using my GPU and training in a build with --no-graphics yielded an 8x reduction in training time!
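    For reference, this is roughly how I launch it now (a sketch; the config file name and build path are placeholders, the flags are the standard mlagents-learn ones):

        # Launch headless training against a player build, with model
        # updates on the GPU. Paths and names below are placeholders.
        import subprocess

        subprocess.run([
            "mlagents-learn", "config.yaml",
            "--env=Builds/MyEnv",    # train in a build instead of the Editor
            "--no-graphics",         # skip rendering (vector observations only)
            "--torch-device=cuda",   # run the model updates on the GPU
            "--run-id=gpu_run",
        ])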
     
  6. mcdenyer

    Joined: Mar 4, 2014
    Posts: 48
    Interesting, @Thorce, I will have to try that out.
     
  7. mcdenyer

    Joined: Mar 4, 2014
    Posts: 48
    @Thorce I was only able to cut 30 minutes off a 9-hour run by using the GPU, and I have a much larger network than you, so I'd expect an even bigger difference.

    Mind telling me which GPU you have, and which PyTorch and CUDA versions you are using? Also, how many agents are you training per environment, and how many environments are you running?
     
  8. Thorce

    Joined: Jul 3, 2019
    Posts: 41
    I have an RTX 3080 and I am running PyTorch 1.8.0+cu111, with 1 environment and 16 agents.
     
  9. An-u-rag

    Joined: Nov 23, 2020
    Posts: 6
    @Thorce I just started experimenting with ML-Agents, and the documentation is poor, to say the least. How exactly do we "use" the GPU? Are you just referring to the --torch-device=cuda flag? I tried it but barely see any difference. Also, I am using visual observations, so that means I cannot run in --no-graphics mode, correct?
     
  10. Thorce

    Joined: Jul 3, 2019
    Posts: 41
    To use your GPU for training you have to install the CUDA version of PyTorch. As for running in no-graphics mode with visual observations, I think that does not work, but I have very limited experience with visual observations.
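    A quick way to sanity-check the install (standard PyTorch calls, nothing ML-Agents-specific):

        # Confirm a CUDA build of PyTorch is present before passing
        # --torch-device=cuda to mlagents-learn.
        import torch

        print(torch.__version__)          # e.g. "1.8.0+cu111" for a CUDA build
        print(torch.cuda.is_available())  # must print True for GPU training
        if torch.cuda.is_available():
            print(torch.cuda.get_device_name(0))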