
Please provide a guide to optimizing training speed

Discussion in 'ML-Agents' started by mbaske, Jun 25, 2020.

  1. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Training speed is crucial in deep learning, but there seems to be some confusion (at least on my part) about how to optimize it with ML-Agents.
    My initial thinking was that I'd need to invest in a bigger GPU. But after reading this discussion (https://github.com/Unity-Technologies/ml-agents/issues/4129), I got the impression that although GPUs can help with rendering when using visual observations, they don't accelerate training otherwise. Or do they? This post (https://forum.unity.com/threads/cpu-vs-gpu.918869/#post-6018836), on the other hand, seems to suggest that GPUs can be leveraged for training, given the right TensorFlow version and CUDA drivers.
    I'd also be interested in how GPU accelerated training (if possible) compares to CPU training with multiple executables.
    Finally, there's the time-scale issue: referring to these posts (https://forum.unity.com/threads/can...an-training-acceleration.919295/#post-6021062 https://forum.unity.com/threads/speed-of-the-training.907058/#post-5977154), I wonder if there are any real-world data on whether and when high time scales become a problem for training.

    It would be great to have all the relevant info on training speed in one place.
    Thanks!
     
  2. andrzej_

    andrzej_

    Joined:
    Dec 2, 2016
    Posts:
    81
    I use the tf-gpu version, so training should, in theory, utilize the GPU to its full extent, but when I ramp up the number of parallel environments I run into memory allocation issues. I only get ~10-15% CPU and GPU load, and at least when I look at the task manager I still have plenty of free memory (it's 64 GB RAM + 24 GB VRAM).
    But I should add that this is training with visual observations, and I haven't tried copying agents within the same Unity instance.
    For vector observations you might want to try a headless server build so there's no bottleneck with rendering.
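    If you go the build route, a quick way to sanity-check that a headless executable runs without rendering is the low-level mlagents_envs Python API - rough sketch, the file name is just a placeholder:

    ```python
    from mlagents_envs.environment import UnityEnvironment

    # "ServerBuild" is a placeholder for the path to your headless executable.
    # worker_id lets several instances run side by side on different ports.
    env = UnityEnvironment(file_name="ServerBuild", no_graphics=True, worker_id=0)
    env.reset()
    env.close()
    ```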

    I do hope that at one point all the ECS features, including DOTS cameras, will remove all bottlenecks and help with scaling ML-agents.
     
  3. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Does it in practice though? Is there a decisive speed advantage over CPU training?
    Yeah, I hardly use visual observations and mostly train with the --no-graphics option.
     
  4. andrzej_

    andrzej_

    Joined:
    Dec 2, 2016
    Posts:
    81
    I haven't done any testing, to be honest.
    My next little experiment will probably be without visual observations, so I'll be able to run it with --no-graphics and/or try disabling CUDA devices (too lazy to make a non-GPU TF Anaconda env just for testing ;))
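    Or, instead of a separate env, I could just hide the GPU from TensorFlow via CUDA_VISIBLE_DEVICES - untested sketch, assuming TF 2.x:

    ```python
    import os

    # Hide all CUDA devices before TensorFlow is imported, so training
    # falls back to the CPU even with tensorflow-gpu installed.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    import tensorflow as tf

    # Should print an empty list if the GPU is hidden successfully.
    print(tf.config.list_physical_devices("GPU"))
    ```

    Setting the same variable in the shell before launching mlagents-learn should have the same effect.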
     
  5. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hey all, this is an excellent suggestion and I've logged it as something we should add to the documentation.

    Part of the difficulty is that which optimizations do and don't work is very dependent on how the Unity game is written, and it's hard to give advice that would apply to all games. For instance, we've found that increasing the timescale is particularly an issue if the walls/colliders are thin and the agents move at high velocity, but the shape of the agent also matters, etc.
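    If you want to experiment with where physics starts to break, the time scale can also be adjusted from the Python side through the engine configuration side channel - rough sketch, the build path is a placeholder:

    ```python
    from mlagents_envs.environment import UnityEnvironment
    from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

    # The side channel has to be passed in when the environment is created.
    channel = EngineConfigurationChannel()
    env = UnityEnvironment(file_name="MyBuild", side_channels=[channel])

    # Try progressively higher time scales and watch for agents tunneling
    # through thin colliders or other physics glitches.
    channel.set_configuration_parameters(time_scale=20.0)
    env.reset()
    ```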

    But some things are more clear-cut - for instance the use of tf-GPU usually does not help unless you're using visual observations or very large batches. GPUs are good at doing many parallel computations at once, but they have some overhead in terms of getting the data into and out of GPU memory. For small fully-connected networks (e.g. vector obs) with the type of batch sizes used in RL, that overhead typically outweighs the performance gains. With visual obs, you can usually improve GPU utilization by increasing the batch size.
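    If you're curious, you can get a feel for that overhead by timing a small fully-connected layer on CPU vs. GPU - rough sketch, assuming TF 2.x and purely illustrative sizes:

    ```python
    import time
    import numpy as np
    import tensorflow as tf

    def time_dense(device, batch, iters=100):
        # Build and run a small fully-connected layer on the given device.
        with tf.device(device):
            layer = tf.keras.layers.Dense(128)
            x = tf.constant(batch)
            layer(x)  # warm-up: creates the weights
            start = time.perf_counter()
            for _ in range(iters):
                y = layer(x)
            y.numpy()  # pull the result back so the timing includes GPU sync
            return time.perf_counter() - start

    # An RL-sized batch of vector observations (sizes are made up).
    batch = np.random.rand(1024, 128).astype(np.float32)

    print("CPU:", time_dense("/CPU:0", batch))
    if tf.config.list_physical_devices("GPU"):
        print("GPU:", time_dense("/GPU:0", batch))
    ```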
     
    mbaske likes this.