Question: ML-Agents Release 6 WalkerStatic multi-env long training instability

Discussion in 'ML-Agents' started by yehor_k, Aug 25, 2020.

  1. yehor_k

    Joined:
    Aug 25, 2020
    Posts:
    3
    Hi to all :)

    I'm looking for a stable training baseline to build my research on. Long-run stability seems important to me for reaching advanced results.

    I took Release 6 of ML-Agents, changed max_steps from 20_000_000 to 1_000_000_000 in config/.../WalkerStatic.yaml, and launched training with both PPO and SAC using num-envs = 8.
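
    For reference, the relevant part of the edited config looked roughly like this (abbreviated from memory; the behavior key follows the Release 6 layout, everything not shown was left at the defaults, and the paths/run id in the launch command are just placeholders):

    # launched with: mlagents-learn config/.../WalkerStatic.yaml --env=<built WalkerStatic> --num-envs=8 --run-id=<run id>
    behaviors:
      WalkerStatic:
        trainer_type: ppo        # sac for the SAC run
        max_steps: 1000000000    # raised from 20000000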

    SAC appeared to be a bit more stable, but both algorithms degraded after 20-50M training steps.
    SAC:
    SAC.png
    PPO:
    PPO.png

    I would be thankful for any ideas about why this may happen and how to set up a baseline that plateaus at the highest achievable reward without reward drops. Or, perhaps, some clues as to why that might not be possible due to specifics of the Unity physics engine, the current ML-Agents implementation, or something else.

    Thanks,
    Yehor.
     
  2. ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Issues like this are usually caused by a glitch/bug in the environment. For instance, an observation could inadvertently be set to NaN, or the agent could fall off the platform, introducing a bad value in the data and perhaps setting the neural network weights to NaN.
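
    As a toy illustration (plain NumPy, not anything from the trainer): a single non-finite observation corrupts the corresponding weights on the very next gradient step, and from then on the network outputs NaN:

    import numpy as np

    # One bad sensor value in an otherwise healthy observation vector.
    obs = np.array([0.1, -0.3, np.nan, 0.7])
    w = np.ones((4, 2))                          # weights of a tiny linear layer

    grad = np.outer(obs, np.array([1.0, -1.0]))  # the gradient w.r.t. w contains the observation
    w -= 0.01 * grad                             # one SGD step
    print(np.isnan(w).any())                     # True -> every later forward pass produces NaN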

    We've made considerable stability tweaks to Walker since the last release (in addition to fixing a normalization issue that could have been causing NaNs in the trainer). Could you try the latest version on master?
     
  3. yehor_k

    Joined:
    Aug 25, 2020
    Posts:
    3
    @ervteng_unity thanks a lot for the clues and for the awesomely fast response! I've just switched to master, rebuilt the WalkerStatic env, and relaunched training with the same parameters as before. I'll come back to this thread in the coming days once I have new reward plots :rolleyes:
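
    Roughly how I switched over (repo URL is the official GitHub; the Walker scene is rebuilt from the Unity Editor inside the master checkout):

    git clone https://github.com/Unity-Technologies/ml-agents.git
    cd ml-agents                        # master as of 2020-08-25
    pip3 install -e ./ml-agents-envs
    pip3 install -e ./ml-agents
    # then reopen the Project in the Unity Editor, rebuild the WalkerStatic scene,
    # and relaunch mlagents-learn with the same config and --num-envs=8 as before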
     
  4. yehor_k

    Joined:
    Aug 25, 2020
    Posts:
    3
    @ervteng_unity here are the results of using ML-Agents master (as of 25 August 2020) for long-running training with 8 parallel envs:
    SAC:
    2020-08-25-sac.png (died after 4d 17h of wall time training)
    PPO:
    2020-08-25-ppo.png (died after 21h of wall time training)
    So PPO became a bit more stable, while SAC stayed stable for quite a long time but, unfortunately, caught something and died after 160M timesteps.

    Do I understand correctly that your hypothesis is that this is caused by NaNs passed to the neural network somewhere within the observations vector, while the underlying physical quantities are in fact not NaN, so these are likely Unity physics engine flaws? Do you have any prior experience regarding which components usually produce those NaNs: positions, rotations, velocities, strengths, etc.? Any in-depth clues would be very helpful, if available :)
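
    In case it helps narrow things down, this is the kind of check I'm planning to run against the raw observations with the low-level Python API, just to see which components go non-finite first (a rough sketch based on my reading of the current API docs; the build path is a placeholder and I'm feeding empty actions instead of a real policy):

    import numpy as np
    from mlagents_envs.environment import UnityEnvironment

    env = UnityEnvironment(file_name="<built WalkerStatic>")   # placeholder path to the built env
    env.reset()
    behavior_name = list(env.behavior_specs)[0]
    spec = env.behavior_specs[behavior_name]

    for step in range(100_000):
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        for obs in decision_steps.obs:                          # one array per observation/sensor
            bad = ~np.isfinite(obs)
            if bad.any():
                agents, components = np.nonzero(bad)
                print(f"step {step}: non-finite obs for agents {agents} at components {components}")
        env.set_actions(behavior_name, spec.create_empty_action(len(decision_steps)))
        env.step()
    env.close()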