Hi all,

I'm looking for a stable training baseline to build my research on; long-run stability is important to me for reaching advanced results.

I took Release 6 of ML-Agents, changed max_steps from 20_000_000 to 1_000_000_000 in config/.../WalkerStatic.yaml, and launched training with both PPO and SAC, setting num-envs = 8. SAC appeared to be a bit more stable, but both algorithms degrade after 20-50M training steps.

SAC: [reward curve attached]

PPO: [reward curve attached]

I would be thankful for any ideas on why this may happen, and on how to set up a baseline that plateaus at the highest achievable reward without reward drops. Alternatively, any clues as to why that might not be possible due to specifics of the Unity physics engine, the current ML-Agents implementation, or something else would also be appreciated.

Thanks,
Yehor
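
P.S. To be concrete about the change: here is a minimal sketch of the edit (the exact path is abbreviated above; I'm assuming the Release 6 layout, where the PPO config lives at config/ppo/WalkerStatic.yaml and the SAC one at config/sac/WalkerStatic.yaml, and the run IDs below are just my own labels):

```yaml
behaviors:
  WalkerStatic:             # behavior name as it appears in the shipped config
    trainer_type: ppo       # "sac" in the SAC config
    # ...all other hyperparameters left at their Release 6 defaults...
    max_steps: 1000000000   # raised from the default 20000000
```

Each run was then launched along the lines of `mlagents-learn config/ppo/WalkerStatic.yaml --run-id=walker_ppo --num-envs=8 --env=<path-to-Walker-build>` (and the same with the SAC config), since `--num-envs` requires a built executable rather than the Editor.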