Hi all,

I'm looking for a stable training baseline to build my research on; long-run stability is important to me for reaching advanced results.

I took Release 6 of ML-Agents, changed max_steps from 20_000_000 to 1_000_000_000 in config/.../WalkerStatic.yaml, and launched training with both PPO and SAC, setting num-envs = 8. SAC appeared to be a bit more stable, but both algorithms degrade after 20-50M training steps.

SAC: [reward curve attached]

PPO: [reward curve attached]

I would be thankful for any ideas on why this may happen, and on how to set up a baseline that plateaus at the highest achievable reward without reward drops. Alternatively, any clues as to why that might not be possible due to specifics of the Unity physics engine, the current ML-Agents implementation, or something else would also be appreciated.

Thanks,
Yehor
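
P.S. To be concrete about the change: here is a minimal sketch of the edit (the exact path is abbreviated above; I'm assuming the Release 6 layout, where the PPO config lives at config/ppo/WalkerStatic.yaml and the SAC one at config/sac/WalkerStatic.yaml, and the run IDs below are just my own labels):

```yaml
behaviors:
  WalkerStatic:             # behavior name as it appears in the shipped config
    trainer_type: ppo       # "sac" in the SAC config
    # ...all other hyperparameters left at their Release 6 defaults...
    max_steps: 1000000000   # raised from the default 20000000
```

Each run was then launched along the lines of `mlagents-learn config/ppo/WalkerStatic.yaml --run-id=walker_ppo --num-envs=8 --env=<path-to-Walker-build>` (and the same with the SAC config), since `--num-envs` requires a built executable rather than the Editor.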