Hi, I'm debugging an issue with the SAC trainer where agents learn to play very well and then regress to doing nothing. In my setup there are two competing teams, and I use zero-sum rewards (i.e. whenever Team A gets +R, Team B gets -R, for every reward event). The setup works very well up to a point where everything goes wrong and all agents decide to just do nothing. (If I save the model before that point, they actually play pretty well.) I have tried exiting and reloading at regular intervals to rule out a problem with running the simulation itself for a long period of time; it's really not that. I've attached a screenshot of an example tensorboard run.

A few notes:

- Cumulative rewards are always zero due to the zero-sum setup.
- Episode length goes down when one team can fulfill the winning condition before time runs out, so an early episode end is a good sign.
- After a while, episode length goes back to the max because all agents are doing nothing (not because of better defense, for instance).

Now, I find the Extrinsic Value Estimate graph confusing, because the value estimate seems to go *down* while one team is fulfilling the winning condition! Is there a reason a state where one team gets +R and the other gets -R would have a lower value estimate than a state where both teams get 0?
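To make the question concrete, here's a toy sketch of how I'd naively expect the value estimate to behave under the standard discounted-return definition (just my mental model, not ML-Agents internals):

```python
# Toy sketch of my naive expectation (not ML-Agents internals):
# per-agent value ~ expected discounted return from a state.
GAMMA = 0.99

def discounted_return(rewards, gamma=GAMMA):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# State where Team A is two steps from winning (+1) and Team B loses (-1):
team_a_value = discounted_return([0.0, 0.0, 1.0])   # ~ +0.98
team_b_value = discounted_return([0.0, 0.0, -1.0])  # ~ -0.98

# State where everyone does nothing and rewards stay 0:
idle_value = discounted_return([0.0, 0.0, 0.0])     # 0.0

# Naively, averaging across *all* agents gives ~0 in both cases, and
# the winning team's agents should see a *higher* value than in the
# idle state, not a lower one. Hence my confusion about the graph.
print(team_a_value, team_b_value, idle_value)
```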