I created a simple environment with self-play, and since my agents had a hard time learning the basics, I created a scene without self-play so they could learn by fighting a simple scripted AI that improves through curriculum training. This worked perfectly, and I then ran the original self-play setup with --initialize-from so they could keep improving through self-play.

They are improving (I ran the models manually and saw them winning 90% of fights against the original model), but the cumulative reward is negative and the Elo is going down. The only explanation I can think of is that the learning agent is acting with entropy (sampling actions stochastically during training), so it plays worse than its actual policy and loses to the ghost/snapshot opponents it's fighting. Am I missing something here? Is there a way to calculate Elo without having it be affected by entropy?
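For context on why Elo can drop even while the policy improves: as far as I understand, ML-Agents applies a standard Elo update to the learning agent after each completed game, based only on win/loss outcomes against the current snapshot opponent (the K-factor and initial rating here are illustrative assumptions, not taken from the trainer config). A minimal sketch of that update:

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 16.0) -> float:
    """Return A's new rating after a game with outcome score_a (1 win, 0.5 draw, 0 loss)."""
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))

# Example: two equally rated players; the winner gains half the K-factor.
print(elo_update(1200.0, 1200.0, 1.0))  # 1208.0
```

So any losses caused by stochastic (high-entropy) action sampling during training count directly against the rating; evaluating the model deterministically outside of training, as described above, sidesteps that.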