According to the docs, the reward in self-play should be +1 for winning, -1 for losing, and 0 for a draw. However, in my case the agents struggle to learn from only 1, -1, and 0, so I added extra shaped rewards, some with magnitudes beyond the [-1, 1] range, to teach them useful actions. Then at the end of each episode: win => SetReward(1f), lose => SetReward(-1f), MaxStep reached => SetReward(0f).

After 1 million training steps, I saw only negative mean rewards (summarized every 50k steps), and ELO decreased overall. I'm wondering if this is caused by my reward shaping.
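For reference, here is roughly how my rewards are assigned (a minimal sketch of the setup described above; `OnUsefulAction`, `OnMatchEnd`, and the reward magnitudes are placeholders for my own game logic, not ML-Agents API, and only `AddReward`/`SetReward`/`EndEpisode` come from the library):

```csharp
using Unity.MLAgents;

public class SelfPlayAgent : Agent
{
    // Shaped intermediate reward for a useful in-game action;
    // note the magnitude goes beyond the [-1, 1] range from the docs.
    public void OnUsefulAction()
    {
        AddReward(2f); // placeholder value for illustration
    }

    // Terminal outcomes, following the self-play docs.
    // SetReward overwrites any reward already accumulated this step.
    public void OnMatchEnd(bool won, bool lost)
    {
        if (won) SetReward(1f);
        else if (lost) SetReward(-1f);
        else SetReward(0f); // draw, or MaxStep was reached

        EndEpisode();
    }
}
```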