Taking this image into account: is there a problem with this training run? The final behaviour my agents end up with is the one I was looking for. What does it mean that at around 3.5M steps the reward starts to go down? Is this a problem, and how can I explain this behaviour in my document? Here is the entropy plot as well, in case anyone needs it. Could the drop be caused by the reward the agent receives from the entropy bonus?