I don't fully understand the Cumulative Reward plot when training with PPO, especially when I have multiple agents. Does it plot the average cumulative reward across the N agents, or the sum of the agents' rewards? Also, when and how is it computed: is it an average over separate "validation" episodes, or is it the reward collected during training itself?
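To make the two readings concrete, here is a minimal sketch with made-up numbers. The list `episode_returns` and both helper functions are hypothetical names of my own, not anything taken from the training tool, and I don't know which of the two (if either) the plot actually shows.

```python
# Hypothetical per-agent cumulative rewards, one finished episode per agent.
# These values are invented purely to illustrate the question.
episode_returns = [1.0, 3.0, 2.0]


def mean_return(returns):
    """Reading 1: average cumulative reward across the N agents."""
    return sum(returns) / len(returns)


def total_return(returns):
    """Reading 2: sum of the cumulative rewards of all agents."""
    return sum(returns)


if __name__ == "__main__":
    print("mean over agents:", mean_return(episode_returns))
    print("sum over agents:", total_return(episode_returns))
```

With three agents the two readings differ by a factor of three, so which one the plot uses changes how the curve should be interpreted.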