ELO decreasing with positive mean reward

Discussion in 'ML-Agents' started by GGCid, Mar 22, 2021.

1. GGCid

Joined:
Sep 4, 2019
Posts:
4
Hello,

I have set up a competitive environment with two teams of agents. As the mean reward has increased over time, reaching satisfactory results, the ELO has been (almost linearly) decreasing.
Although the past mean rewards of the agent are consistently positive (the STD is high too), the ELO score of the agent keeps decreasing.
Is it common to see the ELO decrease with positive mean rewards?
Is it bad that the ELO is decreasing, or can it be ignored since the agent has learned to act in the said environment?

As for the environment and the reward function, there are two teams of agents in a fight, with the following rewards:
• Reward depending on the distance between the agent and its closest enemy (negative if it's far away, positive if it's close)
• Positive reward for useful actions, such as hitting an enemy or defending against an attack
• Negative reward when taking damage
• Episode reward set depending on the result of the battle: -1 if the team lost, 0 if it was a draw and +1 if the team won
The mean reward surpasses 1 and the agent has learned to a satisfactory degree, but as mentioned above, the ELO keeps decreasing.

2. mamaorha

Joined:
Jun 16, 2015
Posts:
44
Are you sure your final reward is -1/0/+1 depending on lose/draw/win?

3. GGCid

Joined:
Sep 4, 2019
Posts:
4
The mean reward does go above 1. In any case, it is set to the specific values at the end of the episode.
Does it strictly have to be +1 to be considered as win and -1 for loss? Or just being >=1 can be considered as win and <=-1 for loss?

4. Unity Technologies

Joined:
Sep 16, 2015
Posts:
735
For the ELO to be calculated correctly, the rewards need to be consistent: the losing team's reward is always negative, the winning team's reward is always positive, and both are equal to 0 in a draw.

Your reward functions may be giving too high of a reward to the losing team, which is why your ELO may be decreasing.
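To see why the sign matters, here is a sketch of the standard Elo update with the final reward mapped onto a game result by its sign. This is an illustration of the mechanism, not ML-Agents' actual implementation; the K-factor and internal bookkeeping there may differ.

```python
def elo_update(rating_a, rating_b, result_a, k=16.0):
    """Standard Elo update. result_a is 1.0 for a win by A,
    0.5 for a draw, 0.0 for a loss. Returns the new ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    change = k * (result_a - expected_a)
    return rating_a + change, rating_b - change

def result_from_final_reward(reward):
    """Map an episode's final reward onto an Elo result:
    positive -> win, negative -> loss, zero -> draw."""
    if reward > 0:
        return 1.0
    if reward < 0:
        return 0.0
    return 0.5
```

If a losing team's final reward ends up positive anyway (e.g. because accumulated shaping rewards outweigh the -1), it gets scored as a win here, and the rating of the learning policy drifts the wrong way even though the mean reward looks healthy.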

5. Unity Technologies

Joined:
Sep 16, 2015
Posts:
735
So, if a team loses, you should set all of the agents' rewards on that team to -1. When a team wins, you can set the reward to 1.0f - penalties. This lets the agent know that although it won, it can do better.
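That end-of-episode logic can be sketched like this, with plain Python standing in for the per-agent SetReward calls; `penalties_per_agent` is assumed to be each winning agent's accumulated shaping penalties, and the clamp to a small positive value is an extra safeguard (not from the post) so a heavily penalized winner is still scored as a win.

```python
def final_rewards(won, drew, penalties_per_agent):
    """Final reward for each agent on a team, keeping the sign
    consistent with the outcome: losers -1, draw 0, winners
    1 minus their accumulated penalties."""
    if drew:
        return [0.0 for _ in penalties_per_agent]
    if not won:
        return [-1.0 for _ in penalties_per_agent]
    # Winners: 1 - penalties, clamped so the final reward never
    # dips to a non-positive value, which Elo would read as a loss.
    return [max(1.0 - p, 0.01) for p in penalties_per_agent]
```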

6. GGCid

Joined:
Sep 4, 2019
Posts:
4

During an episode, when an agent dies I do not set its reward to negative; instead I try to set it when all the agents of the team have died (when they lose). As I am disabling agents that have died, I try to set their reward to negative during their re-activation, which most likely messes things up (the reward gets set for the wrong episode).
Nevertheless, I will have to test it and come back to you.
Thank you both for your replies.

On that note though, how should one design the reward function in this kind of situation? The agent dies during an episode, but may win due to the actions of another agent.
For example, it's a 2 versus 2 situation, where one agent from the first team dies, but its remaining teammate manages to defeat the other two agents. Thus, the result for the first team is a win, but the agent's episode reward is still negative.
Do you just reward agents that win and survived and simply punish agents for dying?

7. Unity Technologies

Joined:
Sep 16, 2015
Posts:
735
Hi @GGCid,
Funny you should ask about this as we just had a release that addresses this issue and our research team implemented a novel algorithm to solve it. Take a look at Release 15 of ml-agents

Major Changes

com.unity.ml-agents (C#)
The BufferSensor and BufferSensorComponent have been added. They allow the Agent to observe a variable number of entities. For an example, see the Sorter environment. (#4909)
The SimpleMultiAgentGroup class and IMultiAgentGroup interface have been added. These allow Agents to be given rewards and end episodes in groups. For examples, see the Cooperative Push Block, Dungeon Escape and Soccer environments. (#4923)

ml-agents / ml-agents-envs / gym-unity (Python)
The MA-POCA trainer has been added. This is a new trainer that enables Agents to learn how to work together in groups. Configure poca as the trainer in the configuration YAML after instantiating a SimpleMultiAgentGroup to use this feature. (#5005)
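For reference, a minimal trainer-configuration sketch for the poca trainer. The behavior name and every hyperparameter value here are placeholders, not recommendations; the full option set is in the ml-agents training-configuration docs.

```yaml
behaviors:
  MyTeamBehavior:          # placeholder behavior name
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
    network_settings:
      hidden_units: 256
      num_layers: 2
    max_steps: 10000000
    self_play:             # still used for the adversarial part
      save_steps: 50000
      team_change: 200000
      swap_steps: 2000
      window: 10
      play_against_latest_model_ratio: 0.5
```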

Last edited: Mar 22, 2021
8. GGCid

Joined:
Sep 4, 2019
Posts:
4
That's great!
I will check this out as well.
Thank you for your help.

9. mamaorha

Joined:
Jun 16, 2015
Posts:
44
Sorry to hijack this thread, but I need clarification on what you wrote here.
During the game we assign rewards & penalties, and at the conclusion I added a reward of -1/0/1 based on lose/draw/win. From what I understand here, that's wrong?

I did it based on this documentation:
https://github.com/Unity-Technologi...-Configuration-File.md#note-on-reward-signals

Should I do it like the following:
1. give penalties/rewards, saving the accumulated total aside
2. on game conclusion, use SetReward, assigning -1 / 0 based on lose/draw and SetReward(1-(rewards-penalties)) for the winning team?

10. Unity Technologies

Joined:
Sep 16, 2015
Posts:
735
Depending on the environment, you may want to ensure that the winning team knows it can do better and did some things that weren't so great.

I ran into this with my own experiments training the tanks environment. I would penalize a tank for shooting itself, give a small reward for hitting another tank, and add a small penalty just for shooting, to help nudge it to shoot as little as possible.

At the end of the round the loser gets:
SetReward(-1.0f)

The winner gets the win reward via AddReward instead of SetReward, which gives it that extra information to help shape the behavior for the winners.

this is not necessary for every environment. It can help for situations where there are very sparse rewards.
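The SetReward/AddReward distinction above boils down to overwrite versus accumulate. A tiny Python model of the two calls' semantics (simplified; this is not the Unity API, just the effect on the reward the trainer sees):

```python
class RewardTracker:
    """Minimal model of an ML-Agents agent's reward since the last
    decision: set_reward overwrites whatever shaping reward has
    accumulated, add_reward stacks the new reward on top of it."""
    def __init__(self):
        self.reward = 0.0

    def add_reward(self, r):
        self.reward += r

    def set_reward(self, r):
        self.reward = r

# Loser: shaping rewards are wiped, the outcome alone is kept.
loser = RewardTracker()
loser.add_reward(0.4)      # shaping reward earned this step
loser.set_reward(-1.0)     # outcome overwrites it

# Winner: the outcome stacks on top of the shaping signal.
winner = RewardTracker()
winner.add_reward(-0.3)    # e.g. a penalty for a wasted shot
winner.add_reward(1.0)     # win bonus added on top -> net 0.7
```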

11. mbaske

Joined:
Dec 31, 2017
Posts:
473
I have a related question about this (not sure if that warrants its own thread) - In most situations, one agent wins and the other loses, so I assign +1/-1 rewards accordingly. But then there are also cases where an agent can be disqualified for performing some prohibited action unrelated to win/lose criteria. If that happens, the agent receives a -1 penalty and the episode ends. However, the opponent hasn't really won, it didn't contribute anything to ending the episode, so it isn't rewarded. As a result, cumulative rewards are slightly below zero on average, and ELO keeps decreasing.
How should I change my reward logic to be more in line with self-play zero sums, but still be informative as to whether an episode ended in win/lose or not?

12. Unity Technologies

Joined:
Sep 16, 2015
Posts:
735
I had a similar question when I was tinkering with tanks. At some point during training, one policy would just wait for the other to kill itself. This is technically a "win" but not behavior I want to reward. In the end, the discussion ended with that it still makes sense to reward that policy because it technically did win. In theory, its backing up behavior will be punished later by a better policy, which will still result in the ELO going up.

So, an older policy does something janky: lose:-1, win:1

Last edited: Mar 31, 2021
13. WaxyMcRivers

Joined:
May 9, 2016
Posts:
57
Given this statement and the results that mbaske is referring to (the dropping ELO): is it safe to conclude that with some self-play problems, it's OK/normal to observe a drop in ELO initially?

14. Unity Technologies

Joined:
Sep 16, 2015
Posts:
735
I can't give you a definitive answer here. If your ELO is continually going down, maybe there is something wrong. Or maybe the network is learning the rules of the game and its exploration of the environment is causing the agent to lose more often. In these cases my intuition would be to let it train for a while and see what happens over the long run. With sparse reward structures it can take a lot of time for an agent to hit that "aha!" moment.

15. HQF

Joined:
Aug 28, 2015
Posts:
38
So you're talking about the Agent's reward? I mean, at the end of the episode I need to SetReward for the groups as -1/0/1-penalty, and in addition I need to SetReward for each agent in the losing group as -1?

For now my reward system is:
-existential penalty to push the agent to take fewer actions to reach the result
+0.1 reward to an agent when it attacks another team's agent
-0.05 penalty if the agent was attacked
+0.5 reward when an agent destroys another agent
-0.25 penalty if an agent was destroyed

And for groups I use this reward at the end of the episode:
SetReward(-1/0/1-penalty) (based on lose/draw/win)

Penalty is currStep/maxSteps
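Written out, that scheme looks like the following sketch (values taken from the post above; `curr_step`/`max_steps` as described):

```python
def time_penalty(curr_step, max_steps):
    """Grows from 0 to 1 over the episode, so later wins are worth less."""
    return curr_step / max_steps

def group_end_reward(outcome, curr_step, max_steps):
    """outcome: 'lose', 'draw' or 'win'. Mirrors the
    SetReward(-1 / 0 / 1 - penalty) call at the end of the episode."""
    if outcome == "lose":
        return -1.0
    if outcome == "draw":
        return 0.0
    return 1.0 - time_penalty(curr_step, max_steps)
```

Note one consequence for the Elo discussion earlier in the thread: a win on the very last step yields 1 - 1 = 0, which a sign-based win/draw/loss mapping would score as a draw rather than a win.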