
ELO decreasing with positive mean reward

Discussion in 'ML-Agents' started by GGCid, Mar 22, 2021.

  1. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    Hello,

    I have set up a competitive environment with two teams of agents, and while the mean reward has increased over time, reaching satisfactory results, the ELO has been (almost linearly) decreasing.
    Even though the mean rewards of the agent are consistently positive (although the standard deviation is high too), the ELO score of the agent keeps decreasing.
    Is it common to see the ELO decrease with positive mean rewards?
    Is it bad that the ELO is decreasing, or can it be ignored given that the agent has learned to act in the environment?

    As for the environment and the reward function: there are two teams of agents in a fight, with the following rewards (a sketch of this scheme follows the list):
    • A reward depending on the distance between the agent and its closest enemy (negative if it is far away, positive if it is close)
    • A positive reward for useful actions, such as hitting an enemy or defending against an attack
    • A negative reward when taking damage
    • An episode reward set depending on the result of the battle: -1 if the team lost, 0 if it was a draw, and +1 if the team won
    Although the mean reward surpasses 1 and the agents have learned to a satisfactory degree, as mentioned above, the ELO keeps decreasing.
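
    For concreteness, here is a minimal sketch of that reward scheme; the hook methods, shaping constants, and MaxRange are placeholders of mine, not the exact code:

    using Unity.MLAgents;
    using UnityEngine;

    // Minimal sketch of the reward scheme described above.
    public class FighterAgent : Agent
    {
        public Transform closestEnemy; // assumed to be tracked elsewhere
        const float MaxRange = 20f;    // assumed arena scale

        void FixedUpdate()
        {
            // Distance shaping: positive when close, negative when far.
            float d = Vector3.Distance(transform.position, closestEnemy.position);
            AddReward(0.001f * (1f - 2f * Mathf.Clamp01(d / MaxRange)));
        }

        public void OnHitEnemy() { AddReward(0.01f); }       // useful action
        public void OnDefendedAttack() { AddReward(0.01f); } // useful action
        public void OnTookDamage() { AddReward(-0.01f); }    // taking damage

        // Called for every agent on the team when the battle ends.
        public void OnBattleEnd(int result) // -1 lost, 0 draw, +1 won
        {
            SetReward(result);
            EndEpisode();
        }
    }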
     
  2. mamaorha
    Joined: Jun 16, 2015
    Posts: 44
    Are you sure your last reward is -1/0/+1 depending on loss/draw/win?
     
  3. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    The mean reward does go above the value of 1. In any case, it is set to those specific values at the end of the episode.
    Does it strictly have to be +1 to be considered a win and -1 a loss? Or can anything >= 1 be considered a win and anything <= -1 a loss?
     
  4. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    For the ELO to be calculated correctly, the rewards need to be consistent: the losing team's reward is always negative, the winning team's reward is always positive, and both are 0 in a draw.

    Your reward functions may be giving too high a reward to the losing team, which is why your ELO may be decreasing.
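
    For context, this is how a standard Elo update works; ML-Agents derives the win/draw/loss result from the final rewards, which is why the sign convention matters. This is my own illustration, not the trainer's internal code:

    using UnityEngine;

    public static class Elo
    {
        // Standard Elo update: result is 1.0 for a win, 0.5 for a draw,
        // and 0.0 for a loss; k controls how fast ratings move.
        public static float Update(float rating, float opponentRating, float result, float k = 16f)
        {
            float expected = 1f / (1f + Mathf.Pow(10f, (opponentRating - rating) / 400f));
            return rating + k * (result - expected);
        }
    }

    If the "losing" team regularly ends its episodes with a positive final reward, it gets scored as a win, and the ratings drift in the wrong direction.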
     
  5. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    So, if a team loses, you should set all of the agents' rewards on that team to -1. When a team wins, you can set the reward to 1.0f - penalties. This lets the agent know that although it won, it can do better.
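
    A rough sketch of that; the penalties field is a hypothetical running total, not a real ML-Agents API:

    using Unity.MLAgents;

    public class TeamFighterAgent : Agent
    {
        float penalties; // hypothetical running total for the current episode

        public void Penalize(float amount)
        {
            penalties += amount;
            AddReward(-amount);
        }

        public void ResolveBattle(int result) // -1 lost, 0 draw, +1 won
        {
            if (result < 0) SetReward(-1f);      // losers always get -1
            else if (result == 0) SetReward(0f); // draw
            else SetReward(1.0f - penalties);    // winner: it won, but could do better
            penalties = 0f;
            EndEpisode();
        }
    }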
     
  6. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    Thinking about this, most likely that's the issue.

    During an episode, when an agent dies, I do not set its reward to a negative value; instead, I try to set it when all the agents of the team have died (when they lose). Since I am disabling agents that have died, I end up setting their negative reward during their re-activation, which most likely messes things up (the reward gets set for the wrong episode). The kind of fix I mean is sketched below.
    Nevertheless, I will have to test it and come back to you.
    Thank you both for your replies.
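
    For reference, the timing fix, roughly (the penalty value is a placeholder):

    using Unity.MLAgents;
    using UnityEngine;

    public class SquadAgent : Agent
    {
        public void OnDeath()
        {
            // Apply the death penalty now, while this episode is still active;
            // a reward assigned after re-activation lands in the next episode.
            AddReward(-0.5f);            // placeholder penalty value
            gameObject.SetActive(false); // disable only after the reward is recorded
        }
    }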

    On that note, though: how should one design the reward function in this kind of situation, where the agent dies during an episode but may win thanks to the actions of another agent?
    For example, in a 2-versus-2 situation, one agent from the first team dies, but its remaining teammate manages to defeat both opposing agents. The result for the first team is thus a win, but the dead agent's episode reward is still negative.
    Do you just reward agents that won and survived, and simply punish agents for dying?
     
  7. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    Hi @GGCid,
    Funny you should ask about this, as we just had a release that addresses this issue: our research team implemented a novel algorithm to solve it. Take a look at Release 15 of ml-agents.

    Major Changes

    com.unity.ml-agents (C#)
    The BufferSensor and BufferSensorComponent have been added. They allow the Agent to observe a variable number of entities. For an example, see the Sorter environment. (#4909)
    The SimpleMultiAgentGroup class and IMultiAgentGroup interface have been added. These allow Agents to be given rewards and end episodes in groups. For examples, see the Cooperative Push Block, Dungeon Escape and Soccer environments. (#4923)

    ml-agents / ml-agents-envs / gym-unity (Python)
    The MA-POCA trainer has been added. This is a new trainer that enables Agents to learn how to work together in groups. Configure poca as the trainer in the configuration YAML after instantiating a SimpleMultiAgentGroup to use this feature. (#5005)
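
    A minimal usage sketch of the grouping API (my own illustration, not from the release notes; the manager class and hook method are hypothetical). The behavior would then use poca as its trainer in the configuration YAML:

    using System.Collections.Generic;
    using Unity.MLAgents;
    using UnityEngine;

    public class BattleManager : MonoBehaviour
    {
        public List<Agent> teamAgents; // assigned in the Inspector
        SimpleMultiAgentGroup team;

        void Start()
        {
            team = new SimpleMultiAgentGroup();
            foreach (var agent in teamAgents)
                team.RegisterAgent(agent); // rewards and episode ends are now shared
        }

        public void OnBattleEnd(int result) // -1 lost, 0 draw, +1 won
        {
            team.AddGroupReward(result);
            team.EndGroupEpisode(); // ends the episode for every registered agent
        }
    }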
     
    Last edited: Mar 22, 2021
  8. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    That's great!
    I will check this out as well.
    Thank you for your help.
     
  9. mamaorha
    Joined: Jun 16, 2015
    Posts: 44
    Sorry to hijack this thread, but I need clarification on what you wrote here.
    During the game we assign rewards & penalties, and at the conclusion I added a reward of -1/0/+1 based on loss/draw/win. From what I understand here, that's wrong?

    I did it based on this documentation:
    https://github.com/Unity-Technologi...-Configuration-File.md#note-on-reward-signals

    Should I do it like the following:
    1. Give penalties/rewards, keeping the accumulated total aside.
    2. On game conclusion, use SetReward, assigning -1 / 0 based on loss/draw, and SetReward(1-(rewards-penalties)) for the winning team?
     
  10. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    Depending on the environment, you may want to ensure that the winning team knows it can do better, and that it did some things that weren't so great.

    I ran into this with my own experiments training the tanks environment. I would penalize a tank for shooting itself:

    AddReward(-0.005f)

    I give a small reward for hitting another tank:

    AddReward(0.005f)

    and a small penalty just for shooting, to help nudge it to shoot as little as possible:

    AddReward(-0.001f)

    At the end of the round, the loser gets:
    SetReward(-1.0f)

    the winner gets:
    AddReward(1.0f)

    Using AddReward instead of SetReward for the winner gives it that extra information, which helps shape the winners' behavior.

    This isn't necessary for every environment, but it can help in situations where rewards are very sparse.
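
    Pulled together (my paraphrase; the hook methods are hypothetical), the scheme reads something like:

    using Unity.MLAgents;

    public class TankAgent : Agent
    {
        public void OnShoot() { AddReward(-0.001f); }   // nudge toward fewer shots
        public void OnHitSelf() { AddReward(-0.005f); } // penalize shooting itself
        public void OnHitEnemy() { AddReward(0.005f); } // small reward for hits

        public void OnRoundEnd(bool won)
        {
            if (won)
                AddReward(1.0f);  // stacks on top of the shaping already earned this step
            else
                SetReward(-1.0f); // replaces any reward already given this step
            EndEpisode();
        }
    }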
     
  11. mbaske
    Joined: Dec 31, 2017
    Posts: 473
    I have a related question about this (not sure if it warrants its own thread). In most situations, one agent wins and the other loses, so I assign +1/-1 rewards accordingly. But there are also cases where an agent can be disqualified for performing some prohibited action unrelated to the win/lose criteria. If that happens, the agent receives a -1 penalty and the episode ends. However, the opponent hasn't really won; it didn't contribute anything to ending the episode, so it isn't rewarded. As a result, cumulative rewards are slightly below zero on average, and the ELO keeps decreasing.
    How should I change my reward logic to be more in line with self-play's zero-sum assumption, while still being informative as to whether an episode ended in a win/loss or not?
     
  12. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    I had a similar question when I was tinkering with tanks. At some point during training, one policy would just wait for the other to kill itself. This is technically a "win", but not behavior I want to reward. In the end, the discussion concluded that it still makes sense to reward that policy, because it technically did win. In theory, its backing-up behavior will be punished later by a better policy, which will still result in the ELO going up.

    So, even if an older policy does something janky: lose: -1, win: +1.
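
    In code, that treatment might look like this (a hypothetical match manager, not anything in ML-Agents):

    using Unity.MLAgents;

    public static class MatchResolver
    {
        // Even when the episode ends by disqualification, the opponent is
        // treated as the winner so the +1/-1 rewards stay zero-sum.
        public static void OnDisqualified(Agent offender, Agent opponent)
        {
            offender.SetReward(-1f);
            opponent.SetReward(1f); // "technically won"
            offender.EndEpisode();
            opponent.EndEpisode();
        }
    }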
     
    Last edited: Mar 31, 2021
  13. WaxyMcRivers
    Joined: May 9, 2016
    Posts: 59
    Given this statement and the results that mbaske is referring to (the dropping ELO): is it safe to conclude that, for some self-play problems, it's OK/normal to observe a drop in ELO initially?
     
  14. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    I can’t give you a definitive answer here. If your ELO is continually going down, maybe there is something wrong. Or maybe the network is learning the rules of the game, and its exploration of the environment is causing the agent to lose more often. In these cases, my intuition would be to let it train for a while and see what happens over the long run. With sparse reward structures, it can take a lot of time for an agent to hit that “aha!” moment.
     
  15. HQF
    Joined: Aug 28, 2015
    Posts: 40
    Hi! I'm making a turn-based battler game, and I need a bit of clarification about your reply:
    Are you talking about the Agent's reward? I mean, at the end of the episode I need to SetReward for the groups as -1 / 0 / 1 - penalty, and in addition I need to SetReward for each agent in the losing group as -1?

    For now my reward system is:
    - An existential penalty, to push the agent to reach the result with fewer actions
    - A +0.1 reward when the agent attacks an agent of the other team
    - A -0.05 penalty if the agent was attacked
    - A +0.5 reward when the agent destroys another agent
    - A -0.25 penalty if the agent was destroyed

    And for the groups I use this reward at the end of the episode:
    SetReward(-1 / 0 / 1 - penalty) (based on loss/draw/win)

    The penalty is currStep / maxSteps.
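
    In code, the end-of-episode part looks roughly like this (assuming a SimpleMultiAgentGroup; the class and field names are placeholders):

    using Unity.MLAgents;
    using UnityEngine;

    public class TurnBasedBattle : MonoBehaviour
    {
        SimpleMultiAgentGroup group = new SimpleMultiAgentGroup();
        public int currStep, maxSteps;

        public void OnBattleEnd(int result) // -1 lost, 0 draw, +1 won
        {
            float penalty = (float)currStep / maxSteps;
            if (result > 0) group.SetGroupReward(1f - penalty); // win, minus time penalty
            else if (result == 0) group.SetGroupReward(0f);     // draw
            else group.SetGroupReward(-1f);                     // loss
            group.EndGroupEpisode();
        }
    }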
     
  16. KaushalAgrawal
    Joined: Dec 18, 2019
    Posts: 8
    Hey, I am making a 4-player turn-based card game, played as two teams of 2 vs 2.
    Both teams are in different groups but share the same behaviour.

    While training, my ELO increases with a positive mean group reward, but at step 200,000, when there is a team swap, it starts decreasing rapidly with a negative mean group reward.