
ELO decreasing with positive mean reward

Discussion in 'ML-Agents' started by GGCid, Mar 22, 2021.

  1. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    Hello,

    I have set up a competitive environment with two teams of agents, and while the mean reward has increased over time, reaching satisfactory results, the ELO has been (almost linearly) decreasing.
    Even though the mean rewards of the agent are consistently positive (although the standard deviation is high too), the ELO score of the agent keeps decreasing.
    Is it common to see the ELO decrease with positive mean rewards?
    Is it bad that the ELO is decreasing, or can it be ignored given that the agent has learned to act in the environment?

    As for the environment and the reward function: there are two teams of agents in a fight, with the following rewards (a sketch of this scheme follows the list):
    • A reward depending on the distance between the agent and its closest enemy (negative if it is far away, positive if it is close)
    • A positive reward for useful actions, such as hitting an enemy or defending against an attack
    • A negative reward when taking damage
    • An episode reward set depending on the result of the battle: -1 if the team lost, 0 if it was a draw, and +1 if the team won
    Although the mean reward surpasses 1 and the agents have learned to a satisfactory degree, as mentioned above, the ELO keeps decreasing.
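
    For concreteness, here is a minimal sketch of that reward scheme; the hook methods, shaping constants, and MaxRange are placeholders of mine, not the exact code:

    using Unity.MLAgents;
    using UnityEngine;

    // Minimal sketch of the reward scheme described above.
    public class FighterAgent : Agent
    {
        public Transform closestEnemy; // assumed to be tracked elsewhere
        const float MaxRange = 20f;    // assumed arena scale

        void FixedUpdate()
        {
            // Distance shaping: positive when close, negative when far.
            float d = Vector3.Distance(transform.position, closestEnemy.position);
            AddReward(0.001f * (1f - 2f * Mathf.Clamp01(d / MaxRange)));
        }

        public void OnHitEnemy() { AddReward(0.01f); }       // useful action
        public void OnDefendedAttack() { AddReward(0.01f); } // useful action
        public void OnTookDamage() { AddReward(-0.01f); }    // taking damage

        // Called for every agent on the team when the battle ends.
        public void OnBattleEnd(int result) // -1 lost, 0 draw, +1 won
        {
            SetReward(result);
            EndEpisode();
        }
    }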
     
  2. mamaorha
    Joined: Jun 16, 2015
    Posts: 44
    Are you sure your last reward is -1/0/+1 depending on loss/draw/win?
     
  3. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    The mean reward does go above the value of 1. In any case, it is set to those specific values at the end of the episode.
    Does it strictly have to be +1 to be considered a win and -1 a loss? Or can anything >= 1 be considered a win and anything <= -1 a loss?
     
  4. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    For the ELO to be calculated correctly, the rewards need to be consistent: the losing team's reward is always negative, the winning team's reward is always positive, and both are 0 in a draw.

    Your reward functions may be giving too high a reward to the losing team, which is why your ELO may be decreasing.
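
    For context, this is how a standard Elo update works; ML-Agents derives the win/draw/loss result from the final rewards, which is why the sign convention matters. This is my own illustration, not the trainer's internal code:

    using UnityEngine;

    public static class Elo
    {
        // Standard Elo update: result is 1.0 for a win, 0.5 for a draw,
        // and 0.0 for a loss; k controls how fast ratings move.
        public static float Update(float rating, float opponentRating, float result, float k = 16f)
        {
            float expected = 1f / (1f + Mathf.Pow(10f, (opponentRating - rating) / 400f));
            return rating + k * (result - expected);
        }
    }

    If the "losing" team regularly ends its episodes with a positive final reward, it gets scored as a win, and the ratings drift in the wrong direction.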
     
  5. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    So, if a team loses, you should set all of the agents' rewards on that team to -1. When a team wins, you can set the reward to 1.0f - penalties. This lets the agent know that although it won, it can do better.
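
    A rough sketch of that; the penalties field is a hypothetical running total, not a real ML-Agents API:

    using Unity.MLAgents;

    public class TeamFighterAgent : Agent
    {
        float penalties; // hypothetical running total for the current episode

        public void Penalize(float amount)
        {
            penalties += amount;
            AddReward(-amount);
        }

        public void ResolveBattle(int result) // -1 lost, 0 draw, +1 won
        {
            if (result < 0) SetReward(-1f);      // losers always get -1
            else if (result == 0) SetReward(0f); // draw
            else SetReward(1.0f - penalties);    // winner: it won, but could do better
            penalties = 0f;
            EndEpisode();
        }
    }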
     
  6. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    Thinking about this, most likely that's the issue.

    During an episode, when an agent dies, I do not set its reward to a negative value; instead, I try to set it when all the agents of the team have died (when they lose). Since I am disabling agents that have died, I end up setting their negative reward during their re-activation, which most likely messes things up (the reward gets set for the wrong episode). The kind of fix I mean is sketched below.
    Nevertheless, I will have to test it and come back to you.
    Thank you both for your replies.
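
    For reference, the timing fix, roughly (the penalty value is a placeholder):

    using Unity.MLAgents;
    using UnityEngine;

    public class SquadAgent : Agent
    {
        public void OnDeath()
        {
            // Apply the death penalty now, while this episode is still active;
            // a reward assigned after re-activation lands in the next episode.
            AddReward(-0.5f);            // placeholder penalty value
            gameObject.SetActive(false); // disable only after the reward is recorded
        }
    }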

    On that note, though: how should one design the reward function in this kind of situation, where the agent dies during an episode but may win thanks to the actions of another agent?
    For example, in a 2-versus-2 situation, one agent from the first team dies, but its remaining teammate manages to defeat both opposing agents. The result for the first team is thus a win, but the dead agent's episode reward is still negative.
    Do you just reward agents that won and survived, and simply punish agents for dying?
     
  7. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    Hi @GGCid,
    Funny you should ask about this, as we just had a release that addresses this issue: our research team implemented a novel algorithm to solve it. Take a look at Release 15 of ml-agents.

    Major Changes

    com.unity.ml-agents (C#)
    The BufferSensor and BufferSensorComponent have been added. They allow the Agent to observe a variable number of entities. For an example, see the Sorter environment. (#4909)
    The SimpleMultiAgentGroup class and IMultiAgentGroup interface have been added. These allow Agents to be given rewards and end episodes in groups. For examples, see the Cooperative Push Block, Dungeon Escape and Soccer environments. (#4923)

    ml-agents / ml-agents-envs / gym-unity (Python)
    The MA-POCA trainer has been added. This is a new trainer that enables Agents to learn how to work together in groups. Configure poca as the trainer in the configuration YAML after instantiating a SimpleMultiAgentGroup to use this feature. (#5005)
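
    A minimal usage sketch of the grouping API (my own illustration, not from the release notes; the manager class and hook method are hypothetical). The behavior would then use poca as its trainer in the configuration YAML:

    using System.Collections.Generic;
    using Unity.MLAgents;
    using UnityEngine;

    public class BattleManager : MonoBehaviour
    {
        public List<Agent> teamAgents; // assigned in the Inspector
        SimpleMultiAgentGroup team;

        void Start()
        {
            team = new SimpleMultiAgentGroup();
            foreach (var agent in teamAgents)
                team.RegisterAgent(agent); // rewards and episode ends are now shared
        }

        public void OnBattleEnd(int result) // -1 lost, 0 draw, +1 won
        {
            team.AddGroupReward(result);
            team.EndGroupEpisode(); // ends the episode for every registered agent
        }
    }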
     
    Last edited: Mar 22, 2021
  8. GGCid
    Joined: Sep 4, 2019
    Posts: 4
    That's great!
    I will check this out as well.
    Thank you for your help.
     
  9. mamaorha
    Joined: Jun 16, 2015
    Posts: 44
    Sorry to hijack this thread, but I need clarification on what you wrote here.
    During the game we assign rewards & penalties, and at the conclusion I added a reward of -1/0/+1 based on loss/draw/win. From what I understand here, that's wrong?

    I did it based on this documentation:
    https://github.com/Unity-Technologi...-Configuration-File.md#note-on-reward-signals

    Should I do it like the following:
    1. Give penalties/rewards, keeping the accumulated total aside.
    2. On game conclusion, use SetReward, assigning -1 / 0 based on loss/draw, and SetReward(1-(rewards-penalties)) for the winning team?
     
  10. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    Depending on the environment, you may want to ensure that the winning team knows it can do better, and that it did some things that weren't so great.

    I ran into this with my own experiments training the tanks environment. I would penalize a tank for shooting itself:

    AddReward(-0.005f)

    I give a small reward for hitting another tank:

    AddReward(0.005f)

    and a small penalty just for shooting, to help nudge it to shoot as little as possible:

    AddReward(-0.001f)

    At the end of the round, the loser gets:
    SetReward(-1.0f)

    the winner gets:
    AddReward(1.0f)

    Using AddReward instead of SetReward for the winner gives it that extra information, which helps shape the winners' behavior.

    This isn't necessary for every environment, but it can help in situations where rewards are very sparse.
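
    Pulled together (my paraphrase; the hook methods are hypothetical), the scheme reads something like:

    using Unity.MLAgents;

    public class TankAgent : Agent
    {
        public void OnShoot() { AddReward(-0.001f); }   // nudge toward fewer shots
        public void OnHitSelf() { AddReward(-0.005f); } // penalize shooting itself
        public void OnHitEnemy() { AddReward(0.005f); } // small reward for hits

        public void OnRoundEnd(bool won)
        {
            if (won)
                AddReward(1.0f);  // stacks on top of the shaping already earned this step
            else
                SetReward(-1.0f); // replaces any reward already given this step
            EndEpisode();
        }
    }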
     
  11. mbaske
    Joined: Dec 31, 2017
    Posts: 473
    I have a related question about this (not sure if it warrants its own thread). In most situations, one agent wins and the other loses, so I assign +1/-1 rewards accordingly. But there are also cases where an agent can be disqualified for performing some prohibited action unrelated to the win/lose criteria. If that happens, the agent receives a -1 penalty and the episode ends. However, the opponent hasn't really won; it didn't contribute anything to ending the episode, so it isn't rewarded. As a result, cumulative rewards are slightly below zero on average, and the ELO keeps decreasing.
    How should I change my reward logic to be more in line with self-play's zero-sum assumption, while still being informative as to whether an episode ended in a win/loss or not?
     
  12. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    I had a similar question when I was tinkering with tanks. At some point during training, one policy would just wait for the other to kill itself. This is technically a "win", but not behavior I want to reward. In the end, the discussion concluded that it still makes sense to reward that policy, because it technically did win. In theory, its backing-up behavior will be punished later by a better policy, which will still result in the ELO going up.

    So, even if an older policy does something janky: lose: -1, win: +1.
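
    In code, that treatment might look like this (a hypothetical match manager, not anything in ML-Agents):

    using Unity.MLAgents;

    public static class MatchResolver
    {
        // Even when the episode ends by disqualification, the opponent is
        // treated as the winner so the +1/-1 rewards stay zero-sum.
        public static void OnDisqualified(Agent offender, Agent opponent)
        {
            offender.SetReward(-1f);
            opponent.SetReward(1f); // "technically won"
            offender.EndEpisode();
            opponent.EndEpisode();
        }
    }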
     
    Last edited: Mar 31, 2021
  13. WaxyMcRivers
    Joined: May 9, 2016
    Posts: 59
    Given this statement and the results that mbaske is referring to (the dropping ELO): is it safe to conclude that, for some self-play problems, it's OK/normal to observe a drop in ELO initially?
     
  14. christophergoy
    Unity Technologies
    Joined: Sep 16, 2015
    Posts: 735
    I can’t give you a definitive answer here. If your ELO is continually going down, maybe there is something wrong. Or maybe the network is learning the rules of the game, and its exploration of the environment is causing the agent to lose more often. In these cases, my intuition would be to let it train for a while and see what happens over the long run. With sparse reward structures, it can take a lot of time for an agent to hit that “aha!” moment.
     
  15. HQF
    Joined: Aug 28, 2015
    Posts: 40
    Hi! I'm making a turn-based battler game, and I need a bit of clarification about your reply:
    Are you talking about the Agent's reward? I mean, at the end of the episode I need to SetReward for the groups as -1 / 0 / 1 - penalty, and in addition I need to SetReward for each agent in the losing group as -1?

    For now my reward system is:
    - An existential penalty, to push the agent to reach the result with fewer actions
    - A +0.1 reward when the agent attacks an agent of the other team
    - A -0.05 penalty if the agent was attacked
    - A +0.5 reward when the agent destroys another agent
    - A -0.25 penalty if the agent was destroyed

    And for the groups I use this reward at the end of the episode:
    SetReward(-1 / 0 / 1 - penalty) (based on loss/draw/win)

    The penalty is currStep / maxSteps.
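
    In code, the end-of-episode part looks roughly like this (assuming a SimpleMultiAgentGroup; the class and field names are placeholders):

    using Unity.MLAgents;
    using UnityEngine;

    public class TurnBasedBattle : MonoBehaviour
    {
        SimpleMultiAgentGroup group = new SimpleMultiAgentGroup();
        public int currStep, maxSteps;

        public void OnBattleEnd(int result) // -1 lost, 0 draw, +1 won
        {
            float penalty = (float)currStep / maxSteps;
            if (result > 0) group.SetGroupReward(1f - penalty); // win, minus time penalty
            else if (result == 0) group.SetGroupReward(0f);     // draw
            else group.SetGroupReward(-1f);                     // loss
            group.EndGroupEpisode();
        }
    }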
     
  16. KaushalAgrawal
    Joined: Dec 18, 2019
    Posts: 8
    Hey, I am making a 4-player turn-based card game, played as two teams of 2 vs 2.
    Both teams are in different groups but share the same behaviour.

    While training, my ELO increases with a positive mean group reward, but at step 200,000, when there is a team swap, it starts decreasing rapidly with a negative mean group reward.