ELO calculation in an ML-Agents self play training process

Discussion in 'ML-Agents' started by Streiicher, Dec 16, 2020.

  1. Streiicher

    Joined:
    Sep 26, 2020
    Posts:
    2
    I'm looking for some help understanding the Elo results in the training process of the following game I created:

    The game is a symmetric 17+4 variant for 2 players (a Blackjack variant played with 32 cards, with card values that differ from Blackjack and with no dealer).

    In the first example, the values of the 8 cards in every suit are 1,2,3,4,5,6,7,8. This seems to work well: PPO with self-play produces these graphs for the mean reward and Elo:

    p2.png p1.png

    If I play against the resulting brain model, it plays very well. I couldn't detect any mistakes made by the model.

    In the second example, the values of the 8 cards in every suit are 2,3,4,7,8,9,10,11. The same PPO / self-play training now produces these graphs for the mean reward and Elo:

    p4.png p3.png

    There is a small bias in the cumulative rewards and, after roughly 1 million steps, a decrease in Elo.
    But: the code has been reviewed intensively, so I am quite confident that the bias isn't caused by my code. And: the trained model again plays very well against a human player, so I am puzzled by the decreasing Elo.

    My questions: Do you have an idea what could cause such effects? Where can I find really detailed documentation of the training process and the Elo calculation?

    Thanks a lot!
     
  2. christophergoy

    Unity Technologies

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @Streiicher,
    You can find the documentation for self-play here. I will reach out to someone on our research team to see if they can answer your questions.
    Cheers,
    Chris
     
  3. andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Hi @Streiicher

    It looks like the final rewards are negative. The ELO calculation assumes that the final reward determines the winner, i.e. a positive reward indicates a win, a negative reward a loss, and zero a draw. So what appears to be happening is that the agent is always 'losing'.

    If it doesn't seem possible to specify a reward function that satisfies this for your game, please let me know and we can try to help.
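
    To illustrate the assumption described above, here is a minimal sketch of a standard Elo update driven only by the sign of the final reward. It is not taken from the ML-Agents source: the function names, the K-factor of 16 and the starting rating of 1200 are assumptions chosen for illustration.

```python
def result_from_final_reward(final_reward: float) -> float:
    """Map the sign of the final reward to an Elo score (assumed convention):
    positive -> win (1.0), negative -> loss (0.0), zero -> draw (0.5)."""
    if final_reward > 0:
        return 1.0
    if final_reward < 0:
        return 0.0
    return 0.5


def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score of a player against an opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def updated_rating(rating: float, opponent_rating: float,
                   final_reward: float, k: float = 16.0) -> float:
    """Return the new rating after one episode.

    k is the Elo K-factor; 16 is an illustrative value, not a documented
    ML-Agents setting.
    """
    score = result_from_final_reward(final_reward)
    return rating + k * (score - expected_score(rating, opponent_rating))


if __name__ == "__main__":
    # Equal ratings: a win adds k/2, a loss subtracts k/2, a draw changes nothing.
    print(updated_rating(1200.0, 1200.0, +1.0))  # 1208.0
    print(updated_rating(1200.0, 1200.0, -1.0))  # 1192.0
    print(updated_rating(1200.0, 1200.0, 0.0))   # 1200.0
```

    Under these assumptions, an agent whose last recorded reward is negative in most episodes is scored as losing most of the time, so its rating drifts downward even if it actually plays well.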
     
  4. Streiicher

    Joined:
    Sep 26, 2020
    Posts:
    2
    Hi @andrewcoh_unity, thx for your reply. The reward structure is: 1 for a win and -1 for a loss of the agent, 0 for a draw. In the documentation Christopher sent earlier I cannot find anything about the ELO calculation. Can you give me a hint about where to look this up?
    Best, Martin
     
  5. andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162