Constant ELO during self-play training

Discussion in 'ML-Agents' started by niels_modlai, Mar 6, 2020.

  1. niels_modlai

    Joined:
    Oct 8, 2019
    Posts:
    5
    The ELO doesn't increase from the initial value of 1200 when I run self-play. What could be the reason?
    Code (CSharp):
    INFO:mlagents.trainers: ppo: My Behavior: Step: 5000. Time Elapsed: 184.189 s Mean Reward: -3.635. Std of Reward: 1.030. Training.
    INFO:mlagents.trainers: Learning brain My Behavior?team=1 ELO: 1200.000
    Mean Opponent ELO: 1200.000 Std Opponent ELO: 0.000
    INFO:mlagents.trainers: ppo: My Behavior: Step: 10000. Time Elapsed: 349.838 s Mean Reward: -6.406. Std of Reward: 6.074. Training.
    INFO:mlagents.trainers: Learning brain My Behavior?team=1 ELO: 1200.000
    Mean Opponent ELO: 1200.000 Std Opponent ELO: 0.000
    INFO:mlagents.trainers: ppo: My Behavior: Step: 15000. Time Elapsed: 519.350 s Mean Reward: -2.528. Std of Reward: 2.163. Training.
    INFO:mlagents.trainers: Learning brain My Behavior?team=1 ELO: 1200.000
    Mean Opponent ELO: 1200.000 Std Opponent ELO: 0.000
    My config file:
    Code (CSharp):
    default:
        trainer: ppo
        batch_size: 1024
        beta: 5.0e-3
        buffer_size: 10240
        epsilon: 0.2
        hidden_units: 128
        lambd: 0.95
        learning_rate: 3.0e-4
        learning_rate_schedule: constant
        max_steps: 5.0e5
        memory_size: 128
        normalize: false
        num_epoch: 3
        num_layers: 2
        time_horizon: 64
        sequence_length: 64
        summary_freq: 5000
        use_recurrent: false
        vis_encode_type: simple
        reward_signals:
            extrinsic:
                strength: 1.0
                gamma: 0.99
        self_play:
            window: 10
            play_against_current_self_ratio: 0.5
            save_steps: 5000
            swap_steps: 5000
    My behavior settings: [screenshot attachment not shown]

     
  2. mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Same here, ELO stays constant at 1200.

    Code (CSharp):
    (ml-agents14_1) C:\Users\mathias\Documents\Unity Projects\ml-agents-14.1\Project>mlagents-learn config/trainer_config.yaml --train --run-id=football_sp_2x256 --keep-checkpoints=100 --base-port=5006 --env=build/football.exe --no-graphics --num-envs=8 --time-scale=50  --load
    WARNING:tensorflow:From c:\users\mathias\anaconda3\envs\ml-agents14_1\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term

    [Unity ML-Agents ASCII logo]

    Version information:
      ml-agents: 0.14.1,
      ml-agents-envs: 0.14.1,
      Communicator API: API-14,
      TensorFlow: 2.0.1
    WARNING:tensorflow:From c:\users\mathias\anaconda3\envs\ml-agents14_1\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term
    [... repeats for every environment instance]
    INFO:mlagents_envs:Connected new brain:
    Football?team=0
    INFO:mlagents_envs:Connected new brain:
    Football?team=1
    [... repeats for every environment instance]
    INFO:mlagents.trainers:Hyperparameters for the GhostTrainer of brain Football:
            trainer:        ppo
            batch_size:     2048
            beta:   0.005
            buffer_size:    20480
            epsilon:        0.2
            hidden_units:   256
            lambd:  0.95
            learning_rate:  0.0003
            learning_rate_schedule: constant
            max_steps:      10.0e7
            memory_size:    256
            normalize:      False
            num_epoch:      3
            num_layers:     2
            time_horizon:   1000
            sequence_length:        64
            summary_freq:   10000
            use_recurrent:  False
            vis_encode_type:        simple
            reward_signals:
              extrinsic:
                strength:   1.0
                gamma:      0.99
            summary_path:   football_sp_2x256_Football
            model_path:     ./models/football_sp_2x256/Football
            keep_checkpoints:       100
            self_play:
              window:       10
              play_against_current_self_ratio:      0.5
              save_steps:   50000
              swap_steps:   50000
    2020-03-07 09:01:21.638211: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    INFO:mlagents.trainers:Loading Model for brain Football?team=1
    INFO:mlagents.trainers:Loading Model for brain Football?team=0
    INFO:mlagents.trainers: football_sp_2x256: Football: Step: 2660000. No episode was completed since last summary. Training.
    INFO:mlagents.trainers: Learning brain Football?team=1 ELO: 1200.000
    ...
     

  3. niels_modlai

    Joined:
    Oct 8, 2019
    Posts:
    5
    I found my issue while debugging trainers/ghost/trainer.py:_process_trajectory(). I wasn't careful enough with my reward signal, so a winner was never detected. Maybe it's the same issue for you, mbaske?
     
  4. mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Did you actually send a reward before? I'm setting +1 for a win and -1 for a loss, but I'm not seeing any ELO change.
     
  5. WalkLearner

    Joined:
    Mar 12, 2020
    Posts:
    10
    I don't know if you have already solved the issue. To make ELO work, the last reward each team receives in an episode has to encode the match result: +1 for winning, -1 for losing, and 0 for a draw.
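    For reference, here is roughly what that looks like in agent code. This is only a minimal sketch assuming the 0.14.x C# API (SetReward/Done; later releases renamed Done to EndEpisode), and the class and field names are made up for illustration:
    Code (CSharp):
    using MLAgents;      // ml-agents 0.14.x namespace (later releases use Unity.MLAgents)
    using UnityEngine;

    public class MatchController : MonoBehaviour
    {
        public Agent teamAAgent;   // hypothetical references to the two competing agents
        public Agent teamBAgent;

        // Call once when the match is decided: result > 0 means team A won,
        // result < 0 means team B won, result == 0 means a draw.
        public void EndMatch(int result)
        {
            // Set the result reward (+1 / -1 / 0) as the last reward of the episode;
            // this is what the self-play trainer uses to update ELO.
            teamAAgent.SetReward(result > 0 ? 1f : result < 0 ? -1f : 0f);
            teamBAgent.SetReward(result < 0 ? 1f : result > 0 ? -1f : 0f);

            teamAAgent.Done();   // Done() in 0.14.x; EndEpisode() in later releases
            teamBAgent.Done();
        }
    }
    Intermediate shaping rewards during the episode should still be fine; as far as I can tell it is the reward on the final step that decides the win/loss/draw classification, so make sure nothing overwrites it after the result is set.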