Question: Elo stuck at 1200 in self-play scenario despite correct reward signals (ml-agents 0.14.1)

Discussion in 'ML-Agents' started by rvhKrish, Oct 27, 2022.

  1. rvhKrish

    rvhKrish

    Joined:
    Sep 30, 2022
    Posts:
    2
    Hello y'all,

    With their permission, I am currently attempting to use forum user mbaske's code (found at https://github.com/mbaske/ml-table-football) to train a neural-network model that our team can then deploy onto a physical, mechanised foosball table. (I apologise if asking for support with ml-agents 0.14.1 is frowned upon, but as an electrical engineering major with no ML/Unity experience, mbaske's old project is the best kick-off point for us right now. I hope the issue is one of approach and not a version-specific concern. Another team member is currently putting together a single-rod solution with the latest version, mirroring the approach of the KIcker team, described at https://www.engineering.com/story/the-kicker-story-foosball-and-deep-reinforcement-learning.)

    To run this project, I've created a virtual environment with Python 3.7.7, ml-agents 0.14.1, and Unity 2019.3.3f. The issue I am facing is that despite correct reward signalling (to my knowledge, the last reward signal sent is either a 1 for a win, 0 for a draw, or -1 for a loss), there is no change in Elo. I made only the following change in mbaske's code to try to ensure this:
    [attached screenshot upload_2022-10-27_12-55-31.png: the code change]
    Besides this, the only reward signalling I currently have enabled is "OnGoal": when an agent is scored on, it receives a -1, and when an agent scores, it receives a +1. Beyond that, I have only been adding debug logs and logger messages to try to understand what's going on under the hood. I found the following in /ghost/trainer.py:
    [attached screenshot upload_2022-10-27_12-48-22.png: the relevant code in /ghost/trainer.py]
    It appears that this is where the Elo calculation is done, but the flag I put there is never printed, which leads me to believe the "trajectory done" condition is never reached, or only reached at the maximum step. I'm not quite sure what a trajectory refers to despite my research; I assume it has to do with /trajectory.py. I also suspect the issue is not in ml-agents 0.14.1 on the Python side, but on the C# and Unity side, i.e. in the code using the library.
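    For reference, this is roughly how I understand the terminal reward and episode end are supposed to be signalled on the C# side with the 0.14-era Agent API. This is my own sketch rather than mbaske's actual OnGoal code, and the method names are from the old API docs, so please correct me if I have this wrong:

    Code (CSharp):
        using MLAgents; // 0.14-era namespace (newer releases use Unity.MLAgents)

        // Illustrative agent subclass only - the real OnGoal handling may differ.
        public class FoosballAgentSketch : Agent
        {
            // Assumed callback, invoked by the environment when a goal is scored.
            public void OnGoal(bool scoredOnThisAgent)
            {
                // Terminal, zero-sum reward: +1 for scoring, -1 for being scored on.
                SetReward(scoredOnThisAgent ? -1f : 1f);

                // In 0.14, Done() ends the agent's episode. If this is never called,
                // the Python side never receives a finished ("done") trajectory, and
                // my understanding is that Elo only updates on finished trajectories.
                Done();
            }
        }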
    Here is an example of a training session from Tensorboard:

    [attached screenshot upload_2022-10-27_12-51-2.png: TensorBoard graphs from the training session]

    The beginning of some graphs is really strange to me: there seem to be multiple y-values for a given value of x. The cumulative reward starts negative yet seems to zero out, which I also don't quite understand; I take this to be a result of the zero-sum nature of 1v1 self-play with the same agent, or something along those lines. Here it is clear the Elo never changes.

    This is an example of the LOGGER output:
    [attached screenshot upload_2022-10-27_12-54-50.png: LOGGER output]

    Please let me know if there is more useful information I can provide - I would sincerely appreciate any help. Also, the model does not seem to be getting "better" at foosball; at least, a model trained for 10,000 steps seems to go even with the 1,000,000-step model.
     

  2. smallg2023

    smallg2023

    Joined:
    Sep 2, 2018
    Posts:
    144
    What does your .yaml look like? It seems like you might not be swapping teams, which could be why you're not learning anything. Have you tried the team-based demos to see if they train for you?
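    For reference, in 0.14 the self-play settings live in a self_play block under your behaviour name in the trainer config, roughly like this (the behaviour name is a placeholder, the numbers are just example values, and the parameter names are from memory of the old docs, so double-check them against your version):

    Code (YAML):
        Foosball:                                    # placeholder - use your actual behaviour name
            trainer: ppo
            max_steps: 5.0e6
            # (other PPO hyperparameters are inherited from the default section)
            self_play:
                window: 10                           # size of the pool of past snapshots to sample opponents from
                play_against_current_self_ratio: 0.5 # chance of playing the current policy instead of a snapshot
                save_steps: 50000                    # steps between saving opponent snapshots
                swap_steps: 50000                    # steps between swapping in a different opponent snapshot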
     
  3. rvhKrish

    rvhKrish

    Joined:
    Sep 30, 2022
    Posts:
    2
  4. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Hi Krish,
    I'm not sure what's going on with the Elo value. Unfortunately, I don't have an ML-Agents environment set up at the moment and can't test this. Using only symmetrical rewards should result in the cumulative reward being 0. If I remember correctly, having negative cumulative rewards in self-play can mess with the Elo. I think the double plotting of y-values in TensorBoard was caused by keeping data from previous training runs around, back with the older ML-Agents versions. You probably need to delete the old folder contents before starting a new run.
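    From what I remember, the ghost trainer only maps the final reward of a finished episode to a game result (win/draw/loss by its sign) and then applies a standard Elo update. Just to illustrate the math - this is a sketch, not the actual implementation (which is Python in ghost/trainer.py), and the K-factor here is a placeholder:

    Code (CSharp):
        using UnityEngine;

        // Sketch of the standard Elo update, for illustration only.
        public static class EloUpdateSketch
        {
            // result: 1 = win, 0.5 = draw, 0 = loss for the learning team,
            // derived (as far as I recall) from the sign of the final reward.
            public static float UpdatedRating(float rating, float opponentRating,
                                              float result, float k = 16f)
            {
                // Expected score of the learning team against the opponent snapshot.
                float expected = 1f / (1f + Mathf.Pow(10f, (opponentRating - rating) / 400f));
                return rating + k * (result - expected);
            }
        }

    So if episodes never actually end from the trainer's point of view, or always end as draws, the rating never moves away from the initial 1200.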
     
  5. OmarVector

    OmarVector

    Joined:
    Apr 18, 2018
    Posts:
    130
    I have a similar issue, but I don't understand it either.

    On each swap, the Mean Group Reward changes to negative and the Elo drops; then on the next swap the Mean Group Reward becomes positive and the Elo goes up... and it keeps doing this for the whole life cycle of the training.

    Does anyone have an explanation for this, and how can I prevent it?
     
  6. GamerLordMat

    GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
    I had the exact same bug, but unfortunately I really don't know how it was fixed:

    How do you reference your agents? Is it hard-coded, or do you get the reference by code? Some Find functions seem to behave randomly.
    I set swap_steps: 250000000 to avoid swapping for single self-play.

    I also had bugs in my code where the observations were not symmetric, leading the agent to be confused after every step.
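    To clarify what I mean by symmetric: both teams should observe the field from their own point of view, so that the same situation looks the same to either policy. Just as an illustration - this is not the foosball project's code, and it is written against the old 0.14 API from memory:

    Code (CSharp):
        using UnityEngine;
        using MLAgents; // 0.14-era namespace

        // Illustrative agent - the field objects are assumed to be assigned in the Inspector.
        public class MirroredObsAgentSketch : Agent
        {
            public Transform ball;
            public Transform ownGoal;      // this team's goal
            public Transform opponentGoal; // the other team's goal

            public override void CollectObservations()
            {
                // Express positions in this team's own goal frame, so e.g. "ball near
                // my goal" produces the same observation for both teams.
                AddVectorObs(ownGoal.InverseTransformPoint(ball.position));
                AddVectorObs(ownGoal.InverseTransformPoint(opponentGoal.position));
            }
        }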
     
  7. OmarVector

    OmarVector

    Joined:
    Apr 18, 2018
    Posts:
    130
    I have a system that initializes the agent prefabs, and it supports team selection too.

    At the moment, I'm trying to remove some extra observations I added to detect more objects (instead of just agents and balls) and see if that makes any difference.
     
  8. GamerLordMat

    GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
    Please tell us if you find your bug. There are so many things that can go wrong, like unoptimized hyperparameters, bugs in the code, or simply an agent that can't learn the task.
     
  9. OmarVector

    OmarVector

    Joined:
    Apr 18, 2018
    Posts:
    130
    I set all observations to be the same as the original ones; same issue.

    Now I will try serializing the agents (instead of initializing them at runtime) and see if that makes any difference.