Question: ML-agent not improving at all

Discussion in 'ML-Agents' started by Forestherd, Nov 29, 2020.

  1. Forestherd


    Joined:
    May 1, 2016
    Posts:
    4
    Hello!

    I am a university student currently working on my thesis, which involves creating a volleyball-esque game and adding an opponent AI using Unity's ML-Agents. At the moment I have all of the basic functionality in the game - the players can move, jump, dash in any of the cardinal directions, interact with the ball, and score points. I've set up an environment for teaching a model to play the game, but I have had no luck getting a halfway decently working AI opponent - the mean reward never meaningfully increases! So here I am, asking for assistance.

    I'll add my current code below. Right now my reward function gives a small amount of points depending on how long the ball stays in play, and gives increasing points the closer the agent is to the ball, to incentivize interacting with it. The reward structure will probably change once I get one decent result, but right now the agent isn't even able to keep the ball in play for any reasonable amount of time. Considering the goal is to create an AI opponent capable of really playing the game and scoring points, this isn't a very good result.

    PlayerAgent.cs
    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine.InputSystem;

    public class PlayerAgent : Agent
    {
        public Ball ball;
        public Player otherPlayer;
        public Player agentPlayer;
        public bool XFlipped;

        private float xFlipMul;
        private float accountedPoints;

        public override void CollectObservations(VectorSensor sensor)
        {
            // Agent
            var pos = agentPlayer.transform.localPosition;
            sensor.AddObservation(pos.x * xFlipMul);
            sensor.AddObservation(pos.y);

            var vel = agentPlayer.velocity.current;
            sensor.AddObservation(vel.x * xFlipMul);
            sensor.AddObservation(vel.y);

            // Opponent
            pos = otherPlayer.transform.localPosition;
            sensor.AddObservation(pos.x * xFlipMul);
            sensor.AddObservation(pos.y);

            vel = otherPlayer.velocity.current;
            sensor.AddObservation(vel.x * xFlipMul);
            sensor.AddObservation(vel.y);

            // Ball
            pos = ball.transform.localPosition;
            sensor.AddObservation(pos.x * xFlipMul);
            sensor.AddObservation(pos.y);

            vel = ball.velocity.current;
            sensor.AddObservation(vel.x * xFlipMul);
            sensor.AddObservation(vel.y);
        }

        public override void OnActionReceived(float[] vectorAction)
        {
            // Map the continuous actions to movement/jump/dash inputs.
            Vector2 movement = new Vector2();
            movement.x = Mathf.Abs(vectorAction[0]) > .5f ? vectorAction[0] * xFlipMul : 0;
            movement.y = Mathf.Abs(vectorAction[1]) > .5f ? vectorAction[1] : 0;
            agentPlayer.inputManager.SetMovementKey(movement);
            agentPlayer.inputManager.SetJumpKey(vectorAction[2]);
            agentPlayer.inputManager.SetDashKey(vectorAction[3]);

            float checkPoints = ball.Game.leftPoint;
            float otherPoints = ball.Game.rightPoint;

            if (agentPlayer.Game.Player2.Equals(agentPlayer))
            {
                checkPoints = ball.Game.rightPoint;
                otherPoints = ball.Game.leftPoint;
            }

            // Reached target: the agent scored a point, end the episode.
            if (checkPoints > accountedPoints)
            {
                EndEpisode();
                accountedPoints = checkPoints;
            }

            // Small reward every step the ball stays in play.
            AddReward(0.01f);

            // Shaping reward that grows as the agent gets closer to the ball.
            float dist = (agentPlayer.transform.position - ball.transform.position).magnitude;
            float threshold = 5f;
            if (dist < threshold)
            {
                AddReward(0.2f * Time.deltaTime * (1 - dist / threshold));
            }
        }

        public override void Heuristic(float[] actionsOut)
        {
            // Keyboard controls for manual testing.
            actionsOut[0] = (Keyboard.current.rightArrowKey.isPressed ? 1 : 0) - (Keyboard.current.leftArrowKey.isPressed ? 1 : 0);
            actionsOut[1] = (Keyboard.current.upArrowKey.isPressed ? 1 : 0) - (Keyboard.current.downArrowKey.isPressed ? 1 : 0);
            actionsOut[2] = Keyboard.current.zKey.isPressed ? 1 : 0;
            actionsOut[3] = Keyboard.current.xKey.isPressed ? 1 : 0;
        }

        public override void OnEpisodeBegin()
        {
            xFlipMul = XFlipped ? -1f : 1f;

            ball.Game.Reset(xFlipMul);
            ball.Game.leftPoint = 0;
            ball.Game.rightPoint = 0;
            accountedPoints = 0;
        }
    }

    configuration.yaml
    Code (YAML):
    default_settings: null
    behaviors:
      PlayerBehaviour:
        trainer_type: ppo
        hyperparameters:
          batch_size: 32
          buffer_size: 512
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.99
          num_epoch: 500
          learning_rate_schedule: constant
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
          memory: null
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
          curiosity:
            gamma: 0.99
            strength: 0.02
            encoding_size: 256
            learning_rate: 0.0003
        init_path: null
        keep_checkpoints: 5
        checkpoint_interval: 500000
        max_steps: 10000000
        time_horizon: 32
        summary_freq: 10000
        threaded: true
        self_play:
          save_steps: 10000
          team_change: 20000
          swap_steps: 2000
          window: 20
          play_against_latest_model_ratio: 0.5
          initial_elo: 1200.0
        behavioral_cloning: null
        framework: tensorflow
    env_settings:
      env_path: null
      env_args: null
      base_port: 5005
      num_envs: 1
      seed: -1
    engine_settings:
      width: 84
      height: 84
      quality_level: 5
      time_scale: 20
      target_frame_rate: -1
      capture_frame_rate: 60
      no_graphics: false
    environment_parameters: null
    checkpoint_settings:
      run_id: 22Nov
      initialize_from: null
      load_model: false
      resume: true
      force: false
      train_model: false
      inference: false
    debug: false
    TensorBoard: [attached screenshot: upload_2020-11-29_17-54-3.png]

    Here is a link to the repo of the project, if you wish to try and test things out on your own or see how the game works: https://github.com/TanelMarran/Voll-AI

    I am using Unity Version 2019.4.15f1.

    Here are all of the dependencies in the python venv I use to train my models:
    Code (Boo):
    Package                Version
    ---------------------- ---------
    absl-py                0.11.0
    astunparse             1.6.3
    attrs                  20.3.0
    cachetools             4.1.1
    cattrs                 1.0.0
    certifi                2020.11.8
    chardet                3.0.4
    cloudpickle            1.6.0
    future                 0.18.2
    gast                   0.3.3
    google-auth            1.23.0
    google-auth-oauthlib   0.4.2
    google-pasta           0.2.0
    grpcio                 1.33.2
    gym                    0.17.3
    gym-unity              0.21.1
    h5py                   2.10.0
    idna                   2.10
    Keras-Preprocessing    1.1.2
    Markdown               3.3.3
    mlagents               0.21.1
    mlagents-envs          0.21.1
    numpy                  1.18.0
    oauthlib               3.1.0
    opt-einsum             3.3.0
    Pillow                 8.0.1
    pip                    20.2.4
    protobuf               3.14.0
    pyasn1                 0.4.8
    pyasn1-modules         0.2.8
    pyglet                 1.5.0
    pypiwin32              223
    pywin32                300
    PyYAML                 5.3.1
    requests               2.25.0
    requests-oauthlib      1.3.0
    rsa                    4.6
    scipy                  1.5.4
    setuptools             49.2.1
    six                    1.15.0
    tensorboard            2.4.0
    tensorboard-plugin-wit 1.7.0
    tensorflow             2.3.1
    tensorflow-estimator   2.3.0
    termcolor              1.1.0
    urllib3                1.26.2
    Werkzeug               1.0.1
    wheel                  0.35.1
    wrapt                  1.12.1
    If there is any other info you would like me to share, please let me know. Thank you in advance!
     
  2. mattinjersey


    Joined:
    Dec 3, 2016
    Posts:
    42
    I wonder if you should give a negative reward when the agent drops the ball.
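    Something roughly like this, perhaps (just a sketch; OnBallHitGroundOnMySide is a made-up hook you'd call from the ball/ground collision logic):
    Code (CSharp):
    // Sketch: penalize the agent when the ball lands on its side of the court.
    public void OnBallHitGroundOnMySide()
    {
        AddReward(-1f); // letting the ball drop is bad
        EndEpisode();
    }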
     
  3. mbaske


    Joined:
    Dec 31, 2017
    Posts:
    473
    Hi, a couple of things...

    You're mixing continuous with discrete actions. Your behaviour parameters are set to space type = continuous, but the game logic expects discrete actions. Apparently you're converting the former to the latter by doing action value > 0.5 conditionals. Change the space type to discrete instead, and create action branches for movement, jump and dash.
    https://github.com/Unity-Technologi...onment-Design-Agents.md#discrete-action-space
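    With the float[] action API from your current code, that could look roughly like this (untested sketch; it assumes branch sizes of 3/3/2/2 set in the Behavior Parameters, and that SetMovementKey accepts -1/0/+1):
    Code (CSharp):
    // Sketch: discrete space with 4 branches:
    // branch 0 = horizontal move (3 options), branch 1 = vertical move (3 options),
    // branch 2 = jump (2 options), branch 3 = dash (2 options).
    public override void OnActionReceived(float[] vectorAction)
    {
        Vector2 movement = new Vector2();
        movement.x = ((int)vectorAction[0] - 1) * xFlipMul; // 0,1,2 -> -1,0,+1
        movement.y = (int)vectorAction[1] - 1;              // 0,1,2 -> -1,0,+1
        agentPlayer.inputManager.SetMovementKey(movement);
        agentPlayer.inputManager.SetJumpKey((int)vectorAction[2]); // 0 or 1
        agentPlayer.inputManager.SetDashKey((int)vectorAction[3]); // 0 or 1
        // ...reward logic as before
    }

    public override void Heuristic(float[] actionsOut)
    {
        // Write discrete action indices instead of continuous values.
        actionsOut[0] = 1 + (Keyboard.current.rightArrowKey.isPressed ? 1 : 0) - (Keyboard.current.leftArrowKey.isPressed ? 1 : 0);
        actionsOut[1] = 1 + (Keyboard.current.upArrowKey.isPressed ? 1 : 0) - (Keyboard.current.downArrowKey.isPressed ? 1 : 0);
        actionsOut[2] = Keyboard.current.zKey.isPressed ? 1 : 0;
        actionsOut[3] = Keyboard.current.xKey.isPressed ? 1 : 0;
    }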
    Your hyperparameter settings for batch_size and buffer_size should be fine for discrete actions, but are likely too low when using continuous actions.
    https://github.com/Unity-Technologi...uration-File.md#common-trainer-configurations

    You're not normalizing agent observations, so position and velocity values vary too much for the learning algorithm to make sense of. They should be constrained to a range between -1 and +1. You can set the hyperparameter normalize = true, which will tell the algorithm to adapt to the observation values it receives over time. Or you can normalize the values yourself in your agent class, before adding them to the vector sensor (my personal preference). You could simply divide them by the maximum possible values for a linear mapping of e.g. zero distance -> max distance to 0 -> 1.

    In some cases though, a linear mapping is not ideal for what the agent needs to be aware of. For instance, a ball distance change from 1m to 2m is more critical than a change from 10m to 11m. A non-linear mapping would make more sense here. I often rely on a sigmoid-like function for this; it has high resolution for small values and flattens out for large ones.
    float normalized = value / (1f + Mathf.Abs(value));
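    Applied to your posted CollectObservations, that might look roughly like this (sketch only):
    Code (CSharp):
    // Sketch: same observations as before, but squashed into (-1, +1).
    private static float Squash(float value)
    {
        return value / (1f + Mathf.Abs(value));
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        var pos = agentPlayer.transform.localPosition;
        sensor.AddObservation(Squash(pos.x * xFlipMul));
        sensor.AddObservation(Squash(pos.y));

        var vel = agentPlayer.velocity.current;
        sensor.AddObservation(Squash(vel.x * xFlipMul));
        sensor.AddObservation(Squash(vel.y));

        // ...repeat for otherPlayer and ball
    }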

    Make sure to localize observations, so they are relative to the agent's frame of reference. I think you're already doing this with the xFlipMul field. Observations must not differ depending on which side the agent is playing on.
     
  4. Forestherd


    Joined:
    May 1, 2016
    Posts:
    4
    I'm pretty sure I tried this once before, but it didn't seem to help much. I'll give it another go when I get the time, though!

    Thank you so much for the feedback! I don't have the time to try these out right now, but over the weekend I'll try any suggestions I get on this post and report the results.
     
  5. celion_unity


    Joined:
    Jun 12, 2019
    Posts:
    289
    I don't think self-play and your current reward structure go well together. Self-play should result in increasing ELO (which it looks like it's doing) but not necessarily increasing reward; it just uses reward to determine the winner, and expects the rewards to be zero-sum between the teams.

    I would recommend that you do one of:
    * disable self-play.
    * change your rewards so that agents get +1 reward for winning a point (is that the right volleyball term?) and -1 reward for losing.

    I think that the second one is what you actually want, since it should eventually train agents to win.
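    If you go with the second option, a rough sketch of what that could look like (OnPointResolved is a made-up method that your Game would call on both agents whenever a point ends):
    Code (CSharp):
    // Sketch: zero-sum end-of-point rewards, which is what self-play expects.
    public void OnPointResolved(bool wonThePoint)
    {
        SetReward(wonThePoint ? 1f : -1f);
        EndEpisode();
    }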
     
  6. hk1ll3r


    Joined:
    Sep 13, 2018
    Posts:
    88
    A bunch of things seem off with your reward and episode logic:
    * The existential reward should be negative, pushing the agent to try to end the episode as fast as possible. You are currently incentivizing the agent to do nothing (i.e. survive). It is hard for the agent to pick up on which actions are good or bad, because they all get the same reward until the episode ends. Have a negative existential reward and positive rewards for specific game-related things like distance to the ball, kicking the ball, and scoring, plus an explicit big negative reward (-1) for getting scored on.
    * The ball distance metric currently considers both x and y, and is negative until the ball is within distance 1 of the player. A better approach would probably be to only consider the x delta (for a 2D game) and omit the height from the distance metric. As long as the player learns to get under the ball, that's good enough. You are currently penalizing kicking the ball high.
    * Check that 1 is a good value for the threshold where the ball distance reward turns positive; it depends on the size and scale of your GameObjects.
    * You end the episode only when the current player scores. What if the other agent scores? You should call EndEpisode on BOTH agents if either of them scores. Currently each agent learns that by losing points it can extend its episode and accumulate more reward (because your existential reward is positive, there is no punishment for getting scored on, and the episode doesn't end when it loses a point).
    * Ending the episodes for both agents also applies when MaxStep is reached. Set MaxStep to 0 on your agents and handle it through your environment, so that both agents start and end episodes at the same time (see the sketch after this list).
    * You can simply end episodes on each point of the game. The optimal behavior for scoring a single point also plays well in a game to 10 or 15 points. Don't complicate the game for your teeny tiny ML-Agents learning brain.
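    Rough sketch of the episode/reward handling described above (untested; ResolvePoint is a made-up method on your Game/environment object that you'd call once whenever a point ends, and the reward magnitudes are guesses you'd need to tune):
    Code (CSharp):
    // Sketch: the environment ends BOTH episodes whenever a point is scored.
    // Lives on the Game/environment object, not on the agents.
    public void ResolvePoint(PlayerAgent scorer, PlayerAgent conceder)
    {
        scorer.AddReward(1f);     // scored the point
        conceder.AddReward(-1f);  // got scored on
        scorer.EndEpisode();      // end both episodes together
        conceder.EndEpisode();
        // ...then reset the ball and players for the next point
    }

    // Per-step shaping inside PlayerAgent.OnActionReceived:
    public override void OnActionReceived(float[] vectorAction)
    {
        // ...apply movement/jump/dash as before...

        AddReward(-0.0005f); // small negative existential reward

        // Only the horizontal distance to the ball matters for getting under it.
        float xDist = Mathf.Abs(agentPlayer.transform.position.x - ball.transform.position.x);
        float threshold = 5f; // tune to your court scale
        if (xDist < threshold)
        {
            AddReward(0.001f * (1f - xDist / threshold));
        }
    }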

    The ML-Agents repository has a lot of good working example environments. Look at one of them to see how to set up your environment correctly. Very helpful.
    https://github.com/Unity-Technologies/ml-agents

    I made a slime volleyball game with an ML-Agents AI back in 2019. Check it out: https://hk1ll3r.itch.io/slime-volleyball-feat-neural-ai