
Resolved: What PPO approach is used, and more

Discussion in 'ML-Agents' started by Bardent, Oct 26, 2021.

  1. Bardent

    Bardent

    Joined:
    Mar 27, 2020
    Posts:
    5
    Hi all!

    So I am using ML-Agents for my final-year project for my engineering degree. I have made a very simple little fighting game with two players that can do nothing, move left or right, perform a heavy, medium, or light attack, or block.

    The idea was to use PPO and self-play to train the agents and hopefully see the little guys develop some knowledge of the moveset and what it is capable of. I have set it up so that each attack has some frame advantage on hit or on block, which makes it fairly easy to counter as long as you know the pattern.

    So far I am not having much success with the training. The two players usually end up beating each other up for a little bit before getting too scared of each other and just doing nothing. Here are some reward structures that I have tried (a rough sketch of how I apply them in the agent script follows the lists below):

    1. The simple win/lose reward: +1 for the agent that wins, -1 for the one that loses, and 0 if it is a draw, as recommended in the self-play docs.
    2. Progressive: if a player has 100 health, 10 damage corresponds to a 0.1 reward for the attacker and a 0.1 penalty for the player that got hit.
    3. Adding a small distance penalty that scales up the further away from each other they are.
    4. Adding a small constant time penalty.
    5. Adding a small penalty when the player performs an attack but does not hit anything.
    Now here are some rewards I have thought of but have not experimented with yet:
    1. Adding a small penalty if the agent holds block for too long.
    2. Halving all penalties (a penalty multiplier of 0.5).
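
    For context, here is roughly how these rewards get applied in my agent script. This is a simplified sketch; the field names and default values are just illustrative, and in my actual project the multipliers come from the environment parameters shown in the configs below.

    using Unity.MLAgents;

    // Simplified sketch of the reward shaping listed above. Field names and
    // default values are illustrative; the real multipliers are read from the
    // environment_parameters in the training configs.
    public class FighterAgent : Agent
    {
        public float maxHealth = 100f;
        public float damagedRewardMultiplier = 1f;    // progressive reward (structure 2)
        public float distanceRewardPerUnit = 0.0001f; // distance penalty (structure 3)
        public float timeRewardPerStep = 0.0001f;     // constant time penalty (structure 4)
        public float attackMissedReward = 0.001f;     // missed-attack penalty (structure 5)
        public float penaltyMultiplier = 1f;          // set to 0.5 to halve all penalties

        // Called by the game logic when this agent deals or takes damage.
        public void OnDealtDamage(float damage)
        {
            AddReward(damagedRewardMultiplier * damage / maxHealth);
        }

        public void OnTookDamage(float damage)
        {
            AddReward(-penaltyMultiplier * damagedRewardMultiplier * damage / maxHealth);
        }

        // Called by the game logic when an attack hits nothing.
        public void OnAttackMissed()
        {
            AddReward(-penaltyMultiplier * attackMissedReward);
        }

        // Called every decision step with the current distance to the opponent.
        public void ApplyStepPenalties(float distanceToOpponent)
        {
            AddReward(-penaltyMultiplier * distanceRewardPerUnit * distanceToOpponent);
            AddReward(-penaltyMultiplier * timeRewardPerStep);
        }

        // Win/lose reward (structure 1): +1 / -1 / 0, then end the episode.
        public void OnMatchEnded(bool won, bool draw)
        {
            SetReward(draw ? 0f : (won ? 1f : -1f));
            EndEpisode();
        }
    }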
    Okay, that was just some background; now for the actual questions.

    1. I'm assuming the variant of the PPO algorithm used is PPO-Clip? I just want to confirm my understanding is right.
    2. In terms of self-play, is there a way in Unity to tell which agent is currently the one training? I would love to put a little arrow above its head, just for interest.
    3. Any advice on hyperparameters? I will post some of my results with hyperparameters at the end.
    4. How complicated should my NN be? I might be thinking my problem is more complicated than it is. I have 9 inputs: myXpos (normalized between -1 and 1), myVelocity (normalized between -1 and 1), myCurrentAction (taken from an enum, normalized between 0 and the max number of actions), myHealth (normalized between 0 and 1), the same four for the opponent, and the distance between the players (normalized between -1 and 1). The output is a single discrete action; the player cannot move and attack at the same time. (A sketch of how I collect these observations follows the questions.)
    5. How many timesteps would you say this would need to train? I have done runs of up to 50 million, based on the PPO paper, but I'm thinking that's too many.
    6. With self-play, do we really expect the average cumulative reward to go up all the time? Or would we expect it to oscillate around 0, since the opponents are usually skill-matched so you can expect to win about 50% of the time? Or am I thinking about this wrong?
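
    For reference on question 4, here is roughly how I collect those 9 observations. This is a sketch; the FighterState class, the normalization constants, and the field names are illustrative stand-ins for my actual game state.

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    // Illustrative container for one fighter's state.
    [System.Serializable]
    public class FighterState
    {
        public float xPos;
        public float velocity;
        public int currentAction; // index into the action enum
        public float health;
    }

    public class FighterAgentObservations : Agent
    {
        public FighterState me;
        public FighterState opponent;
        public float arenaHalfWidth = 10f; // illustrative
        public float maxSpeed = 5f;        // illustrative
        public int maxActions = 7;         // nothing, left, right, heavy, medium, light, block

        public override void CollectObservations(VectorSensor sensor)
        {
            // Own state (4 observations)
            sensor.AddObservation(me.xPos / arenaHalfWidth);             // [-1, 1]
            sensor.AddObservation(me.velocity / maxSpeed);               // [-1, 1]
            sensor.AddObservation((float)me.currentAction / maxActions); // [0, 1]
            sensor.AddObservation(me.health / 100f);                     // [0, 1]

            // Opponent state (4 observations)
            sensor.AddObservation(opponent.xPos / arenaHalfWidth);
            sensor.AddObservation(opponent.velocity / maxSpeed);
            sensor.AddObservation((float)opponent.currentAction / maxActions);
            sensor.AddObservation(opponent.health / 100f);

            // Distance between the players (1 observation, [-1, 1])
            sensor.AddObservation((opponent.xPos - me.xPos) / (2f * arenaHalfWidth));
        }
    }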

    Here are some of my results:

    First run (SR0). This one used the simple win/lose reward. I'm also not too sure what is happening with my episode length; I'm thinking maybe it's because it's not a multiple of my summary frequency, but I don't really get it, so if you know, please let me know.

    [attached image: SR0 training curves]

    Hyperparameters (SR0):

    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 3e-4
          beta: 5.0e-3
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 256
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 50000000
        time_horizon: 2048
        summary_freq: 20000
        self_play:
          save_steps: 20000
          team_change: 100000
          swap_steps: 40000
          window: 15
          play_against_latest_model_ratio: 0.4
    environment_parameters:
      max_env_steps: 10000.0
      win_reward: 1.0
      damaged_reward_multiplier: 0.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 1.0

    Second run (SR1). This one experimented with the halved penalty, as it felt like the agents were so scared of getting hit that they would not try to hit each other.
    [attached image: SR1 training curves]

    Hyperparameters (SR1):
    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 3e-4
          beta: 5.0e-3
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 50000000
        time_horizon: 1024
        summary_freq: 50000
        self_play:
          save_steps: 20000
          team_change: 100000
          swap_steps: 40000
          window: 5
          play_against_latest_model_ratio: 0.4
    environment_parameters:
      max_env_steps: 5000.0
      win_reward: 1.0
      damaged_reward_multiplier: 0.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 0.5

    Run (SR2): with this one I tried the progressive reward with the halved penalty.
    [attached image: SR2 training curves]

    env_settings:
      env_path: FinalBuild2/SKRIPSIE
      num_envs: 1

    engine_settings:
      width: 960
      height: 540
      time_scale: 5

    torch_settings:
      device: cpu

    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 3e-4
          beta: 5.0e-3
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 50000000
        time_horizon: 1024
        summary_freq: 50000
        self_play:
          save_steps: 20000
          team_change: 100000
          swap_steps: 40000
          window: 5
          play_against_latest_model_ratio: 0.4
    environment_parameters:
      max_env_steps: 5000.0
      win_reward: 0.0
      damaged_reward_multiplier: 1.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 0.5

    Some more runs and tears T.T between these.

    My latest runs seem a little more promising. After digging through the docs some more, I came across the curiosity and memory features and tried those.

    Run (SR8): still running on my computer as I write this. This one uses the progressive reward with memory added.
    [attached image: SR8 training curves]
    Hyperparameters (SR8):
    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 1.0e-4
          beta: 1.0e-2
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 512
          num_layers: 2
          memory:
            memory_size: 128
            sequence_length: 64
        reward_signals:
          extrinsic:
            gamma: 1.0
            strength: 1.0
        max_steps: 50000000
        time_horizon: 1024
        summary_freq: 50000
        self_play:
          save_steps: 100000
          team_change: 500000
          swap_steps: 50000
          window: 30
          play_against_latest_model_ratio: 0.5
    environment_parameters:
      max_env_steps: 10000.0
      win_reward: 0.0
      damaged_reward_multiplier: 1.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 1.0

    So SR8 is my most promising run so far, but it's still not great. My agents still really like hiding on opposite sides of the map XD. The wins are not actually zero; I just resumed the run.
    [attached image: SR8 results after resuming]
    If anybody is keen to chat about this with me, or can enlighten me as to what I might be doing wrong, that would be absolutely wonderful! Thank you so much for your time, and let me know if you need any other information from me :)
     
  2. Bardent

    Bardent

    Joined:
    Mar 27, 2020
    Posts:
    5
    I did not show the results of all my experiments here; they are just more of the same, no matter the reward structure.
     
  3. unity_-DoCqyPS6-iU3A

    unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    26
    I would start simple.

    In this case, I would (see the sketch at the end of this post):
    - limit the episode length, but give 1/(maximum episode length) as a penalty every step (to encourage quick actions)
    - end the episode as soon as one agent scores a hit, and punish the agent that got hit with -1 (this shows the agent that *every* step prior to the action is important for learning, and lets you forget about time_horizon)
    - give the agent a reward for a successful hit in the final step; maybe give a higher reward for "harder" attacks
    - start by fighting against a "null" agent that does nothing, so that your agent can learn to hit the opponent

    In the second run, let your AI fight against the pre-trained agent.

    Edit: also make sure that your "left"/"right" actions lead to the same movement (either towards your "own" side or the "opponent" side) regardless of which agent is training; otherwise you can't apply self-play. Maybe a symmetric environment will help in the beginning as well.
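
    Something like this, roughly (a sketch with illustrative names; how the hit callbacks get wired up depends on how your game reports hits):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    // Sketch of the suggestions above: a small per-step time penalty, and
    // ending the episode as soon as one agent lands a hit.
    public class SimplifiedFighterAgent : Agent
    {
        public int maxEpisodeLength = 1000; // in decision steps; illustrative value

        public override void OnActionReceived(ActionBuffers actions)
        {
            // ... apply the chosen action to the fighter here ...

            // Small constant penalty every step to encourage quick wins.
            AddReward(-1f / maxEpisodeLength);
            if (StepCount >= maxEpisodeLength)
            {
                EndEpisode(); // timed out: treat as a draw, no terminal reward
            }
        }

        // Called by the game when this agent lands a hit.
        // attackStrength could be e.g. 0.5 / 0.75 / 1.0 for light / medium / heavy.
        public void OnLandedHit(float attackStrength)
        {
            SetReward(attackStrength);
            EndEpisode();
        }

        // Called by the game when this agent gets hit.
        public void OnGotHit()
        {
            SetReward(-1f);
            EndEpisode();
        }
    }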
     
    Bardent likes this.
  4. KristophTG

    KristophTG

    Joined:
    May 23, 2020
    Posts:
    2
    One thing I noticed while playing a bit with reinforcement learning in Unity: if you want to implement a two-player game like this, where the agents share the same neural network (the same behaviour name, basically) and you are using vector observations, you need to make sure to flip all the observations so that which side the agent spawned on is indistinguishable to it. The same goes for the actions. Otherwise it fails to learn anything.

    Basically, just ask yourself: if you had an agent that was just a bunch of 'if' statements on the inputs, would it produce the same behaviour if it changed sides? If not, you must make it so.
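
    A rough sketch of what I mean (the facingSign convention and the constants here are illustrative, not from your project):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Side-agnostic observations and actions: everything is expressed relative
    // to the direction this agent is facing, so the policy cannot tell which
    // side it spawned on. facingSign is +1 for the left player and -1 for the
    // right player.
    public class MirroredFighterAgent : Agent
    {
        public float facingSign = 1f;
        public float moveSpeed = 3f;       // illustrative
        public float arenaHalfWidth = 10f; // illustrative
        public Transform opponent;

        private Rigidbody2D body;

        public override void Initialize()
        {
            body = GetComponent<Rigidbody2D>();
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            // Signed distance towards the opponent: positive when the opponent is
            // "in front" of this agent, regardless of which side it spawned on.
            float toOpponent = (opponent.position.x - transform.position.x) * facingSign;
            sensor.AddObservation(toOpponent / (2f * arenaHalfWidth));

            // Velocity expressed along the agent's own facing direction.
            sensor.AddObservation(body.velocity.x * facingSign / 5f);
        }

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Discrete action 0 = do nothing, 1 = move forward, 2 = move backward,
            // where "forward" always means "towards the opponent".
            int move = actions.DiscreteActions[0];
            float direction = move == 1 ? 1f : (move == 2 ? -1f : 0f);
            transform.Translate(Vector3.right * (direction * facingSign * moveSpeed * Time.fixedDeltaTime));
        }
    }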