
Resolved: What PPO approach is used, and more

Discussion in 'ML-Agents' started by Bardent, Oct 26, 2021.

  1. Bardent

    Bardent

    Joined:
    Mar 27, 2020
    Posts:
    5
    Hi all!

    So I am using ML-Agents for my final-year project for my engineering degree. I have made a very simple little fighting game with two players that can do nothing, move left or right, perform a heavy, medium, or light attack, or block.

    The idea was to use PPO and self-play to train the agents and hopefully see the little guys develop some knowledge of the moveset and what it is capable of. I have set it up so that each attack has some frame advantage on hit or on block, which makes it fairly easy to counter as long as you know the pattern.

    So far I am not having much success with the training. The two players usually end up beating each other up for a little bit before getting too scared of each other and just doing nothing. Here are some reward structures that I have tried (a rough sketch of how I apply them in the agent script follows the lists below):

    1. The simple win/lose reward: +1 for the agent that wins, -1 for the one that loses, and 0 if it is a draw, as recommended in the self-play docs.
    2. Progressive: if a player has 100 health, 10 damage corresponds to a 0.1 reward for the attacker and a 0.1 penalty for the player that got hit.
    3. Adding a small distance penalty that scales up the further away from each other they are.
    4. Adding a small constant time penalty.
    5. Adding a small penalty when the player performs an attack but does not hit anything.
    Now here are some rewards I have thought of but have not experimented with yet:
    1. Adding a small penalty if the agent holds block for too long.
    2. Halving all penalties (a penalty multiplier of 0.5).
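
    For context, here is roughly how these rewards get applied in my agent script. This is a simplified sketch; the field names and default values are just illustrative, and in my actual project the multipliers come from the environment parameters shown in the configs below.

    using Unity.MLAgents;

    // Simplified sketch of the reward shaping listed above. Field names and
    // default values are illustrative; the real multipliers are read from the
    // environment_parameters in the training configs.
    public class FighterAgent : Agent
    {
        public float maxHealth = 100f;
        public float damagedRewardMultiplier = 1f;    // progressive reward (structure 2)
        public float distanceRewardPerUnit = 0.0001f; // distance penalty (structure 3)
        public float timeRewardPerStep = 0.0001f;     // constant time penalty (structure 4)
        public float attackMissedReward = 0.001f;     // missed-attack penalty (structure 5)
        public float penaltyMultiplier = 1f;          // set to 0.5 to halve all penalties

        // Called by the game logic when this agent deals or takes damage.
        public void OnDealtDamage(float damage)
        {
            AddReward(damagedRewardMultiplier * damage / maxHealth);
        }

        public void OnTookDamage(float damage)
        {
            AddReward(-penaltyMultiplier * damagedRewardMultiplier * damage / maxHealth);
        }

        // Called by the game logic when an attack hits nothing.
        public void OnAttackMissed()
        {
            AddReward(-penaltyMultiplier * attackMissedReward);
        }

        // Called every decision step with the current distance to the opponent.
        public void ApplyStepPenalties(float distanceToOpponent)
        {
            AddReward(-penaltyMultiplier * distanceRewardPerUnit * distanceToOpponent);
            AddReward(-penaltyMultiplier * timeRewardPerStep);
        }

        // Win/lose reward (structure 1): +1 / -1 / 0, then end the episode.
        public void OnMatchEnded(bool won, bool draw)
        {
            SetReward(draw ? 0f : (won ? 1f : -1f));
            EndEpisode();
        }
    }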
    Okay, that was just some background; now for the actual questions.

    1. I'm assuming the variant of the PPO algorithm used is PPO-Clip? I just want to confirm my understanding is right.
    2. In terms of self-play, is there a way in Unity to tell which agent is currently the one training? I would love to put a little arrow above its head, just for interest.
    3. Any advice on hyperparameters? I will post some of my results with hyperparameters at the end.
    4. How complicated should my NN be? I might be thinking my problem is more complicated than it is. I have 9 inputs: myXpos (normalized between -1 and 1), myVelocity (normalized between -1 and 1), myCurrentAction (taken from an enum, normalized between 0 and the max number of actions), myHealth (normalized between 0 and 1), the same four for the opponent, and the distance between the players (normalized between -1 and 1). The output is a single discrete action; the player cannot move and attack at the same time. (A sketch of how I collect these observations follows the questions.)
    5. How many timesteps would you say this would need to train? I have done runs of up to 50 million, based on the PPO paper, but I'm thinking that's too many.
    6. With self-play, do we really expect the average cumulative reward to go up all the time? Or would we expect it to oscillate around 0, since the opponents are usually skill-matched so you can expect to win about 50% of the time? Or am I thinking about this wrong?
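
    For reference on question 4, here is roughly how I collect those 9 observations. This is a sketch; the FighterState class, the normalization constants, and the field names are illustrative stand-ins for my actual game state.

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    // Illustrative container for one fighter's state.
    [System.Serializable]
    public class FighterState
    {
        public float xPos;
        public float velocity;
        public int currentAction; // index into the action enum
        public float health;
    }

    public class FighterAgentObservations : Agent
    {
        public FighterState me;
        public FighterState opponent;
        public float arenaHalfWidth = 10f; // illustrative
        public float maxSpeed = 5f;        // illustrative
        public int maxActions = 7;         // nothing, left, right, heavy, medium, light, block

        public override void CollectObservations(VectorSensor sensor)
        {
            // Own state (4 observations)
            sensor.AddObservation(me.xPos / arenaHalfWidth);             // [-1, 1]
            sensor.AddObservation(me.velocity / maxSpeed);               // [-1, 1]
            sensor.AddObservation((float)me.currentAction / maxActions); // [0, 1]
            sensor.AddObservation(me.health / 100f);                     // [0, 1]

            // Opponent state (4 observations)
            sensor.AddObservation(opponent.xPos / arenaHalfWidth);
            sensor.AddObservation(opponent.velocity / maxSpeed);
            sensor.AddObservation((float)opponent.currentAction / maxActions);
            sensor.AddObservation(opponent.health / 100f);

            // Distance between the players (1 observation, [-1, 1])
            sensor.AddObservation((opponent.xPos - me.xPos) / (2f * arenaHalfWidth));
        }
    }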

    Here are some of my results:

    First run (SR0). This one used the simple win/lose reward. I'm also not too sure what is happening with my episode length; I'm thinking maybe it's because it's not a multiple of my summary frequency, but I don't really get it, so if you know, please let me know.

    [attached image: SR0 training curves]

    Hyperparameters (SR0):

    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 3e-4
          beta: 5.0e-3
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 256
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 50000000
        time_horizon: 2048
        summary_freq: 20000
        self_play:
          save_steps: 20000
          team_change: 100000
          swap_steps: 40000
          window: 15
          play_against_latest_model_ratio: 0.4
    environment_parameters:
      max_env_steps: 10000.0
      win_reward: 1.0
      damaged_reward_multiplier: 0.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 1.0

    Second run (SR1). This one experimented with the halved penalty, as it felt like the agents were so scared of getting hit that they would not try to hit each other.
    [attached image: SR1 training curves]

    Hyperparameters (SR1):
    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 3e-4
          beta: 5.0e-3
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 50000000
        time_horizon: 1024
        summary_freq: 50000
        self_play:
          save_steps: 20000
          team_change: 100000
          swap_steps: 40000
          window: 5
          play_against_latest_model_ratio: 0.4
    environment_parameters:
      max_env_steps: 5000.0
      win_reward: 1.0
      damaged_reward_multiplier: 0.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 0.5

    Run (SR2): with this one I tried the progressive reward with the halved penalty.
    [attached image: SR2 training curves]

    env_settings:
      env_path: FinalBuild2/SKRIPSIE
      num_envs: 1

    engine_settings:
      width: 960
      height: 540
      time_scale: 5

    torch_settings:
      device: cpu

    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 3e-4
          beta: 5.0e-3
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 50000000
        time_horizon: 1024
        summary_freq: 50000
        self_play:
          save_steps: 20000
          team_change: 100000
          swap_steps: 40000
          window: 5
          play_against_latest_model_ratio: 0.4
    environment_parameters:
      max_env_steps: 5000.0
      win_reward: 0.0
      damaged_reward_multiplier: 1.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 0.5

    Some more runs and tears T.T between these.

    My latest runs seem a little more promising. After digging through the docs some more, I came across the curiosity and memory features and tried those.

    Run (SR8): still running on my computer as I write this. This one uses the progressive reward with memory added.
    [attached image: SR8 training curves]
    Hyperparameters (SR8):
    behaviors:
      SkripsieFighter:
        trainer_type: ppo
        hyperparameters:
          batch_size: 512
          buffer_size: 10240
          learning_rate: 1.0e-4
          beta: 1.0e-2
          epsilon: 0.2
          lambd: 0.9
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 512
          num_layers: 2
          memory:
            memory_size: 128
            sequence_length: 64
        reward_signals:
          extrinsic:
            gamma: 1.0
            strength: 1.0
        max_steps: 50000000
        time_horizon: 1024
        summary_freq: 50000
        self_play:
          save_steps: 100000
          team_change: 500000
          swap_steps: 50000
          window: 30
          play_against_latest_model_ratio: 0.5
    environment_parameters:
      max_env_steps: 10000.0
      win_reward: 0.0
      damaged_reward_multiplier: 1.0
      attack_missed_reward: 0.0
      distance_reward: 0.0
      time_reward: 0.0
      block_hold_reward: 0.0
      max_block_time: 0.0
      penalty_multiplier: 1.0

    So SR8 is my most promising run so far, but it's still not great. My agents still really like hiding on opposite sides of the map XD. The wins are not actually zero; I just resumed the run.
    [attached image: SR8 results after resuming]
    If anybody is keen to chat about this with me, or can enlighten me as to what I might be doing wrong, that would be absolutely wonderful! Thank you so much for your time, and let me know if you need any other information from me :)
     
  2. Bardent

    Bardent

    Joined:
    Mar 27, 2020
    Posts:
    5
    I did not show the results of all my experiments here; they are just more of the same, no matter the reward structure.
     
  3. unity_-DoCqyPS6-iU3A

    unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    26
    I would start simple.

    In this case, I would (see the sketch at the end of this post):
    - limit the episode length, but give 1/(maximum episode length) as a penalty every step (to encourage quick actions)
    - end the episode as soon as one agent scores a hit, and punish the agent that got hit with -1 (this shows the agent that *every* step prior to the action is important for learning, and lets you forget about time_horizon)
    - give the agent a reward for a successful hit in the final step; maybe give a higher reward for "harder" attacks
    - start by fighting against a "null" agent that does nothing, so that your agent can learn to hit the opponent

    In the second run, let your AI fight against the pre-trained agent.

    Edit: also make sure that your "left"/"right" actions lead to the same movement (either towards your "own" side or the "opponent" side) regardless of which agent is training; otherwise you can't apply self-play. Maybe a symmetric environment will help in the beginning as well.
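
    Something like this, roughly (a sketch with illustrative names; how the hit callbacks get wired up depends on how your game reports hits):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    // Sketch of the suggestions above: a small per-step time penalty, and
    // ending the episode as soon as one agent lands a hit.
    public class SimplifiedFighterAgent : Agent
    {
        public int maxEpisodeLength = 1000; // in decision steps; illustrative value

        public override void OnActionReceived(ActionBuffers actions)
        {
            // ... apply the chosen action to the fighter here ...

            // Small constant penalty every step to encourage quick wins.
            AddReward(-1f / maxEpisodeLength);
            if (StepCount >= maxEpisodeLength)
            {
                EndEpisode(); // timed out: treat as a draw, no terminal reward
            }
        }

        // Called by the game when this agent lands a hit.
        // attackStrength could be e.g. 0.5 / 0.75 / 1.0 for light / medium / heavy.
        public void OnLandedHit(float attackStrength)
        {
            SetReward(attackStrength);
            EndEpisode();
        }

        // Called by the game when this agent gets hit.
        public void OnGotHit()
        {
            SetReward(-1f);
            EndEpisode();
        }
    }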
     
    Bardent likes this.
  4. KristophTG

    KristophTG

    Joined:
    May 23, 2020
    Posts:
    2
    One thing I noticed while playing a bit with reinforcement learning in Unity: if you want to implement a two-player game like this, where the agents share the same neural network (the same behaviour name, basically) and you are using vector observations, you need to make sure to flip all the observations so that which side the agent spawned on is indistinguishable to it. The same goes for the actions. Otherwise it fails to learn anything.

    Basically, just ask yourself: if you had an agent that was just a bunch of 'if' statements on the inputs, would it produce the same behaviour if it changed sides? If not, you must make it so.
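
    A rough sketch of what I mean (the facingSign convention and the constants here are illustrative, not from your project):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Side-agnostic observations and actions: everything is expressed relative
    // to the direction this agent is facing, so the policy cannot tell which
    // side it spawned on. facingSign is +1 for the left player and -1 for the
    // right player.
    public class MirroredFighterAgent : Agent
    {
        public float facingSign = 1f;
        public float moveSpeed = 3f;       // illustrative
        public float arenaHalfWidth = 10f; // illustrative
        public Transform opponent;

        private Rigidbody2D body;

        public override void Initialize()
        {
            body = GetComponent<Rigidbody2D>();
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            // Signed distance towards the opponent: positive when the opponent is
            // "in front" of this agent, regardless of which side it spawned on.
            float toOpponent = (opponent.position.x - transform.position.x) * facingSign;
            sensor.AddObservation(toOpponent / (2f * arenaHalfWidth));

            // Velocity expressed along the agent's own facing direction.
            sensor.AddObservation(body.velocity.x * facingSign / 5f);
        }

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Discrete action 0 = do nothing, 1 = move forward, 2 = move backward,
            // where "forward" always means "towards the opponent".
            int move = actions.DiscreteActions[0];
            float direction = move == 1 ? 1f : (move == 2 ? -1f : 0f);
            transform.Translate(Vector3.right * (direction * facingSign * moveSpeed * Time.fixedDeltaTime));
        }
    }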