
Self-play with automatic curriculum? [now with source & video]

Discussion in 'ML-Agents' started by mbaske, Jun 29, 2020.

  1. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Hi, I'd be interested in your thoughts on automating curricula with symmetric self-play. I'm not super familiar with the theory behind this, so I might be reinventing the wheel here or missing some crucial aspect.

    I have two humanoid agents pitted against each other in a boxing match. Episodes are open-ended and can terminate in one of three ways (a rough sketch follows after the list):
    1) Whenever an agent strikes its opponent, I add the magnitude of its fist velocity to a float field that is zeroed on episode start. Once the accumulated velocities exceed a given max value, that agent is the winner and the opponent loses, even if the difference between both agents' accumulated velocities is minimal.
    2) When an agent's head drops below some threshold height, it is considered down and a counter starts. If the agent doesn't manage to get its head up again within a given number of steps, it loses and the opponent wins.
    3) If one agent is down and its opponent goes down as well, the episode ends immediately in a draw.
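
    In code, these termination checks boil down to something like the sketch below. This is illustrative only, not the actual project code: the class, field and method names (MatchController, OnPunchLanded, EpisodeEnded) and the default threshold values are made up for the example, and the +1/-1 win/lose rewards are just the usual convention.

    ```csharp
    using Unity.MLAgents;
    using UnityEngine;

    public enum EpisodeResult { WinByStrike, LoseByDown, Draw }

    // Watches both agents once per physics step and applies the three termination rules.
    public class MatchController : MonoBehaviour
    {
        public Agent agentA;
        public Agent agentB;
        public Transform headA;
        public Transform headB;
        public float downHeightThreshold = 0.75f;

        // Updated by the auto-curriculum class as training progresses.
        public float maxAccumulatedVelocities = 2f;
        public int maxDownSteps = 5;

        // Raised once per episode with the reason it ended.
        public event System.Action<EpisodeResult> EpisodeEnded;

        float accumA, accumB;   // accumulated fist velocity magnitudes, zeroed on reset
        int downA, downB;       // consecutive steps each head has been below the threshold

        // Called from the fist collision handlers with the fist velocity magnitude.
        public void OnPunchLanded(Agent striker, float fistSpeed)
        {
            if (striker == agentA) accumA += fistSpeed; else accumB += fistSpeed;

            // 1) Accumulated strike velocities exceed the current max -> win / lose.
            if (accumA >= maxAccumulatedVelocities) End(EpisodeResult.WinByStrike, agentA, agentB);
            else if (accumB >= maxAccumulatedVelocities) End(EpisodeResult.WinByStrike, agentB, agentA);
        }

        void FixedUpdate()
        {
            downA = headA.position.y < downHeightThreshold ? downA + 1 : 0;
            downB = headB.position.y < downHeightThreshold ? downB + 1 : 0;

            if (downA > 0 && downB > 0)
            {
                // 3) Both agents down at the same time -> immediate draw.
                End(EpisodeResult.Draw, null, null);
            }
            else if (downA > maxDownSteps)
            {
                // 2) Agent A stayed down for too long -> A loses, B wins.
                End(EpisodeResult.LoseByDown, agentB, agentA);
            }
            else if (downB > maxDownSteps)
            {
                End(EpisodeResult.LoseByDown, agentA, agentB);
            }
        }

        void End(EpisodeResult result, Agent winner, Agent loser)
        {
            if (winner != null) winner.SetReward(1f);
            if (loser != null) loser.SetReward(-1f);
            accumA = accumB = 0f;
            downA = downB = 0;
            agentA.EndEpisode();
            agentB.EndEpisode();
            EpisodeEnded?.Invoke(result);   // consumed by the auto-curriculum class
        }
    }
    ```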

    The challenge with this (besides the fidgety physics) is that my values for max_accumulated_velocities and max_down_steps need to change as training progresses. In the beginning, agents lose their balance almost immediately, and in order for them to understand that falling down is bad, max_down_steps has to be tiny; otherwise both agents go down and all episodes end in a draw. Likewise, max_accumulated_velocities has to start low, so agents can learn that punching their opponent results in a win. Later, agents need to learn to strike harder and repeatedly, as well as how to get up after falling down, for which the respective max values have to be higher.

    I'm trying to solve this with a separate auto-curriculum class that receives OnEpisodeStop events from all agents in a scene, indicating why an episode has ended. This class keeps a few counters: total_episode_count, win_by_strike_count, lose_by_down_count and draw_count. When an episode ends, it compares the counter values against each other and updates the max values for all agents accordingly. For instance, if more than some fraction of all episodes ended because of win_by_strike, then max_accumulated_velocities is incremented; otherwise it is decremented. Similarly for max_down_steps: its value is lowered if too many episodes ended in a draw.
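
    To make this more concrete, here is a rough sketch of such a class, wired to the MatchController sketch above rather than to individual agents. Again, all names are placeholders, and the evaluation window, fractions, step sizes and limits are illustrative values that would need tuning; I batch the decision over a window of episodes here, but you could just as well re-evaluate after every single episode.

    ```csharp
    using UnityEngine;

    // Counter-based auto-curriculum: adjusts the termination thresholds for all
    // matches in the scene based on how recent episodes ended.
    public class AutoCurriculum : MonoBehaviour
    {
        public MatchController[] matches;      // all matches running in the scene

        [Header("Current curriculum values pushed to all matches")]
        public float maxAccumulatedVelocities = 2f;
        public int maxDownSteps = 5;

        [Header("Tuning (illustrative values)")]
        public int evaluationWindow = 100;     // episodes between threshold updates
        public float strikeWinFraction = 0.5f; // target share of win-by-strike episodes
        public float drawFraction = 0.3f;      // tolerated share of draws

        int totalEpisodeCount, winByStrikeCount, loseByDownCount, drawCount;

        void OnEnable()
        {
            foreach (var match in matches)
                match.EpisodeEnded += OnEpisodeStop;
        }

        void OnDisable()
        {
            foreach (var match in matches)
                match.EpisodeEnded -= OnEpisodeStop;
        }

        void OnEpisodeStop(EpisodeResult result)
        {
            totalEpisodeCount++;
            switch (result)
            {
                case EpisodeResult.WinByStrike: winByStrikeCount++; break;
                case EpisodeResult.LoseByDown: loseByDownCount++; break;
                case EpisodeResult.Draw: drawCount++; break;
            }

            if (totalEpisodeCount < evaluationWindow) return;

            // Enough win-by-strike episodes -> demand harder / more punches;
            // otherwise make winning by strike easier again.
            if (winByStrikeCount > strikeWinFraction * totalEpisodeCount)
                maxAccumulatedVelocities = Mathf.Min(maxAccumulatedVelocities + 1f, 50f);
            else
                maxAccumulatedVelocities = Mathf.Max(maxAccumulatedVelocities - 1f, 1f);

            // Too many draws -> shorten the allowed down time so falling is punished sooner.
            if (drawCount > drawFraction * totalEpisodeCount)
                maxDownSteps = Mathf.Max(maxDownSteps - 1, 1);
            else
                maxDownSteps = Mathf.Min(maxDownSteps + 1, 200);

            // Push the new thresholds to every match and start a fresh evaluation window.
            foreach (var match in matches)
            {
                match.maxAccumulatedVelocities = maxAccumulatedVelocities;
                match.maxDownSteps = maxDownSteps;
            }
            totalEpisodeCount = winByStrikeCount = loseByDownCount = drawCount = 0;
        }
    }
    ```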

    Does this sound like a reasonable approach to the problem? Thanks!
     
    Last edited: Jun 29, 2020
  2. andrzej_

    andrzej_

    Joined:
    Dec 2, 2016
    Posts:
    81
    I actually experimented with something similar, using some additional metrics beyond reward that affect the environment. From what you describe, it should work in theory, although getting the thresholds 'right' for all those values you track might be a bit tricky.

    If you manage to train a boxing agent, please do share. Sounds like a cool idea :)
     
  3. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Yes, I noticed - "one week of tweaking" kind of tricky.
    Will do!
     
  4. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    OK, here's the source for my project.
    https://github.com/mbaske/ml-selfplay-fighter

    I was hoping to achieve something similar to this: https://openai.com/blog/competitive-self-play/.
    It's not quite there yet, but I'm putting development on hold for now. I think my basic observation, reward and curriculum approach is on the right track. There are issues with the physics setup, though. Agents put a lot of effort into balancing themselves by sliding their legs forwards and backwards and sticking their arms out, which obviously prevents them from fighting more effectively. I'm not sure what to do about that... I guess simulating human body physics in Unity is hard - well, this is the best I can do at the moment. However, I think I saw agents deliberately dodging and blocking punches every now and then, so there's still hope for them.

    I did quite a bit of pausing and resuming while training the model, because I made a couple of tweaks along the way and didn't want to retrain from scratch every time. Hence I can't promise that retraining the policy with the current settings will result in exactly the same behaviour, but it should be close enough. There are just a ton of variables to keep track of. Changing drag a little, for instance, can cause pretty different movement styles, not to mention all the ML-related parameters. I'd be thrilled if someone could improve upon this and post the results here!

    As always, I made a short video for my YouTube portfolio.