
Question: Looking for advice on how to make ML-Agents learn in my turn-based game

Discussion in 'ML-Agents' started by MEGELUL, Jan 6, 2023.

  1. MEGELUL

    Joined:
    Feb 6, 2020
    Posts:
    3
    Hello.

    I've been struggling with implementing ML-Agents in a turn-based strategy game for the past month or so and was hoping that someone could give me some advice to point me in the right direction. Most of the ML-Agents tutorials deal with real-time physics-based games, and the few that cover turn-based games go into very little detail and are 2-3 years old.

    Some info on my game:

    -It's a turn-based strategy game where each player controls one or more units of different types (fighter, archer, etc.).
    -Each unit has different characteristics and abilities.
    -Each unit can act multiple times per turn (they have action points and each action costs some amount of action points).
    -During a player's turn they can control their units in any order.
    -The game is played on a gridless board, so if a player wants to move a unit they click a spot on the map and the unit moves to that Vector3 coordinate.
    -Movement also costs points, and each unit can move X units of distance per point spent depending on their movement speed.
    -There are randomness elements in my game - attacks do not always hit and deal a random amount of damage.

    Ideally a player should be able to make these decisions:
    -Which unit do I control?
    -Which action do I use?
    -If I move, then to which coordinate?
    -If I use an offensive action, then on which target?

    The game is controlled by a state manager class that passes the turn between player 1 and player 2 and handles ending and restarting the game when one of the sides wins (kills all of the opponent's units).

    ML-Agents integration:

    I implemented ML-Agents like this: the agent script is attached to a unit controller object (not the units themselves) and when the state manager detects that it's the agent's turn, it keeps calling that agent's RequestDecision method until the agent chooses to end its turn.
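    Roughly, that turn loop looks like this (a simplified sketch with placeholder names, not my actual classes):

    Code (CSharp):
    // Simplified sketch of the loop described above. Class, field and method
    // names are placeholders, not the real project code.
    using System.Collections;
    using UnityEngine;
    using Unity.MLAgents;

    public class StateManagerSketch : MonoBehaviour
    {
        public Agent unitControllerAgent;   // the Agent on the unit controller object

        // Started by the state manager when the turn passes to the agent's side.
        public IEnumerator RunAgentTurn(System.Func<bool> agentEndedTurn)
        {
            while (!agentEndedTurn())
            {
                // Ask the policy for one action; OnActionReceived will apply it.
                unitControllerAgent.RequestDecision();
                // Wait a frame so the Academy can step and the action can resolve.
                yield return null;
            }
        }
    }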

    Agent setup:

    Currently I am trying to train the agent in a simplified version of my environment where it controls only one unit and the enemy also controls one unit on a small field.

    My agent's observations are: all relevant information about the units (their positions, characteristics and types), the borders of the field beyond which the agent is not allowed to move, and the distance between the units - 45 values in total.
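    Collecting them looks roughly like this (a simplified sketch; the field names and the exact breakdown of the 45 values are approximate):

    Code (CSharp):
    // Approximate shape of the observations described above (not the exact code).
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    public class ObservationSketch : Agent
    {
        [SerializeField] Transform myUnit;
        [SerializeField] Transform enemyUnit;
        [SerializeField] Rect fieldBounds;   // playable area the unit must stay inside

        public override void CollectObservations(VectorSensor sensor)
        {
            // Unit positions (3 + 3 values).
            sensor.AddObservation(myUnit.position);
            sensor.AddObservation(enemyUnit.position);

            // Borders of the field the agent is not allowed to cross (4 values).
            sensor.AddObservation(fieldBounds.xMin);
            sensor.AddObservation(fieldBounds.xMax);
            sensor.AddObservation(fieldBounds.yMin);
            sensor.AddObservation(fieldBounds.yMax);

            // Distance between the two units (1 value).
            sensor.AddObservation(Vector3.Distance(myUnit.position, enemyUnit.position));

            // ...plus unit characteristics, types and ability info, up to 45 values total.
        }
    }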

    For the actions I use a combination of continuous and discrete actions: 2 continuous actions and 1 branch with 6 discrete actions. The discrete actions represent what the agent does with the unit (0 - skip turn, 1 - move, 2-5 - use one of the unit's abilities). If the agent chooses to move, the target coordinate is determined by the 2 continuous values: the first value is translated into the distance the agent will move from its position (where 1 is the distance it can cover if it spends all of its action points on movement), and the second value is translated into the angle (where -1 is 0 degrees and 1 is 360 degrees around its current position).
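    In code, the decoding looks roughly like this (a sketch with placeholder names; the exact [-1, 1] to distance/angle mapping in my project may differ slightly):

    Code (CSharp):
    // Sketch of the hybrid action decoding described above (placeholder names).
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class HybridActionSketch : Agent
    {
        [SerializeField] Transform myUnit;
        [SerializeField] float maxMoveDistance = 10f;   // distance if all action points go into movement

        public override void OnActionReceived(ActionBuffers actions)
        {
            int choice = actions.DiscreteActions[0];    // 0 skip, 1 move, 2-5 abilities

            if (choice == 0)
            {
                EndTurn();
            }
            else if (choice == 1)
            {
                // Map continuous action 0 from [-1, 1] to [0, maxMoveDistance].
                float distance = (actions.ContinuousActions[0] + 1f) * 0.5f * maxMoveDistance;
                // Map continuous action 1 from [-1, 1] to [0, 360] degrees.
                float angle = (actions.ContinuousActions[1] + 1f) * 0.5f * 360f;

                Vector3 offset = Quaternion.Euler(0f, angle, 0f) * Vector3.forward * distance;
                MoveUnitTo(myUnit.position + offset);
            }
            else
            {
                UseAbility(choice - 2);                 // abilities indexed 0-3
            }
        }

        // Game-side calls, stubbed out here.
        void EndTurn() { }
        void MoveUnitTo(Vector3 target) { }
        void UseAbility(int abilityIndex) { }
    }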

    I also apply an action mask so that the agent can't use combat actions when the enemy is out of range, when it doesn't have enough resources, and so on.
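    The masking happens in WriteDiscreteActionMask, roughly like this (a simplified sketch; the range and cost checks are placeholders, and SetActionEnabled is the masking call in recent ML-Agents versions):

    Code (CSharp):
    // Sketch of the masking described above (range/cost checks are placeholders).
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class ActionMaskSketch : Agent
    {
        [SerializeField] Transform myUnit;
        [SerializeField] Transform enemyUnit;
        [SerializeField] float attackRange = 3f;
        [SerializeField] int attackCost = 2;
        int currentActionPoints = 6;

        public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
        {
            bool inRange = enemyUnit != null &&
                Vector3.Distance(myUnit.position, enemyUnit.position) <= attackRange;
            bool canAfford = currentActionPoints >= attackCost;

            // Branch 0, actions 2-5 are the offensive abilities; disable them when
            // the enemy is out of range or the unit can't pay for them.
            for (int ability = 2; ability <= 5; ability++)
            {
                actionMask.SetActionEnabled(0, ability, inRange && canAfford);
            }
        }
    }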

    Training the agent:

    Currently I am training the agent against a dummy script that always chooses random actions, so the agent gets exposed to a variety of states.

    I want the agent to kill the enemy unit in under 30 turns. I assign a reward equal to the number of remaining turns divided by the maximum number of turns, so the agent gets 1 point if it manages to kill the enemy on its first turn and 0 points if it fails to kill the enemy at all. I also give a small penalty if it tries to move outside the bounds of the map.
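    In other words, the terminal reward is just (a sketch; the counters are placeholders):

    Code (CSharp):
    // Sketch of the terminal reward described above.
    public static class RewardSketch
    {
        // 1.0 for a first-turn kill, approaching 0.0 as the 30-turn limit is reached.
        public static float WinReward(int turnsTaken, int maxTurns)
        {
            return (float)(maxTurns - turnsTaken) / maxTurns;
        }
    }
    // On a win:   agent.SetReward(RewardSketch.WinReward(turnsTaken, 30));
    // On timeout: agent.SetReward(0f);
    // On an out-of-bounds move attempt: agent.AddReward(-penalty);   // small value, not listed here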

    When training begins, during the first few thousand games the enemy's health is set to a value lower than maximum, so the agent has an easier time killing it and getting the reward associated with victory. The enemy gets more health in the subsequent few thousand games, and after that it starts at full health.

    EndEpisode is called whenever the game ends - either the turn limit runs out or one of the parties loses all their units. EndEpisode is called AFTER I reinitialize the game field: when units are killed their game objects are destroyed, so I have to respawn them, and because destroying and creating objects doesn't happen immediately in Unity, I have to skip a few frames for everything to reset correctly. Because of this, the game is actually restarted by the state manager class, and OnEpisodeBegin simply signals that the next episode has started and doesn't do anything besides that.

    There is a weird behavior associated with ML-Agents and removing/respawning game objects - it keeps trying to access the destroyed objects for a while and throws a bunch of errors about accessing nonexistent objects (even after adding a bunch of "if gameobject" null checks inside the agent's methods), but after everything reinitializes it continues working properly, so I've been ignoring it for now. Moving the game restart logic to OnEpisodeBegin didn't fix the problem and only made some things worse (for example, setting the enemy's health to a lower value did not persist into the next frame). The vast majority of the errors happen inside WriteDiscreteActionMask.
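    The guards look roughly like this (a simplified sketch, not the exact code):

    Code (CSharp):
    // Sketch of the "if gameobject" guard mentioned above (placeholder names).
    // A destroyed Unity object compares equal to null, so the check looks like this:
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class DestroyedObjectGuardSketch : Agent
    {
        [SerializeField] GameObject enemyUnit;

        public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
        {
            if (enemyUnit == null)
            {
                // Enemy is destroyed and not yet respawned: disable the combat
                // actions (branch 0, actions 2-5) instead of touching the dead reference.
                for (int ability = 2; ability <= 5; ability++)
                {
                    actionMask.SetActionEnabled(0, ability, false);
                }
                return;
            }
            // ...normal masking based on range and action points goes here.
        }
    }
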
    The problem:

    I want the agent to grasp the behavior of figuring out where the enemy unit currently is, moving to that unit and then killing it using the best abilities for the job. I've gotten to the point where the agent understands that there are different kinds of units in the game, and it can use its offensive abilities and kill the enemy unit if the two somehow end up near each other, but it doesn't learn to move to the enemy so that it can attack.

    Whenever I train it, it wanders around the field for a while and eventually settles into a behavior where it just moves to the field's edge and stays there, or jitters in place by making very small movements around its current position. If an enemy does come within its range it will attack, but it refuses to learn to move to the enemy on its own.

    [Attached image: upload_2023-1-6_13-27-46.png - white spheres represent the places it tried to move to]

    The reason I added a turn limit is that, prior to that, the agent would learn to simply not move at all and wait for the enemy to come near it so it could attack. With the turn limit it does move a lot, but not in the way I expect it to.

    I thought about adding a reward for moving towards the enemy, but that would probably result in suboptimal behavior where it learns to, for example, run towards an enemy melee fighter as an archer. I also noticed that rewarding the agent for specific small actions leads to scenarios where it doesn't actually try to win and instead tries to take those small actions as many times as possible.

    During one of my experiments I noticed that when sampling continuous actions at the beginning of training, the agent sometimes makes too many similar decisions (for example, 99% of its outputted continuous actions were above zero, or below zero; sometimes the split is 80-20 or 70-30), so maybe that has something to do with it.

    Here is a result from my latest run using PPO

    [Attached image: upload_2023-1-6_9-52-3.png - training summary of the latest PPO run]
    The parameters:

    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 20480
      learning_rate: 0.0001
      beta: 0.1
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      shared_critic: false
      learning_rate_schedule: linear
      beta_schedule: linear
      epsilon_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 3
      vis_encode_type: simple
      memory: null
      goal_conditioning_type: hyper
      deterministic: false
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
          memory: null
          goal_conditioning_type: hyper
          deterministic: false
    init_path: null
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 500000
    time_horizon: 256
    summary_freq: 10000
    threaded: false
    self_play: null
    behavioral_cloning: null

    Here is a summary of a SAC run (note: I ended it early because it was taking way too much time and wasn't really improving).

    [Attached image: upload_2023-1-6_9-58-0.png - training summary of the SAC run]
    Parameters:

    trainer_type: sac
    hyperparameters:
      batch_size: 512
      buffer_size: 512000
      learning_rate: 0.0003
      buffer_init_steps: 5000
      init_entcoef: 1.0
      tau: 0.005
      steps_per_update: 1
      learning_rate_schedule: constant
      save_replay_buffer: true
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 3
      vis_encode_type: simple
      memory: null
      goal_conditioning_type: hyper
      deterministic: false
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
          memory: null
          goal_conditioning_type: hyper
          deterministic: false
    init_path: null
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 500000
    time_horizon: 256
    summary_freq: 10000
    threaded: true
    self_play: null
    behavioral_cloning: null

    What I have tried:

    -Using both PPO and SAC - I observed slightly better results with SAC, but it still eventually settles into the "staying in a corner" behavior. SAC is also very slow and makes Unity freeze a lot.
    -Using stacked vector observations - as I understand it, this makes the agent consider the observations from several prior steps.
    -Changing hyperparameters according to recommendations from the documentation and tutorials.
    -Various reward configurations: giving negative rewards if the agent runs out of time and a full reward for a win, giving positive rewards for attacking.
    -Normalizing observation values.
    -Moving my state manager's state handling function from Update to FixedUpdate.

    None of that helped.

    Could it be that my environment is simply too complex for the agent to solve on its own and I have to simplify the action space? I would be very grateful for any advice or ideas anyone has about my scenario.
     
  2. hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    175
    You might want to try using a curriculum: start from a very small map with just two simple units and gradually ramp up the complexity. Even if that fails to make the complicated scenario work, you will gain confidence by seeing the simpler scenarios work, and gain experience and insight as the scenarios become more difficult.

    PPO can probably work for this, but it has to either 1. be playing a scenario where it can get rewards by chance prior to the heat death of the universe, or 2. have been pretrained on an easier scenario such that 1. now becomes true.
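    On the Unity side, a curriculum mostly just means reading environment parameters when you reset the scene, something like this (the parameter names here are made up; they would be driven by a curriculum section in your trainer YAML):

    Code (CSharp):
    // Sketch of reading curriculum-driven environment parameters at reset time.
    // The parameter names are examples only, not ones defined anywhere in this thread.
    using UnityEngine;
    using Unity.MLAgents;

    public class CurriculumResetSketch : MonoBehaviour
    {
        public int enemyMaxHealth = 10;

        // Hypothetical hook called by the state manager between games.
        public int GetEnemyStartingHealth()
        {
            var envParams = Academy.Instance.EnvironmentParameters;
            // Ramp "enemy_health_fraction" from something small up to 1.0 across lessons.
            float fraction = envParams.GetWithDefault("enemy_health_fraction", 1f);
            return Mathf.Max(1, Mathf.RoundToInt(enemyMaxHealth * fraction));
        }

        public float GetFieldSize()
        {
            // Start on a tiny map and grow it as lessons advance.
            return Academy.Instance.EnvironmentParameters.GetWithDefault("field_size", 20f);
        }
    }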
     
  3. MEGELUL

    Joined:
    Feb 6, 2020
    Posts:
    3
    The interesting thing is that in the scenario I described it does gain rewards, and quite often in fact, especially during the early training stages where the enemy's health is set to a small percentage of its maximum, yet it still fails to learn to move to the target using the combination of continuous and discrete values in order to attack it.

    Another thing I have tried: training the agent in the same scenario but with 1 additional discrete action which simply makes it move to the target. The results were MUCH better, however it learned to rely ONLY on that action and barely ever used free movement in the coordinate space. I then tried to train it again, initializing from that model but with that action hidden, and it started walking into the map borders again.
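    That extra action was basically a scripted "move toward the target" step, roughly like this (a sketch with placeholder names; it assumes the macro was added as a 7th entry, index 6, on the existing discrete branch):

    Code (CSharp):
    // Sketch of the extra "move to target" macro action.
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class MacroMoveSketch : Agent
    {
        [SerializeField] Transform myUnit;
        [SerializeField] Transform enemyUnit;
        [SerializeField] float maxMoveDistance = 10f;

        public override void OnActionReceived(ActionBuffers actions)
        {
            int choice = actions.DiscreteActions[0];

            if (choice == 6 && enemyUnit != null)
            {
                // Scripted movement straight toward the enemy, capped by the movement budget.
                Vector3 toEnemy = enemyUnit.position - myUnit.position;
                myUnit.position += Vector3.ClampMagnitude(toEnemy, maxMoveDistance);
                return;
            }
            // ...actions 0-5 are handled as before.
        }
    }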

    If I were to use a curriculum, do you think it could work to first train it ONLY to move (negative reward for trying to move off the field, positive reward for moving correctly, positive reward for moving to the enemy when it has the advantage) and then train it to do everything else?
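    For the "reward for moving to the enemy" part, one option (just an idea, not something I've implemented) is to reward the change in distance rather than proximity itself, so the reward can't be farmed by jittering in place:

    Code (CSharp):
    // Sketch of a distance-based shaping reward for a movement-only curriculum stage
    // (an idea for discussion, not code from the project). Rewarding the *change* in
    // distance means moving back and forth nets roughly zero reward.
    using UnityEngine;
    using Unity.MLAgents;

    public class MovementShapingSketch : Agent
    {
        [SerializeField] Transform myUnit;
        [SerializeField] Transform enemyUnit;
        [SerializeField] float shapingScale = 0.05f;   // keep small relative to the win reward
        float previousDistance;

        public override void OnEpisodeBegin()
        {
            previousDistance = Vector3.Distance(myUnit.position, enemyUnit.position);
        }

        // Call this after each move action has been resolved.
        public void AddMoveShapingReward()
        {
            float distance = Vector3.Distance(myUnit.position, enemyUnit.position);
            AddReward(shapingScale * (previousDistance - distance));
            previousDistance = distance;
        }
    }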
     
  4. lycettthomas94

    Joined:
    Jun 13, 2020
    Posts:
    5
    I would experiment with using a discrete action space; in my experience it's easier for the agent to learn. You can still get fine movement by moving a small distance with each decision. I would do something like having the agent choose a direction [left, right, up, down] and move a fixed distance in it (repeated until it uses up its movement points or ends the turn), and then, if it can learn that, make things more complex from there.
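    Something along these lines (a rough sketch; the names, step size and "end turn" index are placeholders):

    Code (CSharp):
    // Rough sketch of the discrete-only movement scheme suggested above.
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class DiscreteMoveSketch : Agent
    {
        [SerializeField] Transform myUnit;
        [SerializeField] float stepDistance = 1f;
        public int actionPointsLeft = 6;

        static readonly Vector3[] Directions =
        {
            Vector3.left, Vector3.right, Vector3.forward, Vector3.back
        };

        public override void OnActionReceived(ActionBuffers actions)
        {
            int choice = actions.DiscreteActions[0];   // 0-3 = move direction, 4 = end turn

            if (choice == 4 || actionPointsLeft <= 0)
            {
                EndTurn();
                return;
            }

            myUnit.position += Directions[choice] * stepDistance;
            actionPointsLeft--;
        }

        void EndTurn() { }
    }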