
Question: Configuring training parameters for a maze-like environment.

Discussion in 'ML-Agents' started by StewedHarry, Aug 25, 2020.

  1. StewedHarry

    StewedHarry

    Joined:
    Jan 20, 2020
    Posts:
    45
    I'm trying to create an agent which can traverse a maze-like environment, but the resulting behaviour has not lived up to my expectations.

    I have developed an environment-generating script which produces random maze-like levels. These can be varied in difficulty, both in terms of overall size and the number of obstacles and dead ends.

    Here is an example curriculum:

    1) Initial empty environment with two exits and no obstacles:
    upload_2020-8-25_13-36-25.png

    2) Environment with simple corridors:

    upload_2020-8-25_13-37-37.png

    3) Environment with a difficulty of 1 (e.g. one of the free tiles is randomly filled):

    upload_2020-8-25_13-38-17.png

    4) Difficulty of 2:

    upload_2020-8-25_13-38-36.png

    Eventually the levels grow in size and are supposed to culminate in something like this:
    upload_2020-8-25_13-40-14.png
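
    For context, if the difficulty were exposed to the trainer as an environment parameter, the built-in curriculum support could step through these levels from the trainer config. A minimal sketch of what that block could look like (the parameter name maze_difficulty and the reward thresholds are placeholders, not my actual setup):

    environment_parameters:
      maze_difficulty:
        curriculum:
          - name: EmptyRoom
            completion_criteria:
              measure: reward
              behavior: Spy
              threshold: 0.8
            value: 0.0
          - name: Corridors
            completion_criteria:
              measure: reward
              behavior: Spy
              threshold: 0.8
            value: 1.0
          - name: FullMaze
            value: 2.0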


    The agent's basic observations are:
    • 8 raycasts in every direction around its body
    • Its own position
    • The location of the nearest exit
    • The distance to the nearest exit
    • The positions of the nearest 10 free tiles
    To give the agent some persistent memory, I created my own version of the stacked vectors provided in the Behavior Parameters component. It stacks the agent's position, but rather than stacking positions over time it stacks them over distance: a position is only added once the agent has moved to a genuinely new location, so the agent retains some memory of where it has been.
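
    Roughly, the stacking logic amounts to something like this (a simplified Python sketch of what the actual agent script does; the capacity and distance threshold values here are placeholders):

    from collections import deque

    class PositionStack:
        """Stacks agent positions over distance travelled rather than over time."""

        def __init__(self, capacity=10, min_distance=1.0):
            self.capacity = capacity          # number of (x, z) positions kept
            self.min_distance = min_distance  # how far the agent must move before a new entry is stored
            self.positions = deque(maxlen=capacity)

        def try_add(self, x, z):
            # Only record the position if it is far enough from every stored position,
            # so circling around within one area does not flush the stack.
            for (px, pz) in self.positions:
                if (x - px) ** 2 + (z - pz) ** 2 < self.min_distance ** 2:
                    return False
            self.positions.append((x, z))
            return True

        def observation(self):
            # Flatten to a fixed-size vector, zero-padded while the stack isn't full.
            obs = []
            for (px, pz) in self.positions:
                obs.extend([px, pz])
            obs.extend([0.0] * (2 * (self.capacity - len(self.positions))))
            return obs

    Each step, try_add is called with the agent's current position and the result of observation() is appended to the vector observations.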

    I have had some moderate success using PPO with these parameters:

    behaviors:
      Spy:
        trainer_type: ppo
        hyperparameters:
          batch_size: 128
          buffer_size: 2048
          learning_rate: 0.0003
          beta: 0.01
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 512
          num_layers: 2
          vis_encode_type: simple
          memory: null
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
          curiosity:
            gamma: 0.99
            strength: 0.2
            encoding_size: 256
            learning_rate: 0.0003
        keep_checkpoints: 5
        checkpoint_interval: 500000
        max_steps: 10000000
        time_horizon: 128
        summary_freq: 30000
        threaded: true


    With these settings the agent can become quite proficient at traversing certain types of maze - ones where the path to the exit is fairly obvious and the potential for going down a dead end is small. However, it often comes undone when it takes a wrong turn into a dead end: the longer the route into the dead end, the less often it seems able to backtrack and find the right path.

    I'm also trying out SAC with these parameters, although it is faring worse than the PPO agent:


    behaviors:
      Spy:
        trainer_type: sac
        hyperparameters:
          learning_rate: 0.0001
          learning_rate_schedule: constant
          batch_size: 64
          buffer_size: 64000
          buffer_init_steps: 0
          tau: 0.005
          steps_per_update: 10.0
          save_replay_buffer: false
          init_entcoef: 0.01
          reward_signal_steps_per_update: 10.0
        network_settings:
          normalize: false
          hidden_units: 20
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        keep_checkpoints: 5
        max_steps: 10000000
        time_horizon: 20
        summary_freq: 30000
        threaded: true


    Whenever I tweak the configuration files (for either PPO or SAC) I seem to make training less effective; I am essentially guessing based on vague intuitions when it comes to the parameters.

    I have tried introducing recurrent memory (an LSTM) by adding these parameters under network_settings:

    memory:
      memory_size: 128
      sequence_length: 64

    However, this failed to improve training.

    Are there any parameters I could change to improve training for this type of scenario? I'm happy to post TensorBoard graphs in the comments if needed (I've already hit the maximum number of images for this post).
     
  2. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Really cool project! The key here is the observation stacking - how many can it see at once, and how do you deal with a variable stack size? If the agent doesn't remember it entered a dead end, it won't know to go back and find another path. Either way, this is a really hard problem for model-free RL.

    If you're using memory (LSTM), I'd remove the stacking, and note that you'll need many more training steps to get a trained model.
     
  3. StewedHarry

    StewedHarry

    Joined:
    Jan 20, 2020
    Posts:
    45
    There are 20 observations in the stack, 2 for each tile position, so it accounts for the 10 most recently visited tiles. This should be plenty for any dead end the agent is likely to find. New positions are only added to the stack if the agent has moved a certain distance away from every position already in the stack, so even if the agent moves around a lot within one area, the stack will not be flushed with new positions from that same area. The remaining entries are padded with zeros if the agent hasn't (or can't) move enough to fill the stack.

    I realise this now. My initial plan for this project was to have the agent evade/sneak past other patrolling agents while traversing this environment. Pathfinding is proving tricky enough.
     
  4. StewedHarry

    StewedHarry

    Joined:
    Jan 20, 2020
    Posts:
    45
    I'll try removing the stack and being more patient with the LSTM - it seems the agent's ability to remember where it has been is crucial.