
Question: Configuring training parameters for a maze-like environment.

Discussion in 'ML-Agents' started by StewedHarry, Aug 25, 2020.

  1. StewedHarry

    StewedHarry

    Joined:
    Jan 20, 2020
    Posts:
    45
    I'm trying to create an agent which can traverse a maze-like environment, but the resulting behaviour has not lived up to my expectations.

    I have developed an environment-generating script which produces random maze-like levels. These can be varied in difficulty, both in terms of overall size and the number of obstacles and dead ends.

    Here is an example curriculum:

    1) Initial empty environment with two exits and no obstacles:
    upload_2020-8-25_13-36-25.png

    2) Environment with simple corridors:

    upload_2020-8-25_13-37-37.png

    3) Environment with a difficulty of 1 (e.g. one of the free tiles is randomly filled):

    upload_2020-8-25_13-38-17.png

    4) Difficulty of 2:

    upload_2020-8-25_13-38-36.png

    Eventually the levels grow in size and are supposed to culminate in something like this:
    upload_2020-8-25_13-40-14.png
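
    For context, if the difficulty were exposed to the trainer as an environment parameter, the built-in curriculum support could step through these levels from the trainer config. A minimal sketch of what that block could look like (the parameter name maze_difficulty and the reward thresholds are placeholders, not my actual setup):

    environment_parameters:
      maze_difficulty:
        curriculum:
          - name: EmptyRoom
            completion_criteria:
              measure: reward
              behavior: Spy
              threshold: 0.8
            value: 0.0
          - name: Corridors
            completion_criteria:
              measure: reward
              behavior: Spy
              threshold: 0.8
            value: 1.0
          - name: FullMaze
            value: 2.0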


    The agent's basic observations are:
    • 8 raycasts in every direction around its body
    • Its own position
    • The location of the nearest exit
    • The distance to the nearest exit
    • The positions of the nearest 10 free tiles
    To give the agent some persistent memory, I created my own version of the stacked vectors provided in the Behavior Parameters component. It stacks the agent's position, but rather than stacking positions over time it stacks them over distance: a position is only added once the agent has moved to a genuinely new location, so the agent retains some memory of where it has been.
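
    Roughly, the stacking logic amounts to something like this (a simplified Python sketch of what the actual agent script does; the capacity and distance threshold values here are placeholders):

    from collections import deque

    class PositionStack:
        """Stacks agent positions over distance travelled rather than over time."""

        def __init__(self, capacity=10, min_distance=1.0):
            self.capacity = capacity          # number of (x, z) positions kept
            self.min_distance = min_distance  # how far the agent must move before a new entry is stored
            self.positions = deque(maxlen=capacity)

        def try_add(self, x, z):
            # Only record the position if it is far enough from every stored position,
            # so circling around within one area does not flush the stack.
            for (px, pz) in self.positions:
                if (x - px) ** 2 + (z - pz) ** 2 < self.min_distance ** 2:
                    return False
            self.positions.append((x, z))
            return True

        def observation(self):
            # Flatten to a fixed-size vector, zero-padded while the stack isn't full.
            obs = []
            for (px, pz) in self.positions:
                obs.extend([px, pz])
            obs.extend([0.0] * (2 * (self.capacity - len(self.positions))))
            return obs

    Each step, try_add is called with the agent's current position and the result of observation() is appended to the vector observations.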

    I have had some moderate success using PPO with these parameters:

    behaviors:
      Spy:
        trainer_type: ppo
        hyperparameters:
          batch_size: 128
          buffer_size: 2048
          learning_rate: 0.0003
          beta: 0.01
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 512
          num_layers: 2
          vis_encode_type: simple
          memory: null
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
          curiosity:
            gamma: 0.99
            strength: 0.2
            encoding_size: 256
            learning_rate: 0.0003
        keep_checkpoints: 5
        checkpoint_interval: 500000
        max_steps: 10000000
        time_horizon: 128
        summary_freq: 30000
        threaded: true


    With these settings the agent can become quite proficient at traversing certain types of maze - ones where the path to the exit is fairly obvious and the potential for going down a dead end is small. However, it often comes undone when it takes a wrong turn into a dead end: the longer the route into the dead end, the less often it seems able to backtrack and find the right path.

    I'm also trying out SAC with these parameters, although it is faring worse than the PPO agent:


    behaviors:
      Spy:
        trainer_type: sac
        hyperparameters:
          learning_rate: 0.0001
          learning_rate_schedule: constant
          batch_size: 64
          buffer_size: 64000
          buffer_init_steps: 0
          tau: 0.005
          steps_per_update: 10.0
          save_replay_buffer: false
          init_entcoef: 0.01
          reward_signal_steps_per_update: 10.0
        network_settings:
          normalize: false
          hidden_units: 20
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        keep_checkpoints: 5
        max_steps: 10000000
        time_horizon: 20
        summary_freq: 30000
        threaded: true


    Whenever I tweak the configuration files (for either PPO or SAC) I seem to make training less effective; I am essentially guessing based on vague intuitions when it comes to the parameters.

    I have tried introducing recurrent memory (an LSTM) by adding these parameters under network_settings:

    memory:
      memory_size: 128
      sequence_length: 64

    However, this failed to improve training.

    Are there any parameters I could change to improve training for this type of scenario? I'm happy to post TensorBoard graphs in the comments if needed (I've already hit the maximum number of images for this post).
     
  2. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Really cool project! The key here is the observation stacking - how many can it see at once, and how do you deal with a variable stack size? If the agent doesn't remember it entered a dead end, it won't know to go back and find another path. Either way, this is a really hard problem for model-free RL.

    If you're using memory (LSTM), I'd remove the stacking, and note that you'll need many more training steps to get a trained model.
     
  3. StewedHarry

    StewedHarry

    Joined:
    Jan 20, 2020
    Posts:
    45
    There are 20 observations in the stack, 2 for each tile position, so it accounts for the 10 most recently visited tiles. This should be plenty for any dead end the agent is likely to find. New positions are only added to the stack if the agent has moved a certain distance away from every position already in the stack, so even if the agent moves around a lot within one area, the stack will not be flushed with new positions from that same area. The remaining entries are padded with zeros if the agent hasn't (or can't) move enough to fill the stack.

    I realise this now. My initial plan for this project was to have the agent evade/sneak past other patrolling agents while traversing this environment. Pathfinding is proving tricky enough.
     
  4. StewedHarry

    StewedHarry

    Joined:
    Jan 20, 2020
    Posts:
    45
    I'll try removing the stack and being more patient with the LSTM - it seems the agent's ability to remember where it has been is crucial.