Stabilising rewards towards end of training.

Discussion in 'ML-Agents' started by UltimateTom, Apr 15, 2020.

  1. UltimateTom

    Joined: Jan 22, 2019
    Posts: 3
    I have an agent that I am training to navigate a short course. It receives a small positive reward at each checkpoint for making progress towards its end goal, a reward of 1 for reaching the finish, and a small negative step penalty every decision. The agent trains well and learns quite quickly how to reach the end; however, I'm having trouble getting consistent results towards the end of training and am unsure how to maximise its potential.
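
    The reward wiring is roughly like the sketch below (simplified: the checkpoint/finish hooks and the exact reward values here are placeholders, and I'm using the Release 1 style API names, which differ slightly in older ML-Agents versions):

    using Unity.MLAgents;
    using UnityEngine;

    public class CourseAgent : Agent
    {
        // Placeholder values, not the actual tuned rewards.
        const float CheckpointReward = 0.05f; // small positive reward per checkpoint
        const float FinishReward = 1f;        // reaching the finish
        const float StepPenalty = -0.0025f;   // small negative reward each decision

        public override void OnActionReceived(float[] vectorAction)
        {
            // Two discrete branches: branch 0 has 3 options, branch 1 has 2.
            int branch0 = (int)vectorAction[0];
            int branch1 = (int)vectorAction[1];
            Move(branch0, branch1);

            // Small time penalty, so finishing sooner yields a higher return.
            AddReward(StepPenalty);
        }

        // Called from a checkpoint trigger (hypothetical hook).
        public void OnCheckpointPassed() => AddReward(CheckpointReward);

        // Called from the finish line trigger (hypothetical hook).
        public void OnFinishReached()
        {
            AddReward(FinishReward);
            EndEpisode();
        }

        void Move(int branch0, int branch1) { /* movement code omitted */ }
    }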

    Because my environment is very time sensitive and the agent travels quickly, the difference between a "great" score (0.95, ~23 seconds) and a "good" score (0.90, ~25 seconds) becomes quite small: both receive the same total "checkpoint" reward and differ only in the step penalties they accumulate. Episode length might vary by 20 or so steps (980 for good, 960 for great), which can mean a gap of a couple of seconds in the finish time.

    Do you have any guidance on how I could reduce the variance in results towards the end of training so they are more consistent, or perhaps a way of communicating to the agent how important each step taken becomes as the episode duration shortens?

    Some training info below.

    The agent quickly learns to maximise rewards.

    Training01.png

    The results towards the end of training can have a high variance.

    Training02.png

    The agent has 2 discrete action branches, with 3 and 2 actions respectively.
    222 Vector Observations.
    Decision Interval 3.

    (Non Default) PPO Training Parameters:
    batch_size: 128
    beta: 1.0e-2
    buffer_size: 20480
    epsilon: 0.1
    hidden_units: 512
    time_horizon: 512
    max_steps: 1e7
    num_epoch: 3
     
  2. andrewcoh_unity

    Unity Technologies

    Joined: Sep 5, 2019
    Posts: 162
    Can you send your policy's entropy curve? It may help to lower beta a bit since this determines how 'random' your policy will be. Also, you can likely use a larger batch size for such a large buffer, say a batch size of 2048.
     
  3. UltimateTom

    Joined: Jan 22, 2019
    Posts: 3
    I had initially set the batch_size low, as per the PPO configuration guide's recommendations regarding discrete vs. continuous actions. I take it I should have been using a smaller buffer_size when the batch_size is that small?

    I completed a sample run with the suggested parameter changes:
    batch_size: 2048
    beta: 1.0e-3

    The agent initially took a little longer to "figure out" the course:

    culmulativereward.png

    But it ended up in a similar pattern. You can see that, had the training run ended two steps earlier, the trained agent would have been much slower.

    finalreward.png

    The entropy graph from this run.
    entropy.png


    Would there be some merit in attempting to widen the reward space somehow towards the end of my training, or should the agent ideally be able to cope as it is?
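
    For example, I could imagine scaling part of the finish reward by how many steps were left, so that finishing 20 steps earlier is visibly worth more. Something like this (just a sketch with a placeholder bonus size, assuming the Release 1 style MaxStep/StepCount properties on Agent and a non-zero MaxStep):

    public void OnFinishReached()
    {
        // Fraction of the episode's step budget left over at the finish.
        float stepsRemaining = MaxStep - StepCount;
        float timeBonus = 0.5f * (stepsRemaining / MaxStep); // up to +0.5 for a very fast finish

        AddReward(1f + timeBonus);
        EndEpisode();
    }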