Stabilising rewards towards end of training.

Discussion in 'ML-Agents' started by UltimateTom, Apr 15, 2020.

  1. UltimateTom


    Jan 22, 2019
    I have an agent I am training to navigate a short course. It receives a small positive reward at checkpoints for making progress towards its end goal, a reward of 1 for reaching the finish, and a small negative step penalty on every decision. The agent trains well and learns quite quickly how to reach the end; however, I'm having trouble getting consistent results towards the end of training and am unsure how to maximise its potential.

    Because my environment is very time sensitive and the agent travels quickly, the difference between a "great" score (0.95, ~23 seconds) and a "good" score (0.90, ~25 seconds) becomes quite small: both runs receive the same total "checkpoint" reward and differ only in accumulated step penalties. Episode length might vary by only 20 or so steps (980 for good, 960 for great), yet that can mean a gap of seconds in the finish time.
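    To put rough numbers on how narrow that return gap is (a sketch only: the per-decision penalty below is a hypothetical value back-calculated from the 0.95 vs 0.90 scores quoted above, not the actual config):

```python
# Hypothetical step penalty, back-calculated so that a 20-step
# difference accounts for the quoted 0.95 vs 0.90 score gap.
STEP_PENALTY = 0.0025

# Both runs earn the same checkpoint + finish reward, so the entire
# return difference comes from the extra step penalties.
gap_steps = 980 - 960
return_gap = gap_steps * STEP_PENALTY
print(return_gap)  # ~0.05 of return for ~2 seconds of real time
```

    So roughly two seconds of finish time is worth only about 0.05 of return, which is easily lost in the noise of PPO's advantage estimates.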

    Do you have some guidance for how I could flatten the result towards the end of training to be more consistent, or perhaps a way of communicating to the agent how important the difference in each step taken becomes as the episode duration shortens?
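    One common shaping trick (a sketch under assumptions, not ML-Agents API: the function, the episode cap, and the bonus weight are all hypothetical) is to fold the finish time into the terminal reward, so a faster finish moves the return by more than the saved step penalties alone:

```python
MAX_STEPS = 1000  # assumed episode cap, hypothetical


def finish_reward(steps_taken, base=1.0, time_bonus=1.0):
    """Terminal reward that grows the faster the agent finishes.

    A 20-step improvement now adds time_bonus * 20 / MAX_STEPS
    of return on top of the step-penalty saving.
    """
    return base + time_bonus * (MAX_STEPS - steps_taken) / MAX_STEPS


print(finish_reward(960))  # 1.04
print(finish_reward(980))  # 1.02
```

    Raising time_bonus widens the gap between "good" and "great" runs without touching the checkpoint rewards that got the agent to the finish in the first place.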

    Some training info below.

    The agent quickly learns to maximise rewards.


    The results towards the end of training can have a high variance.


    The agent has 2 discrete action branches with 3 and 2 actions respectively.
    222 Vector Observations.
    Decision Interval 3.

    (Non Default) PPO Training Parameters:
    batch_size: 128
    beta: 1.0e-2
    buffer_size: 20480
    epsilon: 0.1
    hidden_units: 512
    time_horizon: 512
    max_steps: 1e7
    num_epoch: 3
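    For reference, these overrides as they would sit in the trainer configuration file of that ML-Agents era (the behavior name is a placeholder; only the listed keys are from the post):

```yaml
MyBehaviorName:   # placeholder, actual behavior name not given
    trainer: ppo
    batch_size: 128
    beta: 1.0e-2
    buffer_size: 20480
    epsilon: 0.1
    hidden_units: 512
    time_horizon: 512
    max_steps: 1e7
    num_epoch: 3
```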
  2. andrewcoh_unity


    Unity Technologies

    Sep 5, 2019
    Can you send your policy's entropy curve? It may help to lower beta a bit since this determines how 'random' your policy will be. Also, you can likely use a larger batch size for such a large buffer, say a batch size of 2048.
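    The batch-size suggestion can be sanity-checked arithmetically: PPO splits the buffer into minibatches of batch_size and makes num_epoch passes over it, so the gradient-step count per filled buffer is (a sketch of the counting, not trainer code):

```python
def gradient_steps(buffer_size, batch_size, num_epoch=3):
    # Minibatches per epoch, times the number of passes over the buffer.
    return (buffer_size // batch_size) * num_epoch


print(gradient_steps(20480, 128))   # 480 updates per buffer
print(gradient_steps(20480, 2048))  # 30 updates per buffer
```

    With batch_size 128 the policy takes 480 updates from each buffer of data, which can make late-training updates noisier than the 30 larger, smoother steps the suggested batch_size of 2048 would give.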
  3. UltimateTom


    Jan 22, 2019
    I had initially set the batch_size low as per the PPO configuration guide's recommendations for discrete vs continuous actions. I take it I should also have been using a smaller buffer_size with such a small batch_size?

    I completed a sample run with the suggested parameter changes:
    batch_size: 2048
    beta: 1.0e-3

    The agent took a little longer to initially "figure out" the course;


    But it ended up in a similar pattern. You can see that, had training ended two steps earlier, the resulting agent would have been much slower.


    The entropy graph from this run.

    Would there be some merit in attempting to widen the reward space somehow towards the end of training, or should the agent ideally cope on its own?