
PPO algorithm question: gradient update and Time-Horizon

Discussion in 'ML-Agents' started by MarkTension, Apr 24, 2020.

  1. MarkTension

    MarkTension

    Joined:
    Aug 17, 2019
    Posts:
    43
    Hi all,

I’m trying to understand how exactly gradient updates are implemented, to make sure I’m using the right time_horizon / batch_size settings.

    From the documentation:

    - Batch_size is the number of experiences used for one iteration of a gradient descent update.
    - Time_horizon corresponds to how many steps of experience to collect per-agent before adding it to the experience buffer.
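To make the question concrete, here is a toy sketch of how I currently read those two definitions (this is not ML-Agents’ actual code, and all names are made up): episodes get chopped into segments of at most time_horizon steps, the individual experiences land in a buffer, and each gradient step samples batch_size experiences from it.

```python
import random

def collect_trajectories(episode, time_horizon):
    """Chop one agent's episode into segments of at most time_horizon steps.
    Each segment's experiences are pushed into the experience buffer when the
    segment fills up (or the episode ends early)."""
    buffer = []
    for start in range(0, len(episode), time_horizon):
        buffer.extend(episode[start:start + time_horizon])
    return buffer

# Toy episode of 200 steps with time_horizon=64: segments of 64, 64, 64, 8
# steps, but the buffer still holds all 200 individual experiences.
episode = list(range(200))
buffer = collect_trajectories(episode, time_horizon=64)

# On this reading, batch_size counts *individual experiences* per gradient
# step, not whole horizon-sized segments:
batch = random.sample(buffer, 32)
print(len(buffer), len(batch))
```

Is this reading right, or does a batch element really mean a whole horizon-set?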


If, e.g., batch_size is set to 32 and time_horizon to 64:
Does that mean each of the 32 samples in the batch is one randomly drawn horizon-set of 64 experiences, with one corresponding total (expected) reward for that set?

If so, since a bigger horizon contains more steps and therefore more rewards, will a longer time horizon give greater variance at each gradient update?
In a reward-dense environment I’ll probably also have to decrease my learning rate / extrinsic reward strength when I increase time_horizon, right?
The docs make it seem like the expected reward of a horizon is the expected reward until the end of the entire episode (agent reset?), but that seems strange to me: each horizon would then have a very different expected reward depending on how early in the episode it starts.

    Thanks!
     
    Last edited: Apr 24, 2020
  2. vincentpierre

    vincentpierre

    Joined:
    May 5, 2017
    Posts:
    160
    When a trajectory becomes too long (longer than the time horizon) the trajectory will be truncated and added to the replay buffer but the future expected return at the end of the truncated trajectory will be estimated using the critic. You should not be worried that a shorter time horizon will contain less rewards although you are right to say that a longer time horizon will have more variance and a shorter one will have more bias (since using the critic). It is a trade off with a hard to strike balance. I can't see a reason why the learning rate and extrinsic reward strength should change if the time horizon changes.