PPO algorithm question: gradient update and Time-Horizon

Discussion in 'ML-Agents' started by MarkTension, Apr 24, 2020.

  1. MarkTension


    Aug 17, 2019
    Hi all,

    I’m trying to understand exactly how gradient updates are implemented, to make sure I’m using the right horizon / batch-size settings:

    From the documentation:

    - Batch_size is the number of experiences used for one iteration of a gradient descent update.
    - Time_horizon corresponds to how many steps of experience to collect per-agent before adding it to the experience buffer.

    If, for example, batch_size is set to 32 and time_horizon to 64:
    Does that mean each of the 32 samples in a batch is one randomly drawn horizon of 64 experiences, with one corresponding total (expected) reward for that set?

    If so, since a bigger horizon contains more steps and therefore more rewards, will a longer time horizon give greater variance at each gradient update?
    In a reward-dense environment I’ll probably also have to decrease my learning rate / extrinsic reward strength when I increase time_horizon, right?
    The docs make it seem like the expected reward of a horizon is the expected reward until the end of the entire episode (agent reset?), but that seems strange to me, since then each horizon would have a very different expected reward depending on how early in the episode it starts.

    Last edited: Apr 24, 2020
  2. vincentpierre


    Unity Technologies

    May 5, 2017
    When a trajectory becomes too long (longer than the time horizon), the trajectory is truncated and added to the replay buffer, but the future expected return at the end of the truncated trajectory is estimated using the critic. You should not worry that a shorter time horizon will contain fewer rewards, although you are right that a longer time horizon will have more variance and a shorter one will have more bias (since it relies more on the critic). It is a trade-off, and the right balance can be hard to strike. I can't see a reason why the learning rate or extrinsic reward strength should change if the time horizon changes.
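
    The bootstrapping described above can be sketched in a few lines. This is an illustrative toy, not the actual ML-Agents implementation: when a trajectory is cut off at the time horizon, the critic's value estimate for the final state stands in for the rewards that would have come after the cut; for a true episode end that bootstrap value would be 0.

    ```python
    import numpy as np

    def bootstrapped_returns(rewards, last_value, gamma=0.99):
        """Discounted return targets for a (possibly truncated) trajectory.

        rewards    : rewards collected up to the truncation point
        last_value : critic estimate V(s_T) of the return remaining after
                     truncation (0.0 if the episode actually terminated)
        """
        returns = np.zeros(len(rewards))
        running = last_value
        # Accumulate discounted returns backwards from the cut-off point.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # A trajectory truncated after 3 steps; the critic predicts 5.0
    # for everything beyond the horizon (gamma kept small for readability).
    print(bootstrapped_returns([1.0, 0.0, 1.0], last_value=5.0, gamma=0.5))
    ```

    This is also why a shorter horizon trades variance for bias: more of each return target comes from the critic's (possibly wrong) estimate rather than from observed rewards.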