In depth explanation for time horizon hyperparameter

Discussion in 'ML-Agents' started by Billythekidzz1, Jan 29, 2020.

  1. Billythekidzz1


    Jan 24, 2020
    Hello, I have been having a lot of difficulty understanding how the time horizon hyperparameter works in ML-Agents. From what I've read, time horizon is the length of the trajectory that is used to update the agent's policy. I'm unsure how this works in the context of ML-Agents and was hoping someone could provide a detailed explanation.

    For example, consider a scenario with four goals, each giving a +1 reward, where the agent has to reach all four goals to complete the episode. If the time horizon is 128, the maximum episode length is 1,000, and the agent reaches a goal at steps 300, 600 and 900, how does the time horizon capture the sequence of actions that allows the network to converge towards a solution?

    Additionally, why is it recommended to use a small time horizon for prohibitively large episodes (I'm assuming a prohibitively large episode means an episode with a near-infinite number of steps)? How would this differ from using a large time horizon in this case?

  2. caioc2


    May 11, 2018
    Disclaimer: I can't tell exactly how they implemented it, but I can give you a description of what the horizon is in the literature.

    TL;DR: The horizon, in simple words, is how far back in time steps a present reward can be associated with a past action (or, equivalently, how far forward a past action can be accounted for in a later reward).

    In your case, if you reached the goal at step 300, the probabilities of choosing the actions taken in steps 200 to 300 (that is, with a horizon of about 100 steps) would be updated in proportion to the reward received at step 300, discounted exponentially by gamma, and all actions before step 200 would be ignored in this update. Now, if you have an environment with unbounded episode length and the horizon is too large, each update would need to account for all past actions up to the horizon. Those actions would have to be stored somewhere, which could blow up your RAM or take a long time to process, or worse, the update would give credit to actions that did not contribute to the reward at all.
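
    As a rough sketch of the description above (the literature view, not necessarily how ML-Agents implements it), the credit each past action gets for a reward can be written as a discounted weight that drops to zero outside the horizon. The function name `credit_weights`, the horizon of 100 (chosen to match steps 200 to 300) and gamma = 0.99 are all assumptions made for illustration.

```python
from collections import deque


def credit_weights(reward_step, horizon, gamma):
    """Hypothetical helper: weight given to each past step for one reward.

    Steps further back than `horizon` get no entry (no credit); the rest are
    discounted by `gamma` once per step of distance from the reward.
    """
    first_step = max(0, reward_step - horizon)
    return {t: gamma ** (reward_step - t) for t in range(first_step, reward_step + 1)}


# Assumed values: reward at step 300, horizon 100, gamma 0.99.
weights = credit_weights(reward_step=300, horizon=100, gamma=0.99)
print(min(weights), max(weights))  # earliest and latest credited steps
print(weights[300], weights[250])  # credit shrinks with distance from the reward

# Bounded memory: only the most recent horizon + 1 steps are ever kept,
# no matter how long the episode runs.
recent_steps = deque(maxlen=101)
for step in range(1000):
    recent_steps.append(step)
print(len(recent_steps))
```

    With a larger horizon the dictionary and the buffer both grow, which is the memory and processing cost mentioned above, and the extra entries are for actions ever further from the reward.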
    Last edited: Feb 5, 2020