
Question about negative rewards

Discussion in 'ML-Agents' started by EternalMe, Jul 9, 2022.

  1. EternalMe

    EternalMe

    Joined:
    Sep 12, 2014
    Posts:
    181
    Let's say there is an episode where I add a bunch of + rewards, but later a -1 and end the episode. So if the overall reward score within `time_horizon` is positive, this -1 will probably not have much impact, right?
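    For concreteness, here is a minimal sketch of that scenario in plain Python, with made-up reward numbers (ten +0.1 rewards, then a terminal -1) and an assumed gamma of 0.99. It prints the discounted return the learner would credit to each step:

    ```python
    # Hypothetical episode: ten small positive rewards, then -1 at the end.
    rewards = [0.1] * 10 + [-1.0]
    gamma = 0.99  # assumed discount factor (the "gamma" trainer hyperparameter)

    def discounted_return(rewards, gamma, t):
        # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

    for t in range(len(rewards)):
        print(f"step {t:2d}: G_t = {discounted_return(rewards, gamma, t):+.3f}")
    ```

    With these numbers, G_0 is only about +0.05: the terminal -1 nearly cancels the accumulated positives, so it is not automatically negligible. Whether it actually changes behaviour depends on the factors discussed in the next reply.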
     
  2. RNiel

    RNiel

    Joined:
    Feb 3, 2021
    Posts:
    1
    It depends a bit on the situation.

    Basically, there are two factors that are most relevant here: how large the negative reward is relative to the preceding positive rewards, and the discount factor. For the discount factor, the main question is whether the agent is planning short-term or long-term.

    For the actions that receive the positive rewards (and that presumably should change to prevent the negative reward), the discounted negative reward needs to be of "sufficient" size compared to the discounted sum of positive rewards. What counts as "sufficient" depends on the task/scenario. If evading the negative reward is relatively easy to do by accident (during exploration) and doesn't necessarily cost any positive rewards, a relatively modest discounted negative reward should suffice.

    If evading the negative reward can't be done without losing positive rewards, a larger discounted negative reward is required to make preventing it "profitable".

    Finally, if evading the negative reward is hard to do by accident and therefore occurs rarely during training, or if the path toward evading it offers less potential positive reward, the policy might not learn to prevent the negative reward at all. In this situation, from the point of view of the learning algorithm, the negative reward might as well be a constant, since it never observes a state in which it didn't receive it.
     
    EternalMe likes this.
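    To put rough numbers on "sufficient" size and on short- vs long-term planning, here is an illustrative Python sketch (the penalty size, the 20-step distance, and the gamma values are all made up). It shows how much a future -1 penalty weighs at the decision point where the agent could still avoid it:

    ```python
    # Illustrative numbers only: how a -1 penalty arriving k steps in the
    # future is weighted at the current decision point, for different gammas.
    penalty = -1.0
    steps_until_penalty = 20

    for gamma in (0.8, 0.95, 0.99):
        weight = gamma ** steps_until_penalty
        print(f"gamma={gamma}: discounted penalty = {weight * penalty:+.3f}")

    # gamma=0.8  -> -0.012  (nearly invisible to a short-sighted policy)
    # gamma=0.95 -> -0.358
    # gamma=0.99 -> -0.818  (close to full size)
    ```

    If the discounted penalty at that point is smaller than the discounted positive rewards the agent would have to give up to avoid it, avoidance is not "profitable" and the policy has no incentive to change.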
  3. EternalMe

    EternalMe

    Joined:
    Sep 12, 2014
    Posts:
    181
    And as I discovered, there is the `gamma` hyperparameter, which actually sets how much future rewards affect the current actions. (clap)
     
    Last edited: Dec 17, 2022
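    For reference, in the ML-Agents trainer config `gamma` is set per reward signal (under `reward_signals` → `extrinsic` in the behavior's YAML). A common rule of thumb, sketched below in Python, is that 1 / (1 - gamma) roughly approximates how many future steps meaningfully influence the current action's value:

    ```python
    # Rule-of-thumb sketch (not an ML-Agents API): 1 / (1 - gamma) roughly
    # approximates the effective planning horizon, in steps.
    for gamma in (0.8, 0.9, 0.99, 0.995):
        print(f"gamma={gamma}: effective horizon ~ {1 / (1 - gamma):.0f} steps")
    ```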