
Not sure about SetReward and AddReward functionality

Discussion in 'ML-Agents' started by Xiromtz, Apr 12, 2020.

  1. Xiromtz

    Joined:
    Feb 1, 2015
    Posts:
    65
    Hey guys,
    Just a quick question about how SetReward and AddReward behave in terms of Reinforcement Learning functionality.

    So obviously, at every timestep (or after a batch of timesteps), the NN receives state + reward for each timestep t.
    If I AddReward 0.1 at timestep 0 and then AddReward 0.1 at timestep 1, is it like: reward t_0 = 0.1, reward t_1 = 0.2? So the cumulative reward would be 0.3?
    And if I SetReward 0.1 at t_0 and at t_1, would my cumulative reward be 0.2?

    Does this mean that if I want to apply a negative reward of -0.1 at every timestep, I should do it with SetReward(-0.1) instead of AddReward(-0.1), since AddReward would not result in a linear growth of the reward but a compounding one?

    Also, should I set the reward in the CollectObservations or the OnActionReceived function? Shouldn't there simply be a dedicated function for this?
     
  2. andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    AddReward and SetReward modify the reward for a single timestep. At the next timestep, the reward is reset to 0.0.

    If r_t is your reward at timestep t (starting from 0.0), AddReward(value) accumulates reward as:

    r_t += value

    whereas SetReward(value) sets reward as:

    r_t = value

    Whether you should set the reward in CollectObservations or OnActionReceived depends on the environment/reward. For per-timestep penalties, we usually use OnActionReceived, since its call count usually corresponds to Max Step, except when the Take Actions Between Decisions checkbox is unchecked on the DecisionRequester.
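    To make the difference concrete, here is a minimal Python sketch of those semantics (a toy stand-in for illustration only, not the actual C# Agent class):

```python
# Toy model of the per-timestep reward semantics described above.
# Illustration only, NOT the real ML-Agents C# Agent class.

class ToyAgent:
    def __init__(self):
        self.step_reward = 0.0        # r_t for the current timestep
        self.cumulative_reward = 0.0  # sum of r_t over the episode

    def add_reward(self, value):
        # AddReward: r_t += value (accumulates within the current step only)
        self.step_reward += value

    def set_reward(self, value):
        # SetReward: r_t = value (overwrites anything set this step)
        self.step_reward = value

    def consume_step(self):
        # At the next timestep the reward is reset to 0.0; it does not
        # carry over between timesteps.
        r = self.step_reward
        self.cumulative_reward += r
        self.step_reward = 0.0
        return r

agent = ToyAgent()
agent.add_reward(0.1)
r0 = agent.consume_step()       # r_0 = 0.1
agent.add_reward(0.1)
r1 = agent.consume_step()       # r_1 = 0.1, not 0.2
print(agent.cumulative_reward)  # 0.2, not 0.3
```

    So AddReward(0.1) at t_0 and at t_1 gives a cumulative reward of 0.2, the same as SetReward(0.1) at each step would.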
     
  3. Xiromtz

    Joined:
    Feb 1, 2015
    Posts:
    65
    @andrewcoh_unity Thanks for the reply!
    Is a "timestep" defined as the DecisionPeriod set in the DecisionRequester,
    where one step within that period would be a single EnvironmentStep call on the Academy (i.e. FixedUpdate by default)?

    Also, I don't quite understand how taking actions between decisions works - isn't the definition of a decision deciding on the next action to take (i.e. one input->output pass through the Python interface)? How is the action vector decided upon between decisions?

    What do you mean by *corresponding to max step*?

    My current understanding of the academy flow is:
    - Setup environment, connect with python, etc.
    Loop:
    - EnvironmentStep is called
    - If there is no episode running, start a new one
    - Once EnvironmentStep count is >= decision period, continue
    - CollectObservations is called, "consuming" the reward up till now, setting it back to 0
    - The received observations and rewards are sent to the python interface to go through the NN
    - Wait for the return of action vectors from python
    - Call OnActionReceived with the actions just received
    - Wait for the next EnvironmentStep and loop

    And somewhere in between batches, the Python side does some backpropagation for training; this is also where freezes occur during training, since we have to wait for this to finish before requesting the next action.
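    In rough Python pseudocode, my understanding of that loop would be something like this (all helper names are made up, and this is a heavy simplification):

```python
# Rough sketch of the stepping loop described above. All helper
# functions are hypothetical stand-ins, not real ML-Agents APIs.

def step_penalty():
    return -0.01          # per-step penalty accrued each FixedUpdate

def collect_observations():
    return [0.0]          # dummy observation vector

def query_policy(obs, reward):
    return 0              # dummy action from the Python side

def apply_action(action):
    pass                  # would move the agent in the real environment

def run_steps(n_steps, decision_period):
    trajectory = []       # (observation, reward) pairs sent to Python
    pending_reward = 0.0
    action = None
    for step in range(n_steps):
        pending_reward += step_penalty()
        if step % decision_period == 0:
            obs = collect_observations()
            # Sending obs + reward "consumes" the reward: reset to 0.
            action = query_policy(obs, pending_reward)
            trajectory.append((obs, pending_reward))
            pending_reward = 0.0
        apply_action(action)  # same action repeats between decisions
    return trajectory

traj = run_steps(10, 5)
print(len(traj))  # 2 decisions in 10 steps with a decision period of 5
```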

    Is something wrong in my assumptions or did I miss something?
     
  4. andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Sorry, I should have been more careful. By timestep, I was referring to the number of fixed updates that occur between decisions since an agent can accumulate reward in any of these fixed updates. After the decision interval has elapsed, the reward is reset to 0.

    The agent's policy will be queried for a new action given the current observation every decision interval. In the DecisionRequester, there is a checkbox for "Take Actions Between". If this is checked, the agent will continue executing the same action in between decisions.

    In the Agent script, a Max Step field is available, which corresponds to the maximum number of fixed updates before a new episode begins. OnActionReceived is called on every fixed update (if Take Actions Between Decisions is checked).
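    As a rough illustration of that toggle's effect (a Python simulation with made-up values, not the actual DecisionRequester code):

```python
# Simulates which action is executed at each fixed update, with and
# without "Take Actions Between Decisions". Hypothetical stand-in code.

def actions_per_fixed_update(n_steps, decision_period, take_actions_between):
    applied = []
    action = None
    for step in range(n_steps):
        if step % decision_period == 0:
            # A new decision: the policy is queried for a fresh action.
            action = step // decision_period  # stand-in for a policy output
        if take_actions_between or step % decision_period == 0:
            applied.append(action)  # action executed this fixed update
        else:
            applied.append(None)    # no action executed this fixed update
    return applied

print(actions_per_fixed_update(6, 3, True))   # [0, 0, 0, 1, 1, 1]
print(actions_per_fixed_update(6, 3, False))  # [0, None, None, 1, None, None]
```

    With the box checked, the last decided action repeats on every fixed update between decisions; unchecked, an action is only executed on decision steps.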

    That understanding of the flow looks good to me. Let me know if I can clarify anything else.
     