POCA Group Policy Collapse problem and query assign rewards prior to GroupEpisodeInterrupted()

Discussion in 'ML-Agents' started by JulesVerny, Feb 24, 2022.

  1. JulesVerny
    Hello

    I have set up a prisoner escape scenario in which two agents have to complete a number of sub-tasks to break out (move to crate, push crate to wall, jump on crate, climb wall, etc.). I have set this up as a sequence of 12 sub-objectives. I assign a Group Reward of 0.2f when the prisoners achieve an objective, and then call EndGroupEpisode(). I start training with just the first objective, assigning that first 0.2f reward if the agents succeed. Once the group achieves the current objective around 10 times in succession, I extend the scenario to the next objective level, where there is an opportunity to earn an additional +0.2f reward, but ONLY if they achieve the new objective. So it is a staged progression of difficulty, sequence length and available reward.
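
    The promotion rule described above can be sketched roughly as follows. This is an illustration only, written in Python rather than the project's C#. The class name, field names and the exact threshold of 10 are assumptions taken from the description, not code from the actual project, and reward assignment is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class CurriculumTracker:
    """Hypothetical helper: tracks which objective level the group trains on."""

    num_objectives: int = 12   # the post describes 12 sub-objectives
    promote_after: int = 10    # assumed: "around 10 times in succession"
    level: int = 1             # highest objective currently being attempted
    streak: int = 0            # consecutive successful episodes at this level

    def record_episode(self, success: bool) -> int:
        """Update the success streak and return the level for the next episode."""
        if not success:
            # A failure or timeout breaks the run of consecutive successes.
            self.streak = 0
            return self.level
        self.streak += 1
        if self.streak >= self.promote_after and self.level < self.num_objectives:
            # Enough successes in a row: unlock the next objective.
            self.level += 1
            self.streak = 0
        return self.level
```

    Calling record_episode(True) ten times in a row moves the tracker from level 1 to level 2; a single record_episode(False) in between resets the streak.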

    When the agents fail to achieve the level objective, a maximum action decision count is exceeded and GroupEpisodeInterrupted() is called. This is similar to the Unity Dungeon Escape example: positive rewards are assigned only if the objectives are met, and none are assigned on failure or timeout.

    This seems to work up to a certain level. The POCA group agents learn to achieve each sub-objective consistently and get promoted up the levels. However, at a certain level the policy collapses: the Group Reward, value and baseline loss signals all collapse. See the attached plot below.
    Run5.PNG
    I have checked the level heuristically, and it is not that demanding. But my agents never seem to discover the level objective, so I cannot assign ANY reward for that episode (even though all the previous sub-objectives may have been achieved and their partial rewards accumulated).

    I was wondering whether there is any benefit in assigning a partial reward (for the accumulated sub-objectives) just before calling GroupEpisodeInterrupted(), to recognise the sub-objectives reached so far. It feels intuitive to give some reward for the progress made, but I suspect it would work against penalising the failure to achieve the current level objective, and that assigning any reward is not the intended use of GroupEpisodeInterrupted().

    I do not really understand the difference between EndGroupEpisode() and GroupEpisodeInterrupted() when it comes to assigning Group Rewards before calling either of them.

    Any Suggestions Welcome.
     
    Last edited: Feb 24, 2022
  2. JulesVerny
    Does anyone know the difference between EndGroupEpisode() and GroupEpisodeInterrupted() with respect to any Group Rewards assigned?
    Unity ML-Agents development team?
     
  3. ChillX
    ml-agents/ml-agents/mlagents/trainers/optimizer/torch_optimizer.py

    in function get_trajectory_value_estimates()

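    The done parameter is true only when the episode actually ended. The trainer passes trajectory.done_reached and not trajectory.interrupted, so done is false when the episode was interrupted.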



    if done:
        for k in next_value_estimate:
            if not self.reward_signals[k].ignore_done:
                next_value_estimate[k] = 0.0
        if agent_id in self.critic_memory_dict:
            self.critic_memory_dict.pop(agent_id)



    So, in other words, the next value estimate that PPO uses differs depending on whether the episode was interrupted or ended.

    I'm guessing it would be the same story for POCA.
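
    To make the effect of that zeroing concrete, here is a simplified sketch of how the flag changes the returns the critic is trained on. It uses plain discounted returns rather than the GAE/lambda returns that ML-Agents actually computes, and the function name and parameters are hypothetical, not part of the ML-Agents API.

```python
from typing import List


def discounted_returns(
    rewards: List[float],
    value_next: float,
    done: bool,
    gamma: float = 0.99,
) -> List[float]:
    """Discounted returns for one trajectory (simplified illustration).

    When `done` is set, the bootstrap value is zeroed, mirroring the
    `next_value_estimate[k] = 0.0` line in the quoted snippet. Otherwise
    the critic's estimate for the next state is carried back through
    the trajectory.
    """
    running = 0.0 if done else value_next
    returns = [0.0] * len(rewards)
    # Walk backwards so each step's return includes the discounted future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

    With rewards of [0.0, 0.0], value_next=1.0 and gamma=0.5, the call with done=True gives [0.0, 0.0], while done=False gives [0.25, 0.5]. The same zero-reward trajectory therefore produces different training targets depending on the flag.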
     
  4. ice_creamer
    Can anyone share a trainer.yaml that contains the RSA setting? Much appreciated!