Individual Reward is Increased but Group Reward is Decreased

Discussion in 'ML-Agents' started by Harry050299, Mar 25, 2021.

  1. Harry050299

    Joined:
    Oct 22, 2020
    Posts:
    3
    I am training a group of 4 agents on an 'encirclement' problem, whereby the agents have to circle a target at a given radius and angular velocity while maintaining a certain formation defined by the angular difference between adjacent neighbors.

    I have an individual reward function which I'm pretty happy with, and it seems to generate relatively good results so far. I also add a small 'hurry up' group penalty on each step, a -1 group reward when an agent goes out of bounds (which also ends the episode), and a +1 group reward when all of the agents are in good formation and are circling the target sufficiently.
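
    To give a concrete picture, the group side is wired up with a SimpleMultiAgentGroup roughly along these lines (the class and field names below are approximate, not my exact code):

    Code (CSharp):
    using System.Collections.Generic;
    using Unity.MLAgents;
    using UnityEngine;

    public class EncirclementEnvController : MonoBehaviour
    {
        public List<Agent> agents;              // the 4 encircling agents
        public int MaxEnvironmentSteps = 5000;  // illustrative episode cap

        private SimpleMultiAgentGroup m_AgentGroup;

        void Start()
        {
            // Register every agent so group rewards and group episode ends apply to all of them.
            m_AgentGroup = new SimpleMultiAgentGroup();
            foreach (var agent in agents)
            {
                m_AgentGroup.RegisterAgent(agent);
            }
        }

        void FixedUpdate()
        {
            // Small 'hurry up' penalty for the whole group on every step.
            m_AgentGroup.AddGroupReward(-0.5f / MaxEnvironmentSteps);
        }

        // Called when an agent leaves the play area.
        public void OnAgentOutOfBounds()
        {
            m_AgentGroup.AddGroupReward(-1f);
            m_AgentGroup.EndGroupEpisode();
        }

        // Called once the formation + circling condition is satisfied.
        public void OnGroupWin()
        {
            m_AgentGroup.AddGroupReward(+1f);
            m_AgentGroup.EndGroupEpisode();
        }
    }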

    However, these are the results that I get:

    [Image: upload_2021-3-25_17-54-15.png — cumulative reward curves from training]

    It seems clear to me that the cumulative individual reward is being properly maximized but, almost oppositely, the cumulative group reward is steadily decreasing, as though that were the intention. For this training I used the following config:

    Code (YAML):
    trainer_type: poca
    hyperparameters:
        batch_size: 1024
        buffer_size: 100000
        learning_rate: 0.0001
        beta: 0.01
        epsilon: 0.2
        lambd: 0.95
        num_epoch: 3
        learning_rate_schedule: constant
    network_settings:
        normalize: false
        hidden_units: 64
        num_layers: 3
        vis_encode_type: simple
        memory: None
    reward_signals:
        extrinsic:
            gamma: 0.99
            strength: 1.0
    keep_checkpoints: 5
    max_steps: 2000000000
    time_horizon: 64
    summary_freq: 50000
    threaded: true
    Is there any reason why it appears as though the group cumulative reward is being minimized? Is it my poor environment design or something in the algorithm that I am using?
     
  2. ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Glad you're checking out the new multi-agent features. The Group Reward is much, much smaller than the individual reward - so the agents have learned to sacrifice it in favor of maximizing the individual rewards.

    I'd try removing the group penalties and/or dramatically decreasing the magnitude of the individual rewards.
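
    For instance (the numbers here are purely illustrative), if each agent can earn up to 1.0 per step over an episode of ~1000 steps, the cumulative individual reward is on the order of several hundred per episode, while the group reward stays in the ±1 range, so the trade-off overwhelmingly favors the individual signal. One rough way to rebalance is to scale the per-step individual reward by the episode length, e.g.:

    Code (CSharp):
    using Unity.MLAgents;

    public class EncirclingAgent : Agent
    {
        // Sketch only: maxEnvironmentSteps stands in for whatever episode-length cap you use.
        public int maxEnvironmentSteps = 1000;

        // Dividing by the episode length keeps the episode-cumulative individual reward
        // roughly in [0, 1], i.e. the same order of magnitude as the +/-1 group reward.
        public void AddScaledStepReward(float stepReward)
        {
            AddReward(stepReward / maxEnvironmentSteps);
        }
    }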
     
  3. Harry050299

    Joined:
    Oct 22, 2020
    Posts:
    3
    Thanks for your reply @ervteng_unity ! I see what you mean. I am adding an individual reward in the range [0, 1] on each step, but only adding a group reward (+1 in the win scenario and -1 in the lose scenario) on the last step of each episode. How should I aim to distribute these individual and group rewards?
    Should the cumulative group reward be of a similar magnitude to the cumulative individual reward?

    What is the best balance for getting the agents to simultaneously maximize the individual reward as well as the group reward?

    On a side note, this slightly confuses me because if each agent does their job extremely well then the group *should* 'win'.
     
  4. Harry050299

    Joined:
    Oct 22, 2020
    Posts:
    3
    I have altered how I give rewards, but I am still having problems converging both the agent reward and the group reward simultaneously.

    I simplified the problem so that now I just want the agents to move to the radius of an imaginary circle surrounding a target. The radius of this circle is determined beforehand. I use the following reward function to do this:

    Code (CSharp):
    private float EncirclementRadiusReward(float radiusError)
    {
        if (radiusError <= 0.01f)
        {
            return (-1f * radiusError) + maxReward;
        }

        return 0f;
    }
    Where the radius error is given by the following function:

    Code (CSharp):
    public float EncirclementRadiusError()
    {
        var radius = Vector3.Distance(transform.position, m_targetPos);
        var radiusErr = Mathf.Abs(radius - desiredEncirclementRadius);

        // normalise radiusError
        var maxErr = maxDistToTarget - desiredEncirclementRadius;
        const float minErr = 0f;

        return NormaliseFloat(radiusErr, maxErr, minErr);
    }
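
    For reference, NormaliseFloat is just a standard min-max normalisation, roughly:

    Code (CSharp):
    // Roughly what NormaliseFloat does: map a value in [min, max] onto [0, 1].
    private float NormaliseFloat(float value, float max, float min)
    {
        return (value - min) / (max - min);
    }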
    This reward is applied to each agent individually. I also apply a 'hurry up' penalty to each agent at each step:

    Agent.AddReward(-0.5f / MaxEnvironmentSteps);


    In terms of group rewards, I give the same hurry-up penalty to the entire group, and I add a group reward of +1 on an episode win and -1 on an episode loss (either when an agent moves out of bounds or when the number of steps exceeds the max).

    A group win is given if all agents satisfy the following condition:

    Agent.EncirclementRadiusError() <= 0.01f


    (notice how the agent only receives a positive reward when the radius error is less than 0.01f, and the group only wins when all agents have radius error <= 0.01f).
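
    The win/lose check itself is run from the environment controller each step, roughly like this (field names are approximate):

    Code (CSharp):
    // Rough sketch of the group win/lose check (out-of-bounds losses are handled separately).
    void CheckGroupWinOrLose()
    {
        bool allInPosition = true;
        foreach (var agent in m_agents)   // m_agents: the registered encircling agents
        {
            if (agent.EncirclementRadiusError() > 0.01f)
            {
                allInPosition = false;
                break;
            }
        }

        if (allInPosition)
        {
            // Every agent is on the circle: group win.
            m_AgentGroup.AddGroupReward(+1f);
            m_AgentGroup.EndGroupEpisode();
        }
        else if (m_stepCount >= MaxEnvironmentSteps)
        {
            // Ran out of time: group loss.
            m_AgentGroup.AddGroupReward(-1f);
            m_AgentGroup.EndGroupEpisode();
        }
    }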

    However, when I run training using the same config as above, the following results are observed:

    [Image: upload_2021-3-28_17-51-52.png — cumulative reward curves from training]

    I really can't understand how this is happening. Is there something that I am missing here?

    Thank you