ppo agent mean reward decreasing/not increasing

Discussion in 'ML-Agents' started by aideb, May 8, 2021.

  1. aideb

    Hello everyone,

    I'm trying to train a fairly simple agent that follows the player's movements (using ml-agents 1.0.6).

    The agent's position is set in OnEpisodeBegin and is based on the current position of the target (the player):

     transform.localPosition = new Vector3(
         Random.Range(target.transform.localPosition.x - 5, target.transform.localPosition.x + 5),
         Random.Range(target.transform.localPosition.y + 1, target.transform.localPosition.y + 3),
         Random.Range(target.transform.localPosition.z + 1, target.transform.localPosition.z + 4));


    My reward system is this:
    • +1 for when the distance between the player and the agent is less than the specified value
    • -1 when the distance between the player and the agent is equal to or greater than the specified value
    My issue is that when I'm training the agent, the mean reward does not increase over time, but decreases instead. How could I fix this? Any help is appreciated. Thanks in advance!

    My agent code (BulletAgent.cs):

    Code (CSharp):
    using Unity.MLAgents.Sensors;
    using UnityEngine;
    using Unity.MLAgents;

    /// <summary>
    /// Machine learning script for enemy agents
    /// </summary>
    public class BulletAgent : Agent //BaseAgent
    {
        [SerializeField]
        private GameObject target = null;

        private float distanceRequired = 4.5f;

        // Note: despite the name, this is the agent's own Rigidbody (GetComponent on the agent itself).
        private Rigidbody playerRigidbody;

        public override void Initialize()
        {
            playerRigidbody = GetComponent<Rigidbody>();
        }

        public override void OnEpisodeBegin()
        {
            transform.LookAt(target.transform);

            // Spawn the agent at a random offset around the target.
            transform.localPosition = new Vector3(
                Random.Range(target.transform.localPosition.x - 5, target.transform.localPosition.x + 5),
                Random.Range(target.transform.localPosition.y + 1, target.transform.localPosition.y + 3),
                Random.Range(target.transform.localPosition.z + 1, target.transform.localPosition.z + 4));
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            sensor.AddObservation(transform.localPosition);
            sensor.AddObservation(target.transform.localPosition);
            sensor.AddObservation(playerRigidbody.velocity.x);
            sensor.AddObservation(playerRigidbody.velocity.z);
            sensor.AddObservation(playerRigidbody.velocity.y);
        }

        public override void OnActionReceived(float[] vectorAction)
        {
            // x = action 0, y = action 2, z = action 1
            var vectorForce = new Vector3(vectorAction[0], vectorAction[2], vectorAction[1]);

            playerRigidbody.AddForce(vectorForce); //playerRigidbody.AddForce(vectorForce * speed);

            var distanceFromTarget = Vector3.Distance(transform.localPosition, target.transform.localPosition);

            if (distanceFromTarget < distanceRequired)
            {
                SetReward(1);
                EndEpisode();
                Debug.Log("SUCCESS, distance is " + distanceFromTarget + " required is " + distanceRequired);
            }
            else
            {
                SetReward(-1);
                EndEpisode();
                Debug.Log("failure, distance is " + distanceFromTarget + " required is " + distanceRequired);
            }
        }

        public override void Heuristic(float[] actionsOut)
        {
            actionsOut[0] = Input.GetAxis("Horizontal"); // x
            actionsOut[1] = Input.GetAxis("Vertical"); // z
        }
    }

    My trainer configuration (FollowPlayer.yaml):


    behaviors:
      FollowPlayer:
        trainer_type: ppo
        hyperparameters:
          batch_size: 10
          buffer_size: 100
          learning_rate: 3.0e-4
          beta: 0.00001
          epsilon: 0.2
          lambd: 0.99
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        max_steps: 500000
        time_horizon: 64
        summary_freq: 5000


    TensorBoard graphs from my most recent attempt:
     
  2. andrewcoh_unity

    Unity Technologies

    Hi @aideb

    It looks like your episode lengths are each 0. Is there an EndEpisode call in an unintentional place, or is the Max Step on the agent set to 1? The agent cannot learn to do anything if it doesn't have any steps to act within an episode.

    Is this expected, and am I misunderstanding something?
     
  3. aideb

    Hi @andrewcoh_unity, thanks for the reply. What should I set my max step value to? As in, what number for Max Step would work best? :)
     
  4. aideb

    I set the Max Step value in the editor to be the same as max_steps in the configuration file, but the episode length still seems to be 0. The only EndEpisode calls are in the part of the code that checks the agent's distance from the target; there are no other EndEpisode calls anywhere else.
     
  5. andrewcoh_unity

    Unity Technologies

    Max Step is the number of FixedUpdates an episode lasts before the environment and agent are reset (via OnEpisodeBegin). If it is set to 0, the environment will never reset until the agent reaches a termination condition, e.g. achieves its goal. The choice of Max Step depends on the environment, but typically you should choose something large enough that the agent has a reasonable chance of stumbling on its goal by behaving randomly (which is how the agent will discover what to do). For our environments, we usually choose 5000 FixedUpdates and a decision interval of 5, so the agent experiences 1000 timesteps before it is reset.
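
    For reference, here's a minimal sketch of how those two settings relate (illustrative only; in practice Max Step is usually set on the Agent component in the Inspector, and the decision interval on a DecisionRequester component):

    Code (CSharp):
    using Unity.MLAgents;
    using UnityEngine;

    // Hypothetical agent, used only to illustrate the Max Step / decision interval relationship.
    public class FollowPlayerAgent : Agent
    {
        void Awake()
        {
            // 5000 FixedUpdates per episode; with a DecisionRequester whose Decision Period
            // is 5, the agent makes roughly 1000 decisions before OnEpisodeBegin runs again.
            MaxStep = 5000;
        }
    }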

    Is this the issue? Was your max step set to 1?

    Alternatively, is EndEpisode being called immediately on reset? Judging from the way the reward function was described, it's possible that EndEpisode is being called in both cases, which would be a problem. I believe it should only be called when the agent gets within the required distance of the goal and not otherwise.
     
  6. andrewcoh_unity

    Unity Technologies

    max_steps in the configuration file is not the same as Max Step on the agent script. max_steps in the YAML is the total number of training steps (usually in the millions), whereas Max Step on the agent script is the number of FixedUpdates within an episode.

    I know that this is confusing, sorry.
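
    To illustrate the distinction: in the trainer configuration, the run-level budget is

        behaviors:
          FollowPlayer:
            max_steps: 500000   # total training steps for the whole run

    while Max Step on the BulletAgent component in the Inspector (e.g. 5000) limits how many FixedUpdates a single episode can last before the agent is reset.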
     
  7. aideb

    @andrewcoh_unity I've reworked my code and changed the batch_size and buffer_size values in the config file to 128 and 1024 respectively, and now my episode lengths are all above 0. However, my mean reward starts out at around 0.9 at the very beginning of training and stays mostly the same throughout the training instead of increasing (the standard deviation starts at and stays around 0.135). My reward system has remained unchanged, except that the punishment for being too far from the target is now -0.1 instead of -1. Entropy has an overall downward trend, except for a spike here and there. What could be the cause of the mean reward not increasing?

    Updated code:

    Code (CSharp):
    1. using System.Collections;
    2. using System.Collections.Generic;
    3. using UnityEngine;
    4. using Unity.MLAgents;
    5. using Unity.MLAgents.Sensors;
    6.  
    7. public class BulletAgent : Agent
    8. {
    9.     public GameObject target;
    10.     public float strength = 50f;
    11.  
    12.     Rigidbody agentRigidbody;
    13.  
    14.     EnvironmentParameters defaultParams;
    15.  
    16.     public override void Initialize()
    17.     {
    18.         agentRigidbody = gameObject.GetComponent<Rigidbody>();
    19.         defaultParams = Academy.Instance.EnvironmentParameters;
    20.     }
    21.  
    22.     public override void CollectObservations(VectorSensor sensor)
    23.     {
    24.         sensor.AddObservation(target.transform.position);
    25.         sensor.AddObservation(gameObject.transform.position);
    26.     }
    27.  
    28.     public override void OnActionReceived(float[] vectorAction)
    29.     {
    30.         transform.position = new Vector3(Random.Range(target.transform.position.x - 10, target.transform.position.x + 10), Random.Range(target.transform.position.y - 4, target.transform.position.y +4), Random.Range(target.transform.position.z + 1, target.transform.position.z + 5)); //x is from -3 to +3, y is from -2 to +2
    31.     }
    32.  
    33.     public override void Heuristic(float[] actionsOut)
    34.     {
    35.         actionsOut[0] = Input.GetAxis("Horizontal");
    36.         actionsOut[1] = Input.GetKey(KeyCode.Space) ? 1.0f : 0.0f;
    37.         actionsOut[2] = Input.GetAxis("Vertical");
    38.     }
    39.  
    40.     void FixedUpdate()
    41.     {
    42.         var distanceFromTarget = Vector3.Distance(transform.position, target.transform.position);
    43.  
    44.         if (distanceFromTarget < 4.0f)
    45.         {
    46.             SetReward(1f);
    47.             EndEpisode();
    48.         }
    49.  
    50.         else
    51.         {
    52.             SetReward(-0.1f);
    53.         }
    54.     }
    55.  
    56.     private void Update()
    57.     {
    58.         if (orientation.magnitude > float.Epsilon) // note: orientation is not declared anywhere in this snippet
    59.         {
    60.             gameObject.transform.rotation = Quaternion.Lerp(gameObject.transform.rotation,
    61.                 Quaternion.LookRotation(orientation),
    62.                 Time.deltaTime * 10f);
    63.         }
    64.     }
    65.  
    66.     public override void OnEpisodeBegin()
    67.     {
    68.         transform.LookAt(target.transform);
    69.         gameObject.transform.position =
    70.             new Vector3(Random.Range(target.transform.position.x - 5, target.transform.position.x + 5),
    71.             Random.Range(target.transform.position.y - 3, target.transform.position.y + 3),
    72.             Random.Range(target.transform.position.z + 1, target.transform.position.z + 4));
    73.         agentRigidbody.velocity = Vector3.zero;
    74.         var environment = gameObject.transform.parent.gameObject;
    75.  
    76.     }
    77. }
    78.  
    TensorBoard graphs are here.
     
  8. andrewcoh_unity

    Unity Technologies

    It looks like the episode length is still < 1, which shouldn't be the case. Is the agent spawning in the goal state (e.g. distanceFromTarget < 4.0f) and immediately hitting EndEpisode?
     
  9. aideb

    Yes, that could easily happen.
    I've modified the code to avoid situations like this: I've changed how the position is set in OnEpisodeBegin (line 69) to
     gameObject.transform.position = new Vector3(
         Random.Range(target.transform.position.x - 10, target.transform.position.x + 10),
         Random.Range(target.transform.position.y - 6, target.transform.position.y + 6),
         Random.Range(target.transform.position.z + 4, target.transform.position.z + 8));
    and in OnActionReceived (line 30) to
     transform.position = new Vector3(
         Random.Range(transform.position.x - 4, transform.position.x + 4),
         Random.Range(target.transform.position.y - 3, target.transform.position.y + 3),
         Random.Range(target.transform.position.z + 1, target.transform.position.z + 3));
    and now my episode length is above 0, but the mean reward still doesn't seem to increase.
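
    One way to make sure the agent can never spawn inside the success radius (rather than relying on the offset ranges) would be to re-sample the spawn point until it is far enough away. A minimal sketch, reusing the target and agentRigidbody fields and the 4.0f threshold from the code above:

    Code (CSharp):
    public override void OnEpisodeBegin()
    {
        // Keep sampling until the spawn point starts outside the 4.0f success radius,
        // so the episode can't end on its very first step.
        Vector3 spawnPosition;
        do
        {
            spawnPosition = new Vector3(
                Random.Range(target.transform.position.x - 10, target.transform.position.x + 10),
                Random.Range(target.transform.position.y - 6, target.transform.position.y + 6),
                Random.Range(target.transform.position.z + 4, target.transform.position.z + 8));
        }
        while (Vector3.Distance(spawnPosition, target.transform.position) < 4.0f);

        transform.position = spawnPosition;
        transform.LookAt(target.transform);
        agentRigidbody.velocity = Vector3.zero;
    }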
     
  10. ervteng_unity

    Unity Technologies

    Looks like your batch/buffer size is also super small (10/100). I'd try copying the hyperparameters from something similar - maybe the PushBlock or FoodCollector examples?
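
    For reference, an illustrative starting point in the same ballpark as the example environments' PPO configs (values here are approximate, not copied verbatim - check the config files that ship with your ML-Agents release for the exact numbers):

        behaviors:
          FollowPlayer:
            trainer_type: ppo
            hyperparameters:
              batch_size: 128
              buffer_size: 2048
              learning_rate: 3.0e-4
              beta: 5.0e-3
              epsilon: 0.2
              lambd: 0.95
              num_epoch: 3
              learning_rate_schedule: linear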
     
  11. aideb

    Hi @ervteng_unity, I have changed batch and buffer size to 128 and 1024 respectively, but the issue still persists.
     
  12. ervteng_unity

    Unity Technologies

    Try even bigger - 512 and 5120, for instance.

    Also, I noticed that your OnActionReceived method just randomizes the position of the agent - is this intentional? This method should move the agent based on the received actions - right now the agent is just randomly moving around and has no control over its position.
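
    For continuous actions, a minimal sketch of what OnActionReceived could look like instead, so the policy actually controls the movement (this assumes the agent's Rigidbody field from your earlier snippet; the force multiplier is arbitrary):

    Code (CSharp):
    public override void OnActionReceived(float[] vectorAction)
    {
        // Interpret the three continuous actions as a force on the agent's Rigidbody
        // instead of teleporting it to a random position.
        var force = new Vector3(vectorAction[0], vectorAction[1], vectorAction[2]);
        agentRigidbody.AddForce(force * 10f);
    }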
     
  13. aideb

    @ervteng_unity, is this code better? When I tried training this, mean reward went from around 0.08 to around 0.2, but then fell back again to around 0.1 and stayed there for the rest of the training (around 7 million steps in total, config file the same as before).

    Code (CSharp):
    using System.Collections;
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using Unity.MLAgents.Sensors;

    public class BulletAgent : Agent
    {
        public GameObject area;
        public GameObject target;
        public bool useVectorObs;
        Rigidbody m_AgentRb;

        // Note: these two fields are used below but were missing from the snippet as posted.
        StatsRecorder m_statsRecorder;
        int m_Selection;

        public override void Initialize()
        {
            m_AgentRb = GetComponent<Rigidbody>();
            m_statsRecorder = Academy.Instance.StatsRecorder;
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            if (useVectorObs)
            {
                sensor.AddObservation(StepCount / (float)MaxStep);
            }
        }

        IEnumerator GoalScoredSwapGroundMaterial(Material mat, float time)
        {
            yield return new WaitForSeconds(time);
        }

        public void MoveAgent(ActionSegment<int> act)
        {
            var dirToGo = Vector3.zero;
            var rotateDir = Vector3.zero;

            var action = act[0];
            switch (action)
            {
                case 1:
                    dirToGo = transform.forward * 1f;
                    break;
                case 2:
                    dirToGo = transform.forward * -1f;
                    break;
                case 3:
                    dirToGo = transform.up * 1f;
                    break;
                case 4:
                    dirToGo = transform.up * -1f;
                    break;
            }
            transform.Rotate(rotateDir, Time.deltaTime * 150f);
            m_AgentRb.AddForce(dirToGo * 1.5f, ForceMode.VelocityChange);
        }

        public override void OnActionReceived(ActionBuffers actionBuffers)
        {
            AddReward(-1f / MaxStep);
            MoveAgent(actionBuffers.DiscreteActions);
        }

        void OnCollisionEnter(Collision col)
        {
            if (col.gameObject.name == "player")
            {
                SetReward(1f);
                m_statsRecorder.Add("Goal/Correct", 1, StatAggregationMethod.Sum);
                EndEpisode();
            }
        }

        private void FixedUpdate()
        {
            var distanceToTarget = Vector3.Distance(transform.position, target.transform.position);

            if (distanceToTarget > 10)
            {
                SetReward(-0.1f);
                m_statsRecorder.Add("Goal/Wrong", 1, StatAggregationMethod.Sum);
                EndEpisode();
            }
        }

        public override void Heuristic(in ActionBuffers actionsOut)
        {
            var discreteActionsOut = actionsOut.DiscreteActions;
            discreteActionsOut[0] = 0;
        }

        public override void OnEpisodeBegin()
        {
            var agentOffset = -15f;
            var blockOffset = 0f;
            m_Selection = Random.Range(0, 2);

            transform.position = new Vector3(target.transform.position.x + Random.Range(-3f, 3f),
                target.transform.position.y + 1f, target.transform.position.z + Random.Range(1f, 2f));
            transform.rotation = Quaternion.Euler(0f, Random.Range(0f, 360f), 0f);
            m_AgentRb.velocity *= 0f;

            m_statsRecorder.Add("Goal/Correct", 0, StatAggregationMethod.Sum);
            m_statsRecorder.Add("Goal/Wrong", 0, StatAggregationMethod.Sum);
        }
    }

     
  14. ervteng_unity

    Unity Technologies

    How does the agent behave?

    Also note that if you do something in OnCollisionEnter, it will only happen once - so the agent will have to go in and out of the area to achieve more reward.
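
    If the goal were to keep rewarding sustained contact instead, OnCollisionStay fires every physics step while the colliders remain touching. A minimal sketch (the reward value is arbitrary):

    Code (CSharp):
    void OnCollisionStay(Collision col)
    {
        // Called repeatedly while in contact, unlike OnCollisionEnter which fires once per new contact.
        if (col.gameObject.name == "player")
        {
            AddReward(0.01f);
        }
    }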
     
  15. aideb

    @ervteng_unity The agent is supposed to follow the player's movements and try to collide with the player. The reward system is: +1 for each collision with the player, -0.1 every time the distance between the player and the agent becomes greater than 10, and -1/MaxStep per step as an existential penalty (I've changed this from the very first post, as well as most of the code :D ).

    The agent does follow the player and collide with the player, but the mean reward doesn't go above 0.2 when training :(