ppo agent mean reward decreasing/not increasing

Discussion in 'ML-Agents' started by aideb, May 8, 2021.

  1. aideb

    Hello everyone,

    I'm trying to train a fairly simple agent that follows the player's movements (using ml-agents 1.0.6).

    The agent's position is set in OnEpisodeBegin and is based on the current position of the target (the player):

     transform.localPosition = new Vector3(
         Random.Range(target.transform.localPosition.x - 5, target.transform.localPosition.x + 5),
         Random.Range(target.transform.localPosition.y + 1, target.transform.localPosition.y + 3),
         Random.Range(target.transform.localPosition.z + 1, target.transform.localPosition.z + 4));


    My reward system is this:
    • +1 for when the distance between the player and the agent is less than the specified value
    • -1 when the distance between the player and the agent is equal to or greater than the specified value
    My issue is that when I'm training the agent, the mean reward does not increase over time, but decreases instead. How could I fix this? Any help is appreciated. Thanks in advance!

    My agent code (BulletAgent.cs):

    Code (CSharp):
    using Unity.MLAgents.Sensors;
    using UnityEngine;
    using Unity.MLAgents;

    /// <summary>
    /// Machine learning script for enemy agents
    /// </summary>
    public class BulletAgent : Agent //BaseAgent
    {
        [SerializeField]
        private GameObject target = null;

        private float distanceRequired = 4.5f;

        // Note: despite the name, this is the agent's own Rigidbody (GetComponent on the agent itself).
        private Rigidbody playerRigidbody;

        public override void Initialize()
        {
            playerRigidbody = GetComponent<Rigidbody>();
        }

        public override void OnEpisodeBegin()
        {
            transform.LookAt(target.transform);

            // Spawn the agent at a random offset around the target.
            transform.localPosition = new Vector3(
                Random.Range(target.transform.localPosition.x - 5, target.transform.localPosition.x + 5),
                Random.Range(target.transform.localPosition.y + 1, target.transform.localPosition.y + 3),
                Random.Range(target.transform.localPosition.z + 1, target.transform.localPosition.z + 4));
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            sensor.AddObservation(transform.localPosition);
            sensor.AddObservation(target.transform.localPosition);
            sensor.AddObservation(playerRigidbody.velocity.x);
            sensor.AddObservation(playerRigidbody.velocity.z);
            sensor.AddObservation(playerRigidbody.velocity.y);
        }

        public override void OnActionReceived(float[] vectorAction)
        {
            // x = action 0, y = action 2, z = action 1
            var vectorForce = new Vector3(vectorAction[0], vectorAction[2], vectorAction[1]);

            playerRigidbody.AddForce(vectorForce); //playerRigidbody.AddForce(vectorForce * speed);

            var distanceFromTarget = Vector3.Distance(transform.localPosition, target.transform.localPosition);

            if (distanceFromTarget < distanceRequired)
            {
                SetReward(1);
                EndEpisode();
                Debug.Log("SUCCESS, distance is " + distanceFromTarget + " required is " + distanceRequired);
            }
            else
            {
                SetReward(-1);
                EndEpisode();
                Debug.Log("failure, distance is " + distanceFromTarget + " required is " + distanceRequired);
            }
        }

        public override void Heuristic(float[] actionsOut)
        {
            actionsOut[0] = Input.GetAxis("Horizontal"); // x
            actionsOut[1] = Input.GetAxis("Vertical"); // z
        }
    }

    My trainer configuration (FollowPlayer.yaml):


    behaviors:
      FollowPlayer:
        trainer_type: ppo
        hyperparameters:
          batch_size: 10
          buffer_size: 100
          learning_rate: 3.0e-4
          beta: 0.00001
          epsilon: 0.2
          lambd: 0.99
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        max_steps: 500000
        time_horizon: 64
        summary_freq: 5000


    TensorBoard graphs from my most recent attempt:
     
  2. andrewcoh_unity

    Unity Technologies

    Hi @aideb

    It looks like your episode lengths are each 0. Is there an EndEpisode call in an unintentional place, or is the Max Step on the agent set to 1? The agent cannot learn to do anything if it doesn't have any steps to act within an episode.

    Is this expected, and am I misunderstanding something?
     
  3. aideb

    Hi @andrewcoh_unity, thanks for the reply. What should I set my max step value to? As in, what number for Max Step would work best? :)
     
  4. aideb

    I set the Max Step value in the editor to be the same as max_steps in the configuration file, but the episode length still seems to be 0. The only EndEpisode calls are in the part of the code that checks the agent's distance from the target; there are no other EndEpisode calls anywhere else.
     
  5. andrewcoh_unity

    Unity Technologies

    Max Step is the number of FixedUpdates an episode lasts before the environment and agent are reset (via OnEpisodeBegin). If it is set to 0, the environment will never reset until the agent reaches a termination condition, e.g. achieves its goal. The choice of Max Step depends on the environment, but typically you should choose something large enough that the agent has a reasonable chance of stumbling on its goal by behaving randomly (which is how the agent will discover what to do). For our environments, we usually choose 5000 FixedUpdates and a decision interval of 5, so the agent experiences 1000 timesteps before it is reset.
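
    For reference, here's a minimal sketch of how those two settings relate (illustrative only; in practice Max Step is usually set on the Agent component in the Inspector, and the decision interval on a DecisionRequester component):

    Code (CSharp):
    using Unity.MLAgents;
    using UnityEngine;

    // Hypothetical agent, used only to illustrate the Max Step / decision interval relationship.
    public class FollowPlayerAgent : Agent
    {
        void Awake()
        {
            // 5000 FixedUpdates per episode; with a DecisionRequester whose Decision Period
            // is 5, the agent makes roughly 1000 decisions before OnEpisodeBegin runs again.
            MaxStep = 5000;
        }
    }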

    Is this the issue? Was your max step set to 1?

    Alternatively, is EndEpisode being called immediately on reset? Judging from the way the reward function was described, it's possible that EndEpisode is being called in both cases, which would be a problem. I believe it should only be called when the agent gets within the required distance of the goal and not otherwise.
     
  6. andrewcoh_unity

    Unity Technologies

    max_steps in the configuration file is not the same as Max Step on the agent script. max_steps in the YAML is the total number of training steps (usually in the millions), whereas Max Step on the agent script is the number of FixedUpdates within an episode.

    I know that this is confusing, sorry.
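
    To illustrate the distinction: in the trainer configuration, the run-level budget is

        behaviors:
          FollowPlayer:
            max_steps: 500000   # total training steps for the whole run

    while Max Step on the BulletAgent component in the Inspector (e.g. 5000) limits how many FixedUpdates a single episode can last before the agent is reset.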
     
  7. aideb

    @andrewcoh_unity I've reworked my code and changed the batch_size and buffer_size values in the config file to 128 and 1024 respectively, and now my episode lengths are all above 0. However, my mean reward starts out at around 0.9 at the very beginning of training and stays mostly the same throughout the training instead of increasing (the standard deviation starts at and stays around 0.135). My reward system has remained unchanged, except that the punishment for being too far from the target is now -0.1 instead of -1. Entropy has an overall downward trend, except for a spike here and there. What could be the cause of the mean reward not increasing?

    Updated code:

    Code (CSharp):
    1. using System.Collections;
    2. using System.Collections.Generic;
    3. using UnityEngine;
    4. using Unity.MLAgents;
    5. using Unity.MLAgents.Sensors;
    6.  
    7. public class BulletAgent : Agent
    8. {
    9.     public GameObject target;
    10.     public float strength = 50f;
    11.  
    12.     Rigidbody agentRigidbody;
    13.  
    14.     EnvironmentParameters defaultParams;
    15.  
    16.     public override void Initialize()
    17.     {
    18.         agentRigidbody = gameObject.GetComponent<Rigidbody>();
    19.         defaultParams = Academy.Instance.EnvironmentParameters;
    20.     }
    21.  
    22.     public override void CollectObservations(VectorSensor sensor)
    23.     {
    24.         sensor.AddObservation(target.transform.position);
    25.         sensor.AddObservation(gameObject.transform.position);
    26.     }
    27.  
    28.     public override void OnActionReceived(float[] vectorAction)
    29.     {
    30.         transform.position = new Vector3(Random.Range(target.transform.position.x - 10, target.transform.position.x + 10), Random.Range(target.transform.position.y - 4, target.transform.position.y +4), Random.Range(target.transform.position.z + 1, target.transform.position.z + 5)); //x is from -3 to +3, y is from -2 to +2
    31.     }
    32.  
    33.     public override void Heuristic(float[] actionsOut)
    34.     {
    35.         actionsOut[0] = Input.GetAxis("Horizontal");
    36.         actionsOut[1] = Input.GetKey(KeyCode.Space) ? 1.0f : 0.0f;
    37.         actionsOut[2] = Input.GetAxis("Vertical");
    38.     }
    39.  
    40.     void FixedUpdate()
    41.     {
    42.         var distanceFromTarget = Vector3.Distance(transform.position, target.transform.position);
    43.  
    44.         if (distanceFromTarget < 4.0f)
    45.         {
    46.             SetReward(1f);
    47.             EndEpisode();
    48.         }
    49.  
    50.         else
    51.         {
    52.             SetReward(-0.1f);
    53.         }
    54.     }
    55.  
    56.     private void Update()
    57.     {
    58.         if (orientation.magnitude > float.Epsilon) // note: orientation is not declared anywhere in this snippet
    59.         {
    60.             gameObject.transform.rotation = Quaternion.Lerp(gameObject.transform.rotation,
    61.                 Quaternion.LookRotation(orientation),
    62.                 Time.deltaTime * 10f);
    63.         }
    64.     }
    65.  
    66.     public override void OnEpisodeBegin()
    67.     {
    68.         transform.LookAt(target.transform);
    69.         gameObject.transform.position =
    70.             new Vector3(Random.Range(target.transform.position.x - 5, target.transform.position.x + 5),
    71.             Random.Range(target.transform.position.y - 3, target.transform.position.y + 3),
    72.             Random.Range(target.transform.position.z + 1, target.transform.position.z + 4));
    73.         agentRigidbody.velocity = Vector3.zero;
    74.         var environment = gameObject.transform.parent.gameObject;
    75.  
    76.     }
    77. }
    78.  
    TensorBoard graphs are here.
     
  8. andrewcoh_unity

    Unity Technologies

    It looks like the episode length is still < 1, which shouldn't be the case. Is the agent spawning in the goal state (e.g. distanceFromTarget < 4.0f) and immediately hitting EndEpisode?
     
  9. aideb

    Yes, that could easily happen.
    I've modified the code to avoid situations like this: I've changed how the position is set in OnEpisodeBegin (line 69) to
     gameObject.transform.position = new Vector3(
         Random.Range(target.transform.position.x - 10, target.transform.position.x + 10),
         Random.Range(target.transform.position.y - 6, target.transform.position.y + 6),
         Random.Range(target.transform.position.z + 4, target.transform.position.z + 8));
    and in OnActionReceived (line 30) to
     transform.position = new Vector3(
         Random.Range(transform.position.x - 4, transform.position.x + 4),
         Random.Range(target.transform.position.y - 3, target.transform.position.y + 3),
         Random.Range(target.transform.position.z + 1, target.transform.position.z + 3));
    and now my episode length is above 0, but the mean reward still doesn't seem to increase.
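
    One way to make sure the agent can never spawn inside the success radius (rather than relying on the offset ranges) would be to re-sample the spawn point until it is far enough away. A minimal sketch, reusing the target and agentRigidbody fields and the 4.0f threshold from the code above:

    Code (CSharp):
    public override void OnEpisodeBegin()
    {
        // Keep sampling until the spawn point starts outside the 4.0f success radius,
        // so the episode can't end on its very first step.
        Vector3 spawnPosition;
        do
        {
            spawnPosition = new Vector3(
                Random.Range(target.transform.position.x - 10, target.transform.position.x + 10),
                Random.Range(target.transform.position.y - 6, target.transform.position.y + 6),
                Random.Range(target.transform.position.z + 4, target.transform.position.z + 8));
        }
        while (Vector3.Distance(spawnPosition, target.transform.position) < 4.0f);

        transform.position = spawnPosition;
        transform.LookAt(target.transform);
        agentRigidbody.velocity = Vector3.zero;
    }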
     
  10. ervteng_unity

    Unity Technologies

    Looks like your batch/buffer size is also super small (10/100). I'd try copying the hyperparameters from something similar - maybe the PushBlock or FoodCollector examples?
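
    For reference, an illustrative starting point in the same ballpark as the example environments' PPO configs (values here are approximate, not copied verbatim - check the config files that ship with your ML-Agents release for the exact numbers):

        behaviors:
          FollowPlayer:
            trainer_type: ppo
            hyperparameters:
              batch_size: 128
              buffer_size: 2048
              learning_rate: 3.0e-4
              beta: 5.0e-3
              epsilon: 0.2
              lambd: 0.95
              num_epoch: 3
              learning_rate_schedule: linear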
     
  11. aideb

    Hi @ervteng_unity, I have changed batch and buffer size to 128 and 1024 respectively, but the issue still persists.
     
  12. ervteng_unity

    Unity Technologies

    Try even bigger - 512 and 5120, for instance.

    Also, I noticed that your OnActionReceived method just randomizes the position of the agent - is this intentional? This method should move the agent based on the received actions - right now the agent is just randomly moving around and has no control over its position.
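
    For continuous actions, a minimal sketch of what OnActionReceived could look like instead, so the policy actually controls the movement (this assumes the agent's Rigidbody field from your earlier snippet; the force multiplier is arbitrary):

    Code (CSharp):
    public override void OnActionReceived(float[] vectorAction)
    {
        // Interpret the three continuous actions as a force on the agent's Rigidbody
        // instead of teleporting it to a random position.
        var force = new Vector3(vectorAction[0], vectorAction[1], vectorAction[2]);
        agentRigidbody.AddForce(force * 10f);
    }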
     
  13. aideb

    @ervteng_unity, is this code better? When I tried training this, mean reward went from around 0.08 to around 0.2, but then fell back again to around 0.1 and stayed there for the rest of the training (around 7 million steps in total, config file the same as before).

    Code (CSharp):
    using System.Collections;
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using Unity.MLAgents.Sensors;

    public class BulletAgent : Agent
    {
        public GameObject area;
        public GameObject target;
        public bool useVectorObs;
        Rigidbody m_AgentRb;

        // Note: these two fields are used below but were missing from the snippet as posted.
        StatsRecorder m_statsRecorder;
        int m_Selection;

        public override void Initialize()
        {
            m_AgentRb = GetComponent<Rigidbody>();
            m_statsRecorder = Academy.Instance.StatsRecorder;
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            if (useVectorObs)
            {
                sensor.AddObservation(StepCount / (float)MaxStep);
            }
        }

        IEnumerator GoalScoredSwapGroundMaterial(Material mat, float time)
        {
            yield return new WaitForSeconds(time);
        }

        public void MoveAgent(ActionSegment<int> act)
        {
            var dirToGo = Vector3.zero;
            var rotateDir = Vector3.zero;

            var action = act[0];
            switch (action)
            {
                case 1:
                    dirToGo = transform.forward * 1f;
                    break;
                case 2:
                    dirToGo = transform.forward * -1f;
                    break;
                case 3:
                    dirToGo = transform.up * 1f;
                    break;
                case 4:
                    dirToGo = transform.up * -1f;
                    break;
            }
            transform.Rotate(rotateDir, Time.deltaTime * 150f);
            m_AgentRb.AddForce(dirToGo * 1.5f, ForceMode.VelocityChange);
        }

        public override void OnActionReceived(ActionBuffers actionBuffers)
        {
            AddReward(-1f / MaxStep);
            MoveAgent(actionBuffers.DiscreteActions);
        }

        void OnCollisionEnter(Collision col)
        {
            if (col.gameObject.name == "player")
            {
                SetReward(1f);
                m_statsRecorder.Add("Goal/Correct", 1, StatAggregationMethod.Sum);
                EndEpisode();
            }
        }

        private void FixedUpdate()
        {
            var distanceToTarget = Vector3.Distance(transform.position, target.transform.position);

            if (distanceToTarget > 10)
            {
                SetReward(-0.1f);
                m_statsRecorder.Add("Goal/Wrong", 1, StatAggregationMethod.Sum);
                EndEpisode();
            }
        }

        public override void Heuristic(in ActionBuffers actionsOut)
        {
            var discreteActionsOut = actionsOut.DiscreteActions;
            discreteActionsOut[0] = 0;
        }

        public override void OnEpisodeBegin()
        {
            var agentOffset = -15f;
            var blockOffset = 0f;
            m_Selection = Random.Range(0, 2);

            transform.position = new Vector3(target.transform.position.x + Random.Range(-3f, 3f),
                target.transform.position.y + 1f, target.transform.position.z + Random.Range(1f, 2f));
            transform.rotation = Quaternion.Euler(0f, Random.Range(0f, 360f), 0f);
            m_AgentRb.velocity *= 0f;

            m_statsRecorder.Add("Goal/Correct", 0, StatAggregationMethod.Sum);
            m_statsRecorder.Add("Goal/Wrong", 0, StatAggregationMethod.Sum);
        }
    }

     
  14. ervteng_unity

    Unity Technologies

    How does the agent behave?

    Also note that if you do something in OnCollisionEnter, it will only happen once - so the agent will have to go in and out of the area to achieve more reward.
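
    If the goal were to keep rewarding sustained contact instead, OnCollisionStay fires every physics step while the colliders remain touching. A minimal sketch (the reward value is arbitrary):

    Code (CSharp):
    void OnCollisionStay(Collision col)
    {
        // Called repeatedly while in contact, unlike OnCollisionEnter which fires once per new contact.
        if (col.gameObject.name == "player")
        {
            AddReward(0.01f);
        }
    }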
     
  15. aideb

    @ervteng_unity The agent is supposed to follow the player's movements and try to collide with the player. The reward system is: +1 for each collision with the player, -0.1 every time the distance between the player and the agent becomes greater than 10, and -1/MaxStep per step as an existential penalty (I've changed this from the very first post, as well as most of the code :D ).

    The agent does follow the player and collide with the player, but the mean reward doesn't go above 0.2 when training :(