Question: Agent not learning, Cumulative Reward plot is a horizontal line in TensorBoard

Discussion in 'ML-Agents' started by Kikkoceccato, Nov 8, 2023.

  1. Kikkoceccato


    Sep 6, 2023
    Greetings. I have spent the last three months crafting the environment for an Agent which is basically a six-wheeled rover.
    The Rover game object has a Rigidbody attached to it, and uses Wheel Colliders to move (each wheel has its own collider). The "Is Kinematic" flag of the Rigidbody is set to false.

    The rover has a battery and a water tank. The battery drains at a fixed discharge rate as time passes, and water in the tank can be dumped on the ground.
    The rover has three axles, but only the front wheels can steer. It has the following Continuous Actions:
    • Vertical axis (forwards/backwards motion)
    • Horizontal axis (left/right steering)
    • Brake pressed (not truly continuous; as in one of the examples in the official guide, it is read as a 0/1 value):
      Code (CSharp):
      Input.GetKey(KeyCode.Space) ? 1.0f : 0.0f;
    • Water dumped on the ground (Left Shift key pressed, as in the previous example)
    I am trying to train the Rover to complete a lap of a circuit. The circuit has several checkpoints, and all of them must be reached to finish the lap and be able to start a new one. (Each checkpoint is equivalent to a circle with a 1 m radius.)
    The starting point is fixed, and the rover is moved back there at the beginning of each Episode. The rover also starts out facing the first checkpoint.
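
    To make the checkpoint logic above concrete, here is a minimal sketch of how "reach every checkpoint, then start a new lap" could be tracked. It is an assumption, not the project's actual code: the class name CheckpointTracker and its members are hypothetical, checkpoints are assumed to be visited in a fixed sequence, and positions are plain (x, z) floats so the snippet runs outside Unity.

```csharp
using System;

// Hypothetical helper (not from the original project): tracks which checkpoint
// is next and reports when every checkpoint of the lap has been reached.
public sealed class CheckpointTracker
{
    private readonly (float x, float z)[] checkpoints;
    private readonly float radius;

    // Index of the checkpoint the rover has to reach next.
    public int NextIndex { get; private set; }

    public CheckpointTracker((float x, float z)[] checkpoints, float radius = 1f)
    {
        if (checkpoints == null || checkpoints.Length == 0)
            throw new ArgumentException("At least one checkpoint is required.");
        this.checkpoints = checkpoints;
        this.radius = radius; // 1 m circle, as described in the post
    }

    // 2D distance from the rover position to the next checkpoint.
    public float DistanceToNext(float x, float z)
    {
        var (cx, cz) = checkpoints[NextIndex];
        float dx = cx - x;
        float dz = cz - z;
        return (float)Math.Sqrt(dx * dx + dz * dz);
    }

    // Returns true when the rover is inside the next checkpoint's circle and
    // advances to the following checkpoint. lapCompleted is set when the
    // checkpoint just reached was the last one of the lap.
    public bool TryReach(float x, float z, out bool lapCompleted)
    {
        lapCompleted = false;
        if (DistanceToNext(x, z) > radius)
            return false;

        NextIndex++;
        if (NextIndex == checkpoints.Length)
        {
            NextIndex = 0; // a new lap starts from the first checkpoint
            lapCompleted = true;
        }
        return true;
    }

    // Call at the beginning of each Episode.
    public void Reset() => NextIndex = 0;
}
```

    In an Agent, Reset() would be called when the Episode begins and TryReach() wherever the rover position is polled.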

    These are the rewards:
    • Increasing reward that grows with the number of checkpoints reached (starting with 5, adding 5 every time)
    • At a fixed rate (every 1 second), a check is performed to see if the rover is closer to the next checkpoint than it was during the previous check. If it is, then the reward (LastClosestDistance - CurrentDistance) is added and the closest distance up to that point is updated. (This should increase the speed)
    • An additional negative reward, applied during each Update, while the speed stays low:
      Code (CSharp):
      var speed = Mathf.Clamp01(rb.velocity.magnitude / MAX_SPEED);
      if (speed > 0 && speed < 0.1f)
          rover.AddReward(-0.1f);
    • During each Update, a dot product that indicates whether the rover is facing the checkpoint or not is calculated (using the normalized forward vector of the rover and the distance vector to the checkpoint, as usual). The dot product is not used as the reward directly; it is passed through this function, where "dot" is the dot product:
      Code (CSharp):
      float Reward(float dot)
      {
          if (dot > 0)
              return 1 - Mathf.Sqrt(1 - dot * dot);
          else
              return Mathf.Sqrt(1 - dot * dot) - 1;
      }
    This is the plot of the function above. This mechanism should help the rover learn to face its target when moving towards it.
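
    The reward terms listed above can be sketched as plain functions, which makes it easier to check their magnitudes against each other. This is only an illustration under assumptions: the names RewardTerms and ProgressReward are hypothetical, the checkpoint bonus is read as 5 for the first checkpoint, 10 for the second and so on, and the clamp in Facing is an addition that is not in the original Reward function.

```csharp
using System;

// Illustrative sketch only; class and method names are hypothetical.
public static class RewardTerms
{
    // Bonus for the n-th checkpoint reached (1-based): 5, 10, 15, ...
    // Assumes "starting with 5, adding 5 every time" means 5 * n.
    public static float CheckpointBonus(int checkpointsReached)
    {
        return 5f * checkpointsReached;
    }

    // Facing term: same curve as Reward(dot) in the post. The clamp is an
    // addition so that rounding error (|dot| slightly above 1) cannot make
    // the square root return NaN.
    public static float Facing(float dot)
    {
        float d = Math.Max(-1f, Math.Min(1f, dot));
        float s = (float)Math.Sqrt(1f - d * d);
        return d > 0f ? 1f - s : s - 1f;
    }
}

// Progress term: rewards only new "closest distance" records.
public sealed class ProgressReward
{
    private float lastClosestDistance;

    public ProgressReward(float initialDistance)
    {
        lastClosestDistance = initialDistance;
    }

    // Call when the Episode restarts or the target checkpoint changes.
    public void Reset(float initialDistance)
    {
        lastClosestDistance = initialDistance;
    }

    // Call at the fixed check interval (every 1 second in the post).
    // Returns LastClosestDistance - CurrentDistance when the rover got
    // closer than ever before, otherwise 0.
    public float Evaluate(float currentDistance)
    {
        if (currentDistance >= lastClosestDistance)
            return 0f;

        float reward = lastClosestDistance - currentDistance;
        lastClosestDistance = currentDistance;
        return reward;
    }
}
```

    For example, Facing(0.5f) is roughly 0.134 while Facing(1f) is 1, so the curve stays close to zero for most headings and rises steeply only when the rover is almost aligned with (or almost directly facing away from) the checkpoint.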

    These are the observations:
    • Dot product, with respect to next checkpoint (see above)
    • Magnitude of the 2D distance vector between the rover and the next checkpoint
    However, there are six other observations (one of them involving a BufferSensorComponent) which are not needed to complete the circuit. They will be needed in the future, as doing laps around the circuit is not going to be the only task to complete. I am keeping them because a later change in the order of the observations in the observation vector would mean that a new model has to be trained (which is precisely what must be avoided at all costs).
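
    For reference, the two circuit-related observations can be computed as in the sketch below. It is an assumption about the calculation rather than the project's code: RoverObservations and its parameters are hypothetical names, everything is done in the horizontal (x, z) plane, and both vectors are normalized before the dot product so the result stays within [-1, 1].

```csharp
using System;

// Hypothetical helper: computes the two observations described above from
// plain floats, so it can be checked outside Unity.
public static class RoverObservations
{
    // Returns the facing dot product (in [-1, 1]) and the 2D distance
    // between the rover and the next checkpoint.
    public static (float dot, float distance) ToCheckpoint(
        float roverX, float roverZ,
        float forwardX, float forwardZ,
        float targetX, float targetZ)
    {
        float dx = targetX - roverX;
        float dz = targetZ - roverZ;
        float distance = (float)Math.Sqrt(dx * dx + dz * dz);
        float forwardLength = (float)Math.Sqrt(forwardX * forwardX + forwardZ * forwardZ);

        // Degenerate case: rover on top of the checkpoint, or no horizontal
        // forward direction. The direction is undefined, so report 0.
        if (distance < 1e-6f || forwardLength < 1e-6f)
            return (0f, distance);

        // Dot product of the two normalized 2D vectors.
        float dot = (forwardX * dx + forwardZ * dz) / (forwardLength * distance);

        // Guard against rounding error pushing the value outside [-1, 1].
        dot = Math.Max(-1f, Math.Min(1f, dot));
        return (dot, distance);
    }
}
```

    With the rover at the origin facing +z, a checkpoint at (0, 10) gives a dot product of 1 and a distance of 10, while a checkpoint at (10, 0) gives a dot product of 0.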

    This is the Cumulative Reward graph plotted in TensorBoard after 400K steps:
    As you can see, the learning process is unstable and the rover doesn't really seem to be learning anything whatsoever: in the Scene view, the rover is not really heading towards the checkpoint.
    This is not the first training run: previous runs with slightly different observations and rewards produced more satisfactory graphs, and the rover could be seen actually trying to reach the checkpoint (in particular, when the rover was facing North at the beginning of each Episode rather than directly towards its target).

    Do you have any suggestions? Could the ineffective training be caused by the large number of redundant observations? Thanks in advance.
    Last edited: Nov 8, 2023