
Reward doesn't go up and Entropy goes up

Discussion in 'ML-Agents' started by Frassefrass, Feb 28, 2022.

  1. Frassefrass

    Frassefrass

    Joined:
    Jan 22, 2022
    Posts:
    1
    Hi,

    I'm currently trying to make an AI which can drive 3 cars in an open space without hitting each other or walls. When I'm training, though, I'm not seeing any improvements. With a lot of training it gets a large negative mean reward with a high deviation. I've attached the training stats and my config.

    There are 3 agents driving on the same map, and they get points if they go to previously unvisited places on the map. They get a small negative reward over time and a larger negative reward if they collide with any walls or other cars; if they stay in contact, they keep receiving negative rewards.
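    Roughly, the reward wiring looks like this (simplified sketch; the class name, tags and constants are just placeholders):

    Code (CSharp):

    using Unity.MLAgents;
    using UnityEngine;

    public class CarAgent : Agent
    {
        const float StepPenalty = -0.001f;      // small negative reward over time
        const float CollisionPenalty = -0.1f;   // larger penalty on impact
        const float ExplorationReward = 0.1f;   // reward for reaching an unvisited place

        void FixedUpdate()
        {
            AddReward(StepPenalty);             // time pressure every physics step
        }

        void OnCollisionEnter(Collision c)
        {
            if (c.collider.CompareTag("Wall") || c.collider.CompareTag("Car"))
                AddReward(CollisionPenalty);
        }

        void OnCollisionStay(Collision c)
        {
            // staying in contact keeps accumulating penalties
            if (c.collider.CompareTag("Wall") || c.collider.CompareTag("Car"))
                AddReward(CollisionPenalty * Time.fixedDeltaTime);
        }

        public void OnPlaceVisited()            // called by the exploration tracker (placeholder)
        {
            AddReward(ExplorationReward);
        }
    }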

    The agents have raycasts so they can see walls and other cars, as well as direct knowledge of their own position and the positions of the other cars. The agents also know the closest unexplored place and get extra rewards if they go to it. All of the inputs are normalized and normalize is set to true.
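    The raycasts come from a ray sensor component; the direct positions are added in CollectObservations, roughly like this (again a simplified sketch; the normalization constant is a placeholder):

    Code (CSharp):

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    public class CarAgent : Agent
    {
        public Transform[] otherCars;           // the two other agents
        public Transform closestUnexplored;     // nearest unvisited place
        public float arenaHalfExtent = 50f;     // placeholder, scales positions to roughly [-1, 1]

        public override void CollectObservations(VectorSensor sensor)
        {
            // own position, normalized
            sensor.AddObservation(transform.localPosition / arenaHalfExtent);

            // other cars' positions, normalized
            foreach (var car in otherCars)
                sensor.AddObservation(car.localPosition / arenaHalfExtent);

            // closest unexplored place, normalized
            sensor.AddObservation(closestUnexplored.localPosition / arenaHalfExtent);
        }
    }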

    The agents have 3 continuous outputs, which control steering (-1 to 1), forward acceleration (0 to 1) and backward acceleration (-1 to 0).
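    The action mapping is roughly this (simplified; the Drive call stands in for the actual vehicle controller):

    Code (CSharp):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using UnityEngine;

    public class CarAgent : Agent
    {
        public override void OnActionReceived(ActionBuffers actions)
        {
            float steer    = Mathf.Clamp(actions.ContinuousActions[0], -1f, 1f); // steering
            float forward  = Mathf.Clamp(actions.ContinuousActions[1],  0f, 1f); // forward acceleration
            float backward = Mathf.Clamp(actions.ContinuousActions[2], -1f, 0f); // backward acceleration

            Drive(steer, forward + backward);   // placeholder for the real vehicle controller call
        }

        void Drive(float steer, float throttle)
        {
            // apply steering and throttle to the wheel colliders here
        }
    }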
    Attached: Screenshot 2022-02-28 223506.png, Screenshot 2022-02-28 223613.png
     
  2. weight_theta

    weight_theta

    Joined:
    Aug 23, 2020
    Posts:
    65
    1. 2 million episodes is not enough to train an agent on a continuous control task; try something like 60 million. Driving is tough.
    2. Since your agent will likely do terribly over the first few million epochs, negative rewards will accumulate, i.e. collisions and other bad actions will occur often, which explains why your reward is always negative.
    3. You can clip rewards, or you can reduce the overall penalty so it doesn't accumulate as quickly.
    4. Train for 60 million episodes and see what happens. If the agent still does not learn, reduce the penalty. Still not working? Clip the reward and run it again. You are essentially doing reward shaping and trying to figure out which reward scheme works best for your task (see the clipping sketch below).
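    For example, one way to do the clipping from points 3 and 4 is to route every reward through a small accumulator and clamp it once per step (just a sketch; the names and limits are made up):

    Code (CSharp):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using UnityEngine;

    public class CarAgent : Agent
    {
        float pendingReward;                    // reward collected during the current step

        // Call this instead of AddReward from collision handlers etc.
        void AccumulateReward(float r) => pendingReward += r;

        public override void OnActionReceived(ActionBuffers actions)
        {
            // ... apply actions here ...

            // hand the trainer a clipped per-step reward so a pile-up of
            // collision penalties can't dominate the return
            AddReward(Mathf.Clamp(pendingReward, -1f, 1f));
            pendingReward = 0f;
        }
    }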
     
  3. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    True, as @weight_theta said, you do need to run for a much larger number of steps.

    However, from what I can see, you also have other issues at play.

    Let me try to explain this in the form of a story from the point of view of the agent.

    Chances are it's "probably" (not always) never going to learn.

    The same input is being provided from two sources which compute the values differently: raycasts and direct vector positions. This "usually" (not always) ends up confusing the agent.

    Agent - One second: Hey, I got a raycast hit that says the other car is 5.6 meters away, but wait, the direct input says 5.7 meters away. Which is it? Hmmm, maybe I'll take both inputs with a weight of half and half.

    Agent - Next second:
    Hey, I got a raycast hit that says the other car is 5.6 meters away, but wait, the direct input says 5.4 meters away. Which is it? This does not make sense. Now the average of 5.5 is less than the raycast value.

    Agent - Next second:
    Hey, I got direct input that says the other car is 5.5 meters away, but wait, the raycast hit is missing. So this really does NOT make sense, because now the average of 5.5 and 0 is 2.75.

    Agent: OMG, I'm really confused. Is the other car 5.5 meters away or 2.75 meters away? These inputs are totally meaningless. Are you trying to confuse me on purpose?
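    The usual fix is to give the agent each piece of information from only one source, e.g. let the ray sensor handle walls and other cars and only hand-feed what the rays can't see (sketch; names are placeholders):

    Code (CSharp):

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    public class CarAgent : Agent
    {
        public Transform closestUnexplored;
        public float arenaHalfExtent = 50f;     // placeholder normalization constant

        public override void CollectObservations(VectorSensor sensor)
        {
            // Walls and other cars are already covered by the ray sensor component,
            // so they are deliberately NOT added again here.
            sensor.AddObservation(transform.localPosition / arenaHalfExtent);
            sensor.AddObservation((closestUnexplored.position - transform.position) / arenaHalfExtent);
        }
    }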

    The closest unexplored place may be on the other side of a wall. Or it may not be. Are you providing an extra 0/1 input to the agent to differentiate this?
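    A cheap way to provide that 0/1 input is a line-of-sight check against the wall layer (sketch; the wallMask field and target reference are placeholders):

    Code (CSharp):

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    public class CarAgent : Agent
    {
        public Transform closestUnexplored;
        public LayerMask wallMask;

        public override void CollectObservations(VectorSensor sensor)
        {
            // 1 if a wall sits between the car and the unexplored place, else 0
            bool blocked = Physics.Linecast(transform.position,
                                            closestUnexplored.position,
                                            wallMask);
            sensor.AddObservation(blocked ? 1f : 0f);
        }
    }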

    Agent - One second: Yayyy, unexplored place 7 meters north-west. Let's rush to it and get the reward. Runs straight towards the unexplored place. Gets the reward. Agent is happy.

    Agent - Next minute: Yummmm, that last reward was a nice reward. Yayyy, next unexplored place 7 meters south-west. Let's rush to it. After two meters it crashes into a wall. "WTH", that's not fair. Ouch, that hurts. I don't want these stupid unexplored-place rewards anymore. I'm just gonna run away from them.

    Agent - Next minute: Accidentally stumbles into an unexplored place and gets the reward. Hey, wait a minute, maybe these unexplored places are not that bad. Let's go for the next one.

    Rinse and repeat. An endless cycle of learning and unlearning.
    Until eventually, millions of wall crashes later, it finally learns that it is supposed to drive around the wall. Or the model collapses to determinism (or gradients explode, etc.) before it manages to learn that.

    If all the inputs are already normalized, then set normalize to false in your trainer config. When turned on, it is going to skew your already-normalized inputs using running averages.
     
    Last edited: Mar 1, 2022