Why is the agent having such a hard time learning this behavior?

Discussion in 'ML-Agents' started by kurismakku, Jan 21, 2021.

  1. kurismakku


    Joined:
    Sep 18, 2013
    Posts:
    66
    Hi everyone, I have been enjoying trying out ML-Agents. This is what I am trying to achieve.

    Basic goal: the agent should learn to bring gold from the mine to the base. He needs to be able to do that back and forth: take 1 gold from the mine, bring it to the base, and go back to the mine for more.
    Intermediate goal: the agent should take the shortest path between the mine and the base, since that's more efficient.

    So far I have tried a few different approaches to achieve this, and I am not sure whether the problem is in the script/reward system or in some other settings.
    Video of how the agent behaves:
    https://www.dropbox.com/s/ccgw4gnf1gbmwkt/2021-01-22 00-13-54.mkv?dl=0

    If the agent is given a Vector3 to observe, does he internally understand that this specific Vector3 belongs to an object? If he doesn't, how can he ever learn to follow this Vector3 when its position changes? And if he is observing the Vector3 positions of multiple objects, how can he ever tell those vectors apart?

    I have tried a few approaches so far. One was to give him a reward only after he visits both the mine and the base. He would get a reward for every completed pair, and his life would also be extended so he can make more trips. If he doesn't manage to complete a mine-base pair in time, he dies because his life expires. He doesn't receive a negative reward for dying, but he does receive a small negative reward while he is alive. I also tried ending the episode after 1 delivery, but when testing, the agent wasn't able to do multiple deliveries.
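
    In simplified C#, the scheme looks roughly like this (just a sketch of the idea, not the exact script, which is linked below; the tag names and numbers are placeholders):

    using UnityEngine;
    using Unity.MLAgents;

    public class FarmerAgent : Agent
    {
        float lifeRemaining;
        bool carryingGold;

        public override void OnEpisodeBegin()
        {
            lifeRemaining = 30f;          // placeholder lifetime in seconds
            carryingGold = false;
        }

        void Update()
        {
            AddReward(-0.001f);           // small negative reward while alive
            lifeRemaining -= Time.deltaTime;
            if (lifeRemaining <= 0f)
                EndEpisode();             // dying ends the episode, no extra penalty
        }

        void OnTriggerEnter(Collider other)
        {
            if (other.CompareTag("Mine") && !carryingGold)
            {
                carryingGold = true;      // pick up 1 gold at the mine
            }
            else if (other.CompareTag("Base") && carryingGold)
            {
                carryingGold = false;     // deliver it at the base
                AddReward(1.0f);          // reward only after a full mine -> base trip
                lifeRemaining += 15f;     // extend life so more trips fit in the episode
            }
        }
    }
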
    The distance from the agent's starting position to the mine and to the base is always the same, to make the training reward structure more consistent.
    I also tried giving the agent additional observations, like the distance between the mine and the agent, the distance between the base and the agent, which type of collision happened, which object he collided with, etc. I am not sure if these are needed, but without them it didn't work either.

    The code of the agent:
    https://pastebin.com/ps1b5cp4

    Settings:
    https://www.dropbox.com/s/smm84x2q9003iw0/agent settings.png?dl=0

    I think the problem might be in the settings. In initial tests there were many mines and bases present, so I increased the vector observation space size by 3 for every mine and base, since I had to track their Vector3 positions. Also, I just noticed I changed the vector action space size to 4; from my understanding this should be 2, since the agent is controlled by 2 inputs, x and z. But in an earlier test I had it at 2, and the agent still had problems.

    behaviors:
      Farmer:
        trainer_type: ppo
        hyperparameters:
          batch_size: 10
          buffer_size: 100
          learning_rate: 3.0e-4
          beta: 5.0e-4
          epsilon: 0.2
          lambd: 0.99
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        max_steps: 500000000000000000000000000000
        time_horizon: 64
        summary_freq: 10000

    I hope someone can help, thanks in advance!
     
    Last edited: Jan 21, 2021
  2. kurismakku


    Joined:
    Sep 18, 2013
    Posts:
    66
    I tried with Vector Action Space Size = 2 and 2.5 million steps; this is the result:
    https://www.dropbox.com/s/de62nd0jrimb9ah/2021-01-22 20-12-32.mkv?dl=0

    Now I am trying it again with a smaller distance and with Vector Observation Space Size set to 50. The idea is that the agent can reflect on the whole path that got him to that point, though I am not sure it will help. I am hoping it will figure out something like this: "If I moved in one direction without changing direction much, I was able to approach the target quicker, and that gave me a bigger reward."
     
  3. Luke-Houlihan


    Joined:
    Jun 26, 2007
    Posts:
    303
    Hi @kurismakku,

    At first glance I noticed a few things going on here -

    [EDIT] This one is incorrect, I skimmed the code too fast.
    Your agent's observations are in both local (the agent's position) and global (the mines/bases) coordinates. This means that every training session the agent must teach itself how to convert global positions to local ones, along with teaching itself what Euclidean geometry even is in the first place. I would recommend using positions relative to the agent and not using any local (object) or global ones.
    [/EDIT]

    You have a ton of observations (120) that seem to be unused. I see 1 mine and 1 base in your video, so there should be 6 observations, or 2 vectors, for these (plus the type of collectable stuff). Every padded (empty) observation you add is an observation that the agent will try to connect to an action. The agent has no idea that these values don't actually mean anything, so it will take a bunch of time (trial and error) to learn that they mean nothing. I would recommend using the simplest case - one mine, one base - and refining it until you are happy with the behavior, then adding complexity slowly (1 thing at a time) to see what effect it has on the existing intended behavior.
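
    For that simple case, the observation code can be as small as something like this (just a sketch; I'm assuming you hold references to the single mine and base on the agent and track some carryingGold flag):

    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    public class FarmerAgent : Agent
    {
        public Transform mine;       // assumed reference to the single mine
        public Transform goldBase;   // assumed reference to the single base
        bool carryingGold;           // whatever state you use for the collectable

        public override void CollectObservations(VectorSensor sensor)
        {
            // Positions relative to the agent: 3 + 3 floats
            sensor.AddObservation(transform.InverseTransformPoint(mine.position));
            sensor.AddObservation(transform.InverseTransformPoint(goldBase.position));
            // Whether the agent is currently carrying gold: 1 float
            sensor.AddObservation(carryingGold);
            // Total: 7, so Vector Observation Space Size should be 7
        }
    }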

    You have an obscenely high max_steps value but are using a linear learning_rate_schedule; you may as well use constant, because the learning_rate and beta will decay so little over whatever step count you actually train for. If you actually want those values to decay to help with convergence, max_steps needs to be set to the number of steps you intend to run instead of setting it really high and stopping training manually.

    You're using continuous actions in a space that could easily be discrete, and continuous actions require quite a few more training iterations for the agent to figure out. PPO is especially bad at learning continuous action spaces. I'd recommend changing to a discrete action space with 2 branches of 3 actions each. Branch 1: no move, move left, move right. Branch 2: no move, move up, move down. Much simpler, much easier for an agent to learn.
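
    A sketch of what that looks like on the agent (assuming a recent ML-Agents release with the ActionBuffers API; set the two discrete branch sizes to 3 in Behavior Parameters):

    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class FarmerAgent : Agent
    {
        public float moveSpeed = 3f;

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Branch 0: 0 = no move, 1 = left, 2 = right
            // Branch 1: 0 = no move, 1 = forward, 2 = back
            int moveX = actions.DiscreteActions[0];
            int moveZ = actions.DiscreteActions[1];

            var direction = Vector3.zero;
            if (moveX == 1) direction.x = -1f;
            else if (moveX == 2) direction.x = 1f;
            if (moveZ == 1) direction.z = 1f;
            else if (moveZ == 2) direction.z = -1f;

            transform.localPosition += direction * moveSpeed * Time.fixedDeltaTime;
        }
    }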

    Your batch and buffer sizes are way too small even for discrete action spaces; for continuous action spaces they should be something like batch_size: 1024 and buffer_size: 4096. Even that is probably a bit on the low end. See the docs for some rough value ranges.

    Normalize should be true for continuous control spaces, and false for discrete ones (unless it shouldn't be, but I won't get into that).

    You have no max step on the agent. From my understanding of your description, you are resetting the agent when its 'life' runs out and extending it when it performs the correct actions, using this as a pseudo max step. This is a complicated relationship for the agent to understand, and I'm not sure what the unintended effects of that would be. My guess is that's why you're not seeing much hustle from the agent (inefficient pathing), but I can't be sure. I'd recommend just setting a sane max step and removing the life stuff. If you feel you need to keep it, at least give the agent an observation of its life so it can more easily learn the relationship.
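
    Something like this, if you do keep the life mechanic (a sketch; MaxStep can also just be set on the Agent component in the inspector, and the life fields are whatever you already have in your script):

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    public class FarmerAgent : Agent
    {
        float lifeRemaining = 30f;   // assumed fields from your existing life system
        const float maxLife = 60f;

        public override void Initialize()
        {
            MaxStep = 3000;          // episode resets automatically after 3000 agent steps
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            // ...your other observations...
            sensor.AddObservation(lifeRemaining / maxLife);  // roughly normalized life
        }
    }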

    Your reward structure could be optimized, but that's usually the last step, and the agent should train OK with the one you have.

    As for your Vector3 question: no, the agent starts training knowing nothing. The internal representation is built up during training entirely through trial and error, as relationships between the observation and action spaces. After training for several thousand steps, the relationships that start to emerge begin to look a lot like our idea of basic Euclidean geometry. This assumes, of course, that the environment is programmed in a way that makes those relationships clear (see my first point about local/global vector observation mixing). The agent will not have any understanding of objects at all (and it doesn't need to); the agent only optimizes for the greatest reward (over a time series), and within its deep network it will hold the relationships between the observations and actions it has connected. (This is not strictly accurate mathematically, but it's a good conceptualization or mental model of the algorithm's statistical innards.)

    Hope this helps, let me know how it goes!
     
    Last edited: Jan 22, 2021
  4. kurismakku


    Joined:
    Sep 18, 2013
    Posts:
    66
    Thank you so much for your reply! It seems I have a lot to learn, looking forward to it. I will try these suggestions tomorrow.

    You mentioned that I am tracking the mines'/bases' global positions; where is that exactly? I am using transform.localPosition everywhere in the code. The agent, the mine, and the base are all children of Floor.

    Thanks once again, I will update you about this in the coming days.
     
  5. Luke-Houlihan


    Joined:
    Jun 26, 2007
    Posts:
    303
    Oh my bad, looks like I was mistaken on that one.

    I would still recommend not using local positions either, and converting to agent-relative positions instead, something like -

    sensor.AddObservation(this.transform.InverseTransformPoint(Mines[i].Position));


    It's basically the difference between me telling you to go to 76°39'02"S 147°52'45"E versus telling you the destination is 12km north, 40km west of you, at elevation 1230m. You could probably figure out how to use the coordinates to get there eventually, but the directions relative to your location are much easier to intuit.
     
  6. SabriAbderrazzak


    Joined:
    Jan 1, 2022
    Posts:
    2
    Hello! Have you been able to train this model, or anything close to it? It would help a lot, thanks!