My Agent doesn't try all possible Discrete Actions

Discussion in 'ML-Agents' started by CodeMonkeyYT, Nov 19, 2020.

  1. CodeMonkeyYT


    Joined:
    Dec 22, 2014
    Posts:
    124
    Hey everyone,

    I'm trying to make a simple Car Driver Agent that follows a track, but I'm having trouble getting it to work.
    The main issue is that the Agent never tries all possible actions; it ends up repeating the same few actions over and over, so the reward never improves.

    I have two Discrete Action vectors, each with 3 possible values.
    Accelerate, DontMove, BrakeReverse
    TurnLeft, DontTurn, TurnRight
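
    For context, here's roughly how those two branches get read in OnActionReceived. This is just a simplified sketch, not my exact code: the index mapping, field values and the movement calls are placeholders (and on older ML-Agents versions the method receives a float[] instead of ActionBuffers).

    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using UnityEngine;

    public class CarDriverAgent : Agent
    {
        [SerializeField] private float moveSpeed = 10f;    // placeholder values
        [SerializeField] private float turnSpeed = 100f;

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Branch 0 (assumed mapping): 0 = Accelerate, 1 = DontMove, 2 = BrakeReverse
            int drive = actions.DiscreteActions[0];
            // Branch 1 (assumed mapping): 0 = TurnLeft, 1 = DontTurn, 2 = TurnRight
            int turn = actions.DiscreteActions[1];

            float moveInput = drive == 0 ? 1f : drive == 2 ? -1f : 0f;
            float turnInput = turn == 0 ? -1f : turn == 2 ? 1f : 0f;

            // Placeholder movement; the real project drives a car controller instead.
            transform.Translate(Vector3.forward * moveInput * moveSpeed * Time.fixedDeltaTime);
            transform.Rotate(Vector3.up, turnInput * turnSpeed * Time.fixedDeltaTime);
        }
    }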

    From what I understand, the way machine learning works is by first testing actions at random and then seeing which of those random actions result in a good reward.
    So I would expect the agent to try all combinations of those actions in order to find one that seems to work.
    But what I'm actually getting is that the agent mostly just does a single action, which is different every time I run the game.

    Here's the simple track scenario


    Here's one run where it tried Reversing to the Left 70 thousand times but didn't try Accelerating even once.


    In this one it just tried Accelerating forward which gets the agent to the first bend but it never tried Accelerate and TurnRight.


    I've tried the default config file and I've also tried the one below, as well as changing tons of parameters.
    Code (YAML):
    behaviors:
      CarDriver:
        trainer_type: ppo
        hyperparameters:
          batch_size: 256
          buffer_size: 10240
          learning_rate: 1.0e-5
          beta: 5.0e-4
          epsilon: 0.2
          lambd: 0.99
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 512
          num_layers: 3
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        max_steps: 500000
        time_horizon: 2048
        summary_freq: 5000
    I've set the reward based on distance traveled along the track.

    And the Observations that I'm using are
    - Current Position
    - Next Checkpoint Position
    - Raycast Distance to Wall At Angle 0
    - Raycast Distance to Wall At Angle +45
    - Raycast Distance to Wall At Angle -45
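
    For reference, the observation collection is essentially the following (again a simplified sketch rather than my exact code; the checkpoint reference, wall layer mask and ray length are placeholders, and in the real project this sits on the same agent class as the driving code):

    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    public class CarDriverObservations : Agent   // sketch-only class name
    {
        [SerializeField] private Transform nextCheckpoint;    // placeholder reference to the next checkpoint
        [SerializeField] private LayerMask wallMask;           // layer the track walls are on
        [SerializeField] private float rayLength = 50f;

        public override void CollectObservations(VectorSensor sensor)
        {
            sensor.AddObservation(transform.position);         // current position (3 floats)
            sensor.AddObservation(nextCheckpoint.position);    // next checkpoint position (3 floats)
            sensor.AddObservation(RaycastDistance(0f));        // wall distance straight ahead
            sensor.AddObservation(RaycastDistance(45f));       // wall distance at +45 degrees
            sensor.AddObservation(RaycastDistance(-45f));      // wall distance at -45 degrees
        }

        private float RaycastDistance(float angle)
        {
            Vector3 dir = Quaternion.AngleAxis(angle, Vector3.up) * transform.forward;
            return Physics.Raycast(transform.position, dir, out RaycastHit hit, rayLength, wallMask)
                ? hit.distance
                : rayLength;   // no hit: report the maximum length
        }
    }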

    So my issue is that I don't even know what is wrong. I've tried messing around with pretty much every single parameter, and I cannot get the Agent to try all of the actions in order to find the ones that work.

    Thanks!
     
  2. andrewcoh_unity


    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Hi @CodeMonkeyYT

    It is strange that the agent isn't exploring the action space more fully. It may be hard to diagnose without looking at the code, but I think there are a few setup details to clarify which may give us more insight into what's causing the issue.

    A few questions:

    Does your environment have end conditions, i.e. does the agent's episode restart if it runs off the track/into a wall or hits the max step?

    Can you describe your reward function more explicitly?

    A few things stand out to me here:

    (1) The learning rate and beta seem a little low. I'd recommend initially trying 0.0003 and 0.001, respectively. Increasing beta incentivizes the agent to behave more randomly, which may help with your issue, but I suspect there's something else at play here.

    (2) The time horizon seems a little large for this problem. I'd try turning that down to 1000.

    (3) The observations of current position and next checkpoint position strike me as a little odd if the agent's only goal is to move as far forward as possible. If the objective is just to move forward and not hit the walls, the raycasts should definitely be enough.
     
  3. CodeMonkeyYT


    Joined:
    Dec 22, 2014
    Posts:
    124
    Yes, the episode ends and restarts upon hitting a wall.

    Right now my reward function gives +1 for each checkpoint and (-time * .1f) in OnActionReceived (to discourage the agent from standing still until the time runs out).
    I've also tried -10 for hitting a wall, but it didn't seem to help.

    I've also tried adding the Curiosity reward signal, but it didn't seem to do anything:
    Code (YAML):
          curiosity:
            strength: 1.0
            gamma: 0.99
            encoding_size: 128
            learning_rate: 3.0e-4
    Is it better to have fewer or more observations? I also tried adding the current rotation angle, transform.forward, and the direction to the next checkpoint, but again I didn't notice any difference.

    Maybe the issue is simply a matter of the volume of steps? For this kind of problem, how much do I need to train before I see any kind of results?

    I'll try the values you mentioned and see if it helps, thanks!
     
  4. CodeMonkeyYT


    Joined:
    Dec 22, 2014
    Posts:
    124
    Should I be using Stacked Vectors? Currently I have it set to 1.

    What about Decision Period? I have it set to 5 with Take Actions Between Decisions enabled.

    If I set a beta of 1.0, shouldn't that mean it behaves completely randomly, which should guarantee that it hits all possible actions?
    Even with that I still have the same issue; it doesn't explore more than 2-3 possible actions.
     
    Last edited: Nov 19, 2020
  5. andrewcoh_unity


    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    I'm a bit suspicious of `(-time * .1f)`. What is the value of time? It's possible that if this reward is too negative, the agent is learning to end episodes as quickly as possible to minimize penalty incurred. Are your episode lengths really short?

    Since the agent can end the episode by hitting the wall, it may even make sense to give a survival bonus. However, to simplify this problem, can you just try administering reward for forward velocity per timestep? This way, the agent is (1) encouraged to move forward quickly and (2) stay alive so that it can keep getting reward for moving forward quickly.
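
    For example, just as a sketch (assuming the car has a Rigidbody; the 0.01 scale is arbitrary and would need tuning):

    Code (CSharp):
    // Inside OnActionReceived (or FixedUpdate), each time actions are applied:
    Rigidbody rb = GetComponent<Rigidbody>();                            // cache this in Initialize() in practice
    float forwardSpeed = Vector3.Dot(rb.velocity, transform.forward);    // speed along the car's facing direction
    AddReward(forwardSpeed * 0.01f);                                     // positive for forward motion, negative for reversing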

    How sparse are your checkpoints? If they are very far away from each other, curiosity could help but I'm not sure. My feeling is the agent should learn to discover checkpoints.

    As far as observations go, it's not necessarily true that fewer is better (though fewer is better for computational cost). What's important is that the observations capture enough information for the agent to decide what to do next. From what I understand, the agent needs to learn to move forward and not hit any walls. To do this, the agent needs to know where the walls are relative to itself and which direction is forward. The raycasts and a fixed vector pointing forward may be all you need. With this in mind, going back to the reward suggestion above, you can use velocity and maybe the dot product between the agent's forward vector and the fixed forward vector.
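
    Something along these lines inside CollectObservations (a sketch; trackForward here stands in for whatever fixed vector you treat as 'forward' along the track):

    Code (CSharp):
    // How aligned the agent is with the track's forward direction, in [-1, 1].
    Vector3 trackForward = Vector3.forward;                         // assumed fixed forward for the track
    float alignment = Vector3.Dot(transform.forward, trackForward);
    sensor.AddObservation(alignment);                               // 1 = facing forward, -1 = facing backwards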

    I don't think stacked vectors are necessary for this. Those are helpful when the agent needs a very short memory of past events. Decision period is the number of fixed updates to elapse between decisions. 5 is reasonable and I would caution against tuning this until you have a real need to have the agent make more decisions in a given amount of time.
     
  6. andrewcoh_unity


    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    As far as training volume goes, if the agent doesn't start showing an intention to move forward/avoid walls after maybe 70k timesteps, something is probably off. I'm just basing this on my judgement of your problem, though. It's possible that it's more complicated than I'm understanding and would require more time.
     
  7. ShirelJosef


    Joined:
    Nov 11, 2019
    Posts:
    21
    I had this weird issue with continuous control, and changing normalize: false to true solved it.
    Please tell me if that helps.
     
    Last edited: Nov 20, 2020
    LefaMoffat and CodeMonkeyYT like this.
  8. CodeMonkeyYT


    Joined:
    Dec 22, 2014
    Posts:
    124
    Many thanks! Setting normalize to true did it! Now every time I run it, the agent does indeed try out all possible actions at roughly the rate I would expect.


    The penalty I have is AddReward(-Time.fixedDeltaTime * .1f); whereas a checkpoint gives +1.
    I'll try without that penalty, and with normalize on, and see if the agent no longer gets stuck in a never-ending rotation.

    The one thing I did previously that greatly helped was indeed moving the checkpoints much closer together. Initially I had them about 10 units apart, and when I placed more of them about 1 unit apart the agent started to learn.

    Is it also possible that this is the kind of problem that would greatly benefit from imitation learning? I'm currently looking into how that works.
     
    LefaMoffat likes this.
  9. CodeMonkeyYT


    Joined:
    Dec 22, 2014
    Posts:
    124
    Setting normalize to true was indeed the key to solving my problem.
    With that, and after spending a few hours training and setting up checkpoint positions, I now have a working AI Driver.

    Thank you both!
     
  10. ShirelJosef


    Joined:
    Nov 11, 2019
    Posts:
    21
    @andrewcoh_unity
    It seems to me that in certain situations, without normalization, the network collapses ("converges") really fast to some weird local minimum. It happened to me in continuous control and to CodeMonkey in discrete control.
    Maybe it is a learning rate issue, but either way, that is kind of problematic.
    What is the best way to address this? Open a bug on GitHub?

    @CodeMonkeyYT
    Do you by any chance have the entropy graph of the experiment with normalize = false?
     
  11. CodeMonkeyYT


    Joined:
    Dec 22, 2014
    Posts:
    124
    Hmm, looking at the Entropy graph it seems that with normalize set to false it was falling to 0,
    whereas with it set to true it stays stable while the agent consistently gains more reward.
    I was running the training under a different name every time, so it's kind of hard to compare the graphs.
     
  12. andrewcoh_unity


    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Glad to see everything works!

    It's hard to tell what's going on from inspecting the graphs.

    I do not think this is a bug. My guess is that the values of some of the vector observations were too large (or on very different scales) and were dominating the activations of the network. Normalization rescales everything to roughly zero mean. It is strange, though, and I'll think more about it.
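
    Conceptually, the normalizer keeps a running mean/variance per observation value and rescales each incoming value before it reaches the network, roughly like this (a sketch of the idea only, not the actual ML-Agents implementation):

    Code (CSharp):
    using UnityEngine;

    // Conceptual running normalization (Welford-style update), for illustration only.
    public class RunningNormalizer
    {
        private float mean;
        private float m2;      // running sum of squared deviations from the mean
        private int count;

        public float Normalize(float value)
        {
            // Update the running statistics with the new sample.
            count++;
            float delta = value - mean;
            mean += delta / count;
            m2 += delta * (value - mean);

            // Rescale so large raw values (e.g. world positions) don't drown out small ones (e.g. short ray distances).
            float variance = count > 1 ? m2 / (count - 1) : 1f;
            float std = Mathf.Sqrt(variance) + 1e-8f;
            return Mathf.Clamp((value - mean) / std, -5f, 5f);
        }
    }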
     
  13. TASIHMI


    Joined:
    Dec 3, 2021
    Posts:
    4
    How can I do this ("setting normalize to true")? When I try to train my object, it only rotates and never tries the other actions.
     
  14. smallg2023


    Joined:
    Sep 2, 2018
    Posts:
    141
    in the .yaml, under network_settings, i.e.:
    network_settings:
      normalize: true