Change Decision Period during training

Discussion in 'ML-Agents' started by hk1ll3r, Dec 30, 2021.

  1. hk1ll3r

    Joined:
    Sep 13, 2018
    Posts:
    88
    I'm training agents using POCA. The game is a soccer-style game (https://hk1ll3r.itch.io/football-io)

    It takes a very long time for the agents to pick up the basics of running after the ball and kicking it in the right direction. Initially I'd like to set the decision period to a larger value like 5 so that the agents learn the basic actions faster, and near the end of training lower it to 1 so that they learn accuracy.

    Is that supported, and is it a valid thing to do? I have set it up with curriculum parameters and it kind of works when I try it, but I'm not sure whether this is an intended use or whether there is a better way of achieving the same thing.
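    Roughly what I have in mind on the C# side (a minimal sketch - "decision_period" is just the name I picked for the curriculum parameter, and the class name is made up):

    Code (CSharp):
    using Unity.MLAgents;
    using UnityEngine;

    // Reads a curriculum-controlled "decision_period" environment parameter at the
    // start of each episode and applies it to the stock DecisionRequester component.
    public class DecisionPeriodFromCurriculum : Agent
    {
        DecisionRequester m_Requester;

        public override void Initialize()
        {
            m_Requester = GetComponent<DecisionRequester>();
        }

        public override void OnEpisodeBegin()
        {
            // Falls back to 5 if the parameter isn't defined in the trainer config.
            float period = Academy.Instance.EnvironmentParameters
                .GetWithDefault("decision_period", 5f);
            m_Requester.DecisionPeriod = Mathf.Clamp(Mathf.RoundToInt(period), 1, 20);
        }
    }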
     
  2. mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Technically, changing decision intervals during training is not a problem. You're already aware that different intervals influence how agents learn, since observations are sparser at larger intervals (1). Rather than switching abruptly from 5 to 1, though, I would lower the value gradually. If you're writing your own decision requester logic, you might also try including the decision interval itself in the agent's actions. That way, the agent can decide how much accuracy it needs in a given situation. The interval choice must not directly inflate the rewards, however - you would need to divide your rewards by the current decision interval (2). There's a rough sketch of this after the clarifications below.

    EDIT Just to clarify:

    1) Agents might learn faster at larger intervals because there is less data to be exchanged and processed during an episode. But agent behavior will also be different - for example, if a robot was trained to articulate its arm with an interval of 5, it might miss its target articulation after suddenly switching to 1, since it still expects its observations and actions to apply for 5 update steps.

    2) That is, if you were adding rewards at fixed update steps, like after every action. Without dividing the rewards, the agent could maximize the reward per decision simply by choosing a larger interval.
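    Here is a rough sketch of what I mean - it's just one way to do it, not an official API pattern beyond Agent/Academy, and the class and method names are made up. It assumes no stock DecisionRequester is attached and that the last discrete action branch encodes "interval - 1":

    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using UnityEngine;

    // Custom decision requesting: the last discrete action branch selects the
    // decision interval, and per-step rewards are divided by that interval so a
    // larger interval doesn't inflate the reward collected per decision.
    public class SelfPacedAgent : Agent
    {
        int m_Interval = 5;                // interval currently chosen by the agent
        int m_StepsSinceDecision = 999;    // forces a decision on the first step

        public override void Initialize()
        {
            // Same hook the stock DecisionRequester uses.
            Academy.Instance.AgentPreStep += RequestDecisionIfDue;
        }

        void OnDestroy()
        {
            if (Academy.IsInitialized)
            {
                Academy.Instance.AgentPreStep -= RequestDecisionIfDue;
            }
        }

        void RequestDecisionIfDue(int academyStepCount)
        {
            if (m_StepsSinceDecision >= m_Interval)
            {
                m_StepsSinceDecision = 0;
                RequestDecision();
            }
            else
            {
                RequestAction();   // repeat the last decided action in between
            }
            m_StepsSinceDecision++;
        }

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Assumed action layout: branch 0 = game action, branch 1 = interval - 1.
            m_Interval = actions.DiscreteActions[1] + 1;

            float stepReward = ComputeStepReward();   // your per-step shaping reward
            AddReward(stepReward / m_Interval);       // normalize by the interval
        }

        float ComputeStepReward() { return 0f; }      // placeholder for your reward logic
    }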
     
    Last edited: Jan 2, 2022
    hk1ll3r likes this.
  3. hk1ll3r

    Joined:
    Sep 13, 2018
    Posts:
    88
    Thanks @mbaske for the clarifications. That's what I'm doing - gradually reducing the decision interval. One thing that bothers me is that in the graphs, the entropy and other metrics jump when the decision interval changes. I assume that's ok.

    I am a bit unclear about the second point you make. A higher decision period means the agent makes an observation and takes an action every nth step, but in each intermediate step it keeps taking that same action (when "Take Actions Between Decisions" is checked, as it is in my case). So from the learning algorithm's point of view, the agent took an action in every step (albeit the same action). So the per-step rewards defined in "OnActionReceived" of the Agent script should suffice. Is that not the case?
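    To make sure I understand it right, this is roughly how I picture the callbacks firing with a decision period of 5 (the class name is made up, it's just a probe):

    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using UnityEngine;

    // With a stock DecisionRequester (DecisionPeriod = 5, Take Actions Between
    // Decisions checked), OnActionReceived fires on every Academy step, but the
    // action values only change on decision steps.
    public class RepeatedActionProbe : Agent
    {
        public override void OnActionReceived(ActionBuffers actions)
        {
            // The same action content arrives on the 4 in-between steps as on
            // the decision step that produced it.
            Debug.Log($"step {Academy.Instance.StepCount}: action[0] = {actions.ContinuousActions[0]}");
            AddReward(-0.001f);   // a fixed per-step reward accrues 5x per decision
        }
    }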

    Also, are you part of the Unity ml-agents team or an enthusiast like me? Asking since I'm spending a lot of computation time on training and cannot afford big mistakes.

    Thanks again!

     
  4. hk1ll3r

    Joined:
    Sep 13, 2018
    Posts:
    88
    @mbaske Reading through the manual, I see what you are saying. The learning algorithm treats decisions as "actions" in the RL terminology, and assigns all the rewards for the actions in between decisions to the latest decision.

    Have you tried letting the agent decide its own decision period? Any gotchas?
     
  5. mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Yes, I did a few experiments, but no detailed analysis. My agents' average decision interval was converging towards some value over time. However, it kept fluctuating quite a bit between individual steps. So my guess is the agents were adapting the interval to the situation at hand - can't prove that though.

    I'm not part of the team, just a fan ;) Yes, I wish there was some shortcut for avoiding long trial runs, but I haven't found one yet. I usually copy the trainer config yaml from one of the demo environments that I feel best matches my project, and edit it slightly. Then I send as many custom metrics as possible to the stats recorder, so I can check tensorboard and see if things go sideways early on.