
How does ML decide value for continuous action space?

Discussion in 'ML-Agents' started by Sherlore, Jun 22, 2021.

  1. Sherlore

    Sherlore

    Joined:
    Feb 26, 2014
    Posts:
    13


    Hello, I am new to Unity ML.
    I have been running several experiments on my ragdoll project, trying to train an ML agent to control the ragdoll and chase a target. The project is basically edited from the Walker example project.

    But in several experiments the agents seldom produce real "motion", such as striding with both legs to run. Instead, they usually seem to settle on a "form" and only make slight changes to it when acting. For example, in my video the agent only uses its left leg to jump and never tries to stride with its right leg, and even the left leg never fully stretches.

    So I am wondering: how does ML-Agents decide the values for a continuous action space?

    My first hypothesis is that it just tries random actions throughout training and keeps learning from the results.
    My second hypothesis is that once the agent finds an action set that looks good, it starts modifying that set, e.g. by +-0.2 on each action, and learns from those results.

    Which of these is correct, or is neither?
    And how could I make my agent more agile in its motion instead of staying rigid in one specific pose?

    Another question: my project only needs the range [-1, 1] for actions.
    Like the Walker project, it uses Math.Clamp(action, -1, 1) on all action values, roughly as in the sketch below.
    But does that mean the agent actually tries values outside the [-1, 1] range?
    Does this make exploring the action space less efficient? Can I limit the range of the action space for ML-Agents?
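
    To be concrete, my action handling looks roughly like this (a simplified sketch; the real Walker scripts go through the JointDriveController helpers, and my joint setup and names differ):

        using Unity.MLAgents;
        using Unity.MLAgents.Actuators;
        using UnityEngine;

        public class RagdollChaserAgent : Agent
        {
            // One ConfigurableJoint per controllable body part, assigned in the Inspector.
            public ConfigurableJoint[] joints;

            public override void OnActionReceived(ActionBuffers actionBuffers)
            {
                var actions = actionBuffers.ContinuousActions;
                for (var i = 0; i < joints.Length; i++)
                {
                    // Clamp each continuous action to [-1, 1], as in the Walker sample.
                    var a = Mathf.Clamp(actions[i], -1f, 1f);

                    // Map the normalized action onto a joint target rotation (illustrative mapping only).
                    joints[i].targetRotation = Quaternion.Euler(a * 60f, 0f, 0f);
                }
            }
        }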

    Many thanks,
    Sherlore
     
  2. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    What you're describing is a classic problem in ML: the tradeoff between exploration and exploitation. Basically, ML-Agents starts by selecting actions randomly, and gradually encourages the actions that result in higher reward. The rate of this exploitation is controlled by the learning rate, and is slowed by the beta parameter (which encourages randomness).

    So, if it discovers one type of gait, it will likely stick with it, unless you encourage additional exploration by increasing beta or increasing the batch size (which effectively reduces the learning rate).
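
    Both of those are set in the trainer config YAML you pass to mlagents-learn. Something like this (the values here are just illustrative, not a recommendation, and the behavior name has to match yours):

        behaviors:
          Walker:                  # your Behavior Name
            trainer_type: ppo
            hyperparameters:
              batch_size: 2048     # larger batch -> effectively slower, smoother updates
              buffer_size: 20480
              learning_rate: 3.0e-4
              beta: 5.0e-3         # entropy bonus; raise this to encourage more exploration
              epsilon: 0.2
              lambd: 0.95
              num_epoch: 3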

    As for the actions coming out of the policy, they are already clipped to [-1, 1], so additional clipping is probably not required.
     
    Sherlore likes this.
  3. Sherlore

    Sherlore

    Joined:
    Feb 26, 2014
    Posts:
    13
    Thank you for your clear and useful explanation. It helps me a lot!
     
  4. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150