Early attempt at simple ML Agents, looking for feedback

Discussion in 'ML-Agents' started by wx3labs, Apr 4, 2021.

  1. wx3labs

    Joined:
    Apr 5, 2014
    Posts:
    77
    I've just started experimenting with ML-Agents in the past few days and have some questions. My current experiment is pretty simple and loosely based on the balancing ball example:

    A bunch of "ships" on a 2D plane, trying to reach the origin (0, 0).

    Discrete Actions:
    • Apply torque (left, none, right)
    • Apply thrust (none, full)
    Observations:
    • Position
    • Euler angle.y
    • Rigid.velocity
    The rewards:
    • +1 for getting close to the origin (resets agent)
    • -1 for getting too far (resets agent)
    • Very small reward/penalty for how close/far we are from origin
    • Very small reward/penalty for whether our velocity (dot product) is pointed at the origin
    • Very small penalty for absolute value of angular velocity
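    Roughly, the agent script is wired up like the sketch below (the class name, reward scales and force values are placeholders, not the exact code):

    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using Unity.MLAgents.Sensors;

    // Sketch of the setup described above; names and constants are illustrative.
    public class ShipAgent : Agent
    {
        public float torque = 1f;
        public float thrust = 5f;
        public float goalRadius = 1f;    // +1 and reset inside this
        public float failRadius = 20f;   // -1 and reset outside this

        Rigidbody rb;

        public override void Initialize()
        {
            rb = GetComponent<Rigidbody>();
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            sensor.AddObservation(transform.position);        // position (3)
            sensor.AddObservation(transform.eulerAngles.y);   // heading (1)
            sensor.AddObservation(rb.velocity);               // velocity (3)
        }

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Branch 0: torque -> 0 = left, 1 = none, 2 = right
            rb.AddTorque(Vector3.up * torque * (actions.DiscreteActions[0] - 1));

            // Branch 1: thrust -> 0 = none, 1 = full
            if (actions.DiscreteActions[1] == 1)
                rb.AddForce(transform.forward * thrust);

            float dist = transform.position.magnitude;
            if (dist < goalRadius)      { AddReward(+1f); EndEpisode(); }
            else if (dist > failRadius) { AddReward(-1f); EndEpisode(); }
            else
            {
                // small shaping terms: closer is better, velocity pointed at
                // the origin is better, spinning is worse
                AddReward(0.001f * (1f - 2f * dist / failRadius));
                AddReward(0.001f * Vector3.Dot(rb.velocity.normalized,
                                               (-transform.position).normalized));
                AddReward(-0.001f * Mathf.Abs(rb.angularVelocity.y));
            }
        }
    }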
    When I start training they drunkenly spin around in circles, then very gradually improve. After 1 million steps they still spend most of their time spinning in circles and every once in a while one will head to the center. After 2 million steps I stopped the training. At that point the agents are usually finding their way to center, but not directly:

    [Attached: MLAgents1.gif]

    Questions:
    • Is 2 million+ steps a reasonable training time to get to that point, or does it suggest something's way off in my model? Does it depend on the hardware (machine is 5 years old)?
    • Any obvious things you would suggest to improve their behavior?
    • How do you decide whether to tweak rewards vs. tweak hyperparameters vs. run longer sessions?
    Thanks for any suggestions!
     


  2. ruoping_unity

    Unity Technologies

    Joined:
    Jul 10, 2020
    Posts:
    134
    > Is 2 million+ steps a reasonable training time to get to that point, or does it suggest something's way off in my model? Does it depend on the hardware (machine is 5 years old)?
    The number of steps needed to train the model isn't affected by the hardware, only by the difficulty of the task. However, the wall-clock time needed to run 2M steps is very dependent on your machine's compute power. 2M+ steps is not a crazy number for RL training, but given that the task seems fairly simple, I'd say there might be some issues in your setup.

    > Any obvious things you would suggest to improve their behavior?
    I'm not completely sure about the task objective from your description, but my understanding is that you're trying to make the ships go straight toward the origin. If that's the case, your reward shaping might be a bit misleading.

    You're giving a small penalty for angular velocity to tell it to go straight, but you're also giving a small reward for "getting close" to the goal. The agent might exploit that small reward and hover near the origin without ending the episode: the expected total reward for reaching the origin is +1, while the expected total reward for just staying close to it is remaining_steps * small_reward + 1. A better way to tell the agent to get to the goal as soon as possible is to give a small penalty on each step, and only give a reward when it reaches the goal. Since it gets a penalty for every step it stays alive, it will try to end the episode quickly.
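    A minimal sketch of that reward structure, reusing the placeholder names from the sketch above and assuming a MaxStep is set on the agent (so the per-step penalties over a full episode sum to roughly -1):

    Code (CSharp):
    public override void OnActionReceived(ActionBuffers actions)
    {
        // ...apply torque/thrust as before...

        float dist = transform.position.magnitude;
        if (dist < goalRadius)
        {
            AddReward(+1f);        // the only positive reward: reaching the goal
            EndEpisode();
        }
        else if (dist > failRadius)
        {
            AddReward(-1f);
            EndEpisode();
        }
        else
        {
            // existence penalty: every extra step costs a little, so the total
            // return is higher the sooner the agent reaches the goal
            AddReward(-1f / MaxStep);
        }
    }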

    > How do you decide whether to tweak rewards vs. tweak hyperparameters vs. run longer sessions?
    The agents learn solely from the reward signals. I'd say, among the three, the most important thing is to make sure your reward is really telling the agent what the ideal behavior is.

    You'll want to run longer if you see that training is still making progress, for example if the rewards are still trending up or the loss is still going down. Each hyperparameter you tweak has a different effect on training, and that varies from case to case. For example, if your training is very unstable you might want to increase the buffer size, or you might need to scale up the network size if the task is more complicated.
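    For reference, those knobs live in the trainer configuration YAML passed to mlagents-learn; a sketch (the behavior name and values are only illustrative and need to match your own setup):

    Code (YAML):
    behaviors:
      ShipAgent:                # must match the Behavior Name on the agent
        trainer_type: ppo
        hyperparameters:
          batch_size: 128
          buffer_size: 2048     # increase if training is unstable
          learning_rate: 3.0e-4
        network_settings:
          normalize: true
          hidden_units: 128     # scale up for more complicated tasks
          num_layers: 2
        max_steps: 2.0e6
        time_horizon: 64
        summary_freq: 10000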
     
    wx3labs likes this.
  3. wx3labs

    Joined:
    Apr 5, 2014
    Posts:
    77
    Thanks, that's very helpful! Changing the reward to discourage dawdling produced some obvious improvements.