
Question Request for advice on improving the Reward function

Discussion in 'ML-Agents' started by Leben06, May 9, 2023.

  1. Leben06

    Leben06

    Joined:
    Jun 10, 2017
    Posts:
    1
    Hello everyone,
    I would like to have some outside opinions on a problem I'm having with my ml-agents project. Based on the "Crawler" example from the Github repository, I'm trying to teach a 3D model, which happens to be a simulation of a robot I've built in real life, how to walk.
    https://github.com/EbonGit/ML_Robot

    robot.PNG

    The inputs are, like in the demo, the properties of the joints and the position of the target. As for the reward function, it's a value between 0 and 1 that takes into account the agent's speed and its orientation relative to the target, and -1 if it falls.
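For readers who haven't seen the Crawler example, a reward of that shape can be sketched in Python. This is an illustration only, not the poster's actual code: the function and parameter names are my own, and the exact weighting in the real project may differ.

```python
import numpy as np

def locomotion_reward(velocity, target_velocity, forward_dir, to_target_dir, fell):
    """Sketch of a [0, 1] walking reward: a speed-matching term times an
    orientation term, with -1 when the robot falls. All vectors are numpy
    arrays; direction vectors are assumed to be unit length."""
    if fell:
        return -1.0
    # Speed term: 1.0 when the agent moves at the goal velocity, decaying
    # quadratically as the velocity error grows.
    speed_err = np.linalg.norm(target_velocity - velocity) / np.linalg.norm(target_velocity)
    speed_reward = max(0.0, 1.0 - speed_err) ** 2
    # Orientation term: remap the cosine between the body's forward axis and
    # the direction to the target from [-1, 1] into [0, 1].
    cos_angle = float(np.dot(forward_dir, to_target_dir))
    look_reward = (cos_angle + 1.0) / 2.0
    # Multiplying (rather than adding) the terms means the agent only earns
    # reward when it is both moving at speed AND facing the target.
    return speed_reward * look_reward
```

The multiplicative form is what makes reward shaping tricky here: the agent maximizes this product without any regard for how smooth the joint trajectories are, which is consistent with the jerky-but-successful gait described above.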

    As shown in the graphs below, the learning process seems to be working, and the robot in the simulation performs its task correctly. However, the way it moves is too jerky for my taste. I believe the episode reward plateaus because the agent is doing exactly what it's rewarded for.

    courbe1.PNG

    courbe2.PNG
    courbe3.PNG

    mlagents-learn config.yaml --env=UnityEnvironment --run-id=train --force --num-areas=16

    My question is: do you have any suggestions on how to modify my reward function to make its movements more natural?

    Thanks for taking the time to read. If you need more information, let me know.
     
  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    I've worked on a similar robotic quadruped locomotion problem and tried to address the 'jerkiness' problem. As you've probably found, there really isn't a simple solution to it. Here are some things I did that had some effect on the trained policy:

    Adding an 'energy cost' to each action that changes the joint positions.
    This change was by far the most complicated but gave me the best results. I gave the robot an energy bank for the whole episode and decreased it as the robot changed its joint angles, reducing it by the magnitude of the change (larger changes cost more). When the robot ran out of energy I ended the episode and assigned it a -1 penalty. This ended up being too tough for the agent to figure out while learning to walk, and it would often learn a bad policy of just standing still to conserve energy instead of moving forward toward the target. I addressed this by using a curriculum to start the agent with a very large energy bank until it had already learned a decent policy (it was successfully making it to the target's position most of the time). The energy bank was then slowly decreased, forcing the agent to be more conservative with its movements. This largely worked; however, it took a ton of tweaking and failed training runs to balance the curriculum + energy bank against the competency of the policy at any given point in training. I'm not convinced I ever found the right balance, but eventually I ran out of time to experiment.
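A minimal sketch of that energy-bank idea in Python (the class name, the L1 cost metric, and the curriculum schedule are my own assumptions, not Luke's actual implementation):

```python
import numpy as np

class EnergyBank:
    """Hypothetical per-episode energy budget: every joint-angle change is
    charged against the bank, and the episode ends (with a -1 penalty,
    applied by the caller) when the bank is empty."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.remaining = capacity

    def spend(self, prev_angles, new_angles):
        """Charge the magnitude of the joint-angle change (sum of absolute
        deltas, so larger changes cost more). Returns True while energy
        remains, False when the episode should be ended."""
        cost = np.abs(np.asarray(new_angles) - np.asarray(prev_angles)).sum()
        self.remaining -= cost
        return self.remaining > 0

def curriculum_capacity(lesson, start=1e6, floor=1e3, decay=0.5):
    """Shrink the starting bank as curriculum lessons advance, so the agent
    first learns to walk with an effectively unlimited budget and is only
    later forced to be economical. All constants here are placeholders."""
    return max(floor, start * decay ** lesson)
```

The part that took the tuning, per the post above, is deciding when to advance `lesson`: advancing it on a success-rate threshold (e.g. reaching the target in most recent episodes) matches the "decent policy first, then squeeze the budget" ordering described.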

    Penalizing large changes in joint angles
    This change is a pretty easy one but doesn't produce a good gait. I managed to remove a lot of the jitter, but what I got instead was a robot that used odd, minimized jumping movements instead of walking. It makes sense conceptually after seeing the result: when the robot is in the air it can freeze its joint angles in one position to avoid the penalty while still moving forward. My conclusion was that this would be a great way to train some sort of frog robot o_O
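That per-step penalty can be sketched as follows; the squared-delta form and the coefficient are assumptions to be tuned, not a known-good value:

```python
import numpy as np

def smoothness_penalty(prev_angles, new_angles, coeff=0.05):
    """Hypothetical penalty on large joint-angle changes between steps:
    sum of squared deltas, scaled by a tunable coefficient. Subtract the
    result from the task reward each step."""
    delta = np.asarray(new_angles) - np.asarray(prev_angles)
    return coeff * float(np.square(delta).sum())
```

Because the penalty is zero whenever the joints don't move, it rewards exactly the airborne "freeze the joints" exploit described above; that's the loophole that produced the jumping gait.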

    Training for extremely large step amounts
    Whatever your max_steps is for this task, multiply it by 10. I saw some minor improvements in jitter just by massively increasing the step count. This won't fix the problem on its own, but if you add something like the energy cost outlined above for a more stable solution, the policy will need the extra steps anyway.
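In the trainer config, that change is just the max_steps field. A hedged fragment (the behavior name and the base budget are placeholders; keep the rest of your existing config.yaml as-is):

```yaml
behaviors:
  RobotWalker:                # placeholder behavior name
    trainer_type: ppo
    max_steps: 50000000       # e.g. ten times a typical 5.0e6 budget
```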

    Things I tried which did not work:
    • Most reward sculpting techniques (only exception was the above simple joint angle change penalty)
    • Changing continuous actions on the joints to be discrete actions
    • Requesting actions less often (performance on the actual task tanked every time)
    Let me know what you try and whether it's effective; I do plan to return to this when I can. It's a really fun training problem!
     
    Leben06 likes this.