
Question NEEDED: Advise regarding rewards and signals

Discussion in 'ML-Agents' started by Xysch, May 12, 2020.

  1. Xysch


    Sep 2, 2013

    I would like to thank you ahead of time for taking a look at my problem. To cover the basics, I am using Unity 2019.2.8, ML Agents version 0.14.1, and Python 3.6. I can update the ML Agents version if it is believed to help improve my results.


    For the environment, at a base it is an infinite runner. I would like my agent to jump through hoops which come along the track and avoid collision with the outer edges of the hoop. These hoops all have a random height set between a maximum and minimum and have varying distances between each other. For the training environment I have included 4 hoops to train on which change position and height every time the environment resets.
    There are 3 different trajectories in which the agent can go through the hoop:

    1. Reach the maximum height before the hoop and go through on the downward descent
    2. Reach the maximum height while going through the hoop
    3. Reach the maximum height after the hoop and go through on the upward slope

    Here's a crudely drawn image for reference:

    The ideal trajectory would obviously be 2, as this reduces the chance of collision with the edges. Trajectory 1 or 3 should be chosen when two hoops are close enough together that there is not enough room to jump using trajectory 2. Here's a drawn example where the agent would need to jump using trajectory 3 in order to make the following hoop:

    Rewards and Signals

    I have been struggling to get my desired results and am unsure of where I am going wrong. At first I was requesting decisions continuously, but quickly realized that the reward would be given much later than when the decision was made: the agent decides to jump, then receives no reward until after passing the hoop, which takes some amount of time. I believe the agent would not be able to correlate the action with a reward delayed that long.

    So I switched to discrete, on-demand decisions and now request an observation only when the agent is within the maximum and minimum slopes of the current hoop and has not yet attempted to jump. To get the minimum and maximum slopes, I recorded the distances at which I was able to heuristically jump through the hoop. Something to note is that the lower hoops have a greater slope range than the higher hoops. This is largely because the agent can jump later at a lower hoop and still make it through, and can also jump earlier with greater force. Here's a photo for clarification showing the difference in jumpable ranges between high and low hoops using trajectories 1 and 3:

    To start, I would like the agent to simply learn how to jump using trajectory 2 and then later on have it learn how to make multiple hoops close to one another. I set up my signals which total up to 11 observations as such:

    1. A Vector3 observation of the agent's position
    2. A float observation of the agent's current slope in relation to the closest hoop
    3. An int observation of the number of jumps the player has left (I'd like to add a double jump feature later on)
    4. A Vector3 observation of the closest hoop (hoop 1)
    5. A Vector3 observation of the second closest hoop (hoop 2)
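In ML-Agents 0.14 terms, a sketch of how these could be collected (field names like hoop1, hoop2, currentSlope, and jumpsLeft are illustrative placeholders, not the actual code):

```csharp
using MLAgents;
using UnityEngine;

public class HoopRunnerAgent : Agent
{
    public Transform hoop1;     // closest hoop (placeholder reference)
    public Transform hoop2;     // second closest hoop
    int jumpsLeft;              // remaining jumps (for a future double jump)
    float currentSlope;         // slope toward the closest hoop

    public override void CollectObservations()
    {
        AddVectorObs(transform.position);   // 3 floats: agent position
        AddVectorObs(currentSlope);         // 1 float:  slope to closest hoop
        AddVectorObs(jumpsLeft);            // 1 int:    jumps remaining
        AddVectorObs(hoop1.position);       // 3 floats: closest hoop
        AddVectorObs(hoop2.position);       // 3 floats: second closest hoop
        // Total: 11 values, matching the Behavior Parameters vector size.
    }
}
```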

    I set up the rewards as such for the training environment with 4 hoops:

    1. +0.25 for a perfect jump through hoop (no collisions with outer edges)
    2. +0.05 for partial jump through hoop (collision with an edge)
    3. -0.25 for not attempting to jump
    4. -0.15 for jumping but not making it through the hoop at all (missed it entirely, no collision)
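Roughly, these reward hooks look like the following (method and parameter names are illustrative, not my exact code):

```csharp
// Called when the agent passes through a hoop's opening.
void OnHoopPassed(bool touchedEdge)
{
    AddReward(touchedEdge ? 0.05f : 0.25f);  // partial vs. perfect jump
}

// Called when the agent fails to pass through a hoop.
void OnHoopMissed(bool jumped)
{
    AddReward(jumped ? -0.15f : -0.25f);     // jumped and missed vs. never jumped
}
```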

    I was able to achieve about 86% accuracy with this setup. The major problem, however, was that the agent never chose trajectory 2. It would always jump as soon as it entered the jumpable slope range and was allowed to make observations, so its jumps looked like trajectory 1. Because of this, it would often miss the lower hoops. When I removed the lower hoops from the test scenario, it achieved results near 100% accuracy.

    I have tried numerous things to combat this such as:
    1. Giving rewards for minimizing y-velocity (should be close to 0 for trajectory 2)
    2. Giving punishments for y-velocity outside set bounds
    3. Giving rewards for waiting to jump until the ideal slope
    4. Giving rewards for jumping linearly within +/-10% of the hoop's height
    5. Giving punishments for jumping outside +/-10% of the hoop's linear height
    6. Giving rewards for jumping within +/-10% of the ideal slope's distance from the hoop
    7. Giving punishments for jumping outside +/-10% of the ideal slope's distance from the hoop

    All of these have yielded the same result: the agent jumps as soon as it can make an observation. I've reached a point where I am unsure what else to try. If my thought process or anything else seems flawed, please let me know. I am willing to take any input or criticism.

    Thank you again and feel free to ask any questions.
    Last edited: May 12, 2020
  2. mbaske


    Dec 31, 2017
    I think you're trying to micro-manage your agent by rewarding behaviours, rather than achievements. The idea is to set a goal and let the agent figure out how to get there. Given a large enough time_horizon, the agent should be able to cope with late rewards and infer which past actions were required to achieve them. Rewarding behaviour imo is kind of like introducing a hidden heuristic, because you're telling the agent how to reach a goal, rather than what that goal actually is. What's your runner's goal? Is it running speed? Is it reaching waypoints along the track? Or is it getting points by jumping through hoops? Are the hoops obstacles, or can the agent still achieve its goal by ignoring them?
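    For reference, time_horizon lives in your trainer config YAML; something along these lines (behavior name and values are just placeholders to tune, not a recommendation):

```yaml
HoopRunner:               # behavior name (placeholder)
  trainer: ppo
  time_horizon: 128       # long enough to span jump decision -> hoop pass
  batch_size: 1024
  buffer_size: 10240
  max_steps: 5.0e5
  reward_signals:
    extrinsic:
      strength: 1.0
      gamma: 0.99         # high discount so delayed hoop rewards still count
```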

    The agent should probably request observations and actions at some regular interval, maybe even at every step. If you're limiting the decision window to points that you deem critical, then again, you're telling the agent how to do its job. Make sure the observations are in the agent's local space and normalize them. The runner probably doesn't need to know about its own position, only about the relative positions of objects in its vicinity.
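    For example, a minimal local-space, normalized observation sketch (the hoop references and the 30-unit sensing range are assumptions you'd tune for your track):

```csharp
public override void CollectObservations()
{
    // Offsets to the hoops in the agent's local frame, divided by an
    // assumed maximum sensing range so values stay roughly in [-1, 1].
    const float maxRange = 30f;  // placeholder: tune to your hoop spacing
    Vector3 toHoop1 = transform.InverseTransformPoint(hoop1.position) / maxRange;
    Vector3 toHoop2 = transform.InverseTransformPoint(hoop2.position) / maxRange;
    AddVectorObs(toHoop1);
    AddVectorObs(toHoop2);
}
```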