Need some suggestions

Discussion in 'ML-Agents' started by Clonus, Jun 28, 2020.

  1. Clonus

    Clonus

    Joined:
    May 26, 2015
    Posts:
    64
    So, the bot I am attempting to train does not act as expected.

    Here are the reward categories.

    1) -1 if a bot discharges its gun while not directly targeting the "target"
    2) +5 if it discharges its weapon while directly targeting the "target"
    3) An increasing reward (-1 to 1) depending on the angle towards the direction of the target (180 degrees / 180)
    4) An increasing reward (if it is within a set range) which increases as it moves closer to the target location.

    As for observations, I am sending details on the angle, the distance and whether or not it's optimal to discharge the weapon.

    I find that the aiming seems pretty good; however, after some time they stop shooting.

    Also, for the most part they decide it's better to just stay stationary, no matter where the target is, and look towards it.

    Here is my config.

    behaviors:
      EnemyBehavior:
        trainer_type: ppo
        hyperparameters:
          batch_size: 2024
          buffer_size: 20240
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 512
          num_layers: 3
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        keep_checkpoints: 5
        max_steps: 10000000
        time_horizon: 1000
        summary_freq: 30000
        threaded: true
     
  2. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    231
    A couple of thoughts on this:
    Lower the overall reward range for more stable training; it shouldn't exceed -1/+1.
    Reward achieving goals, not behaviour. The point is to let the agent figure out an optimal strategy on its own. In this case, you could try only rewarding the agent for hitting the target. I'm aware this can be hard to train, because at first hits are rare, resulting in only sparse rewards. Using curiosity might help, but I'd rather go for some kind of curriculum: start training with a target that's fairly easy to hit, perhaps one that's very close and within a limited range of angles. Make the problem harder later on, once the agent has learned what it needs to do.
    To prevent the agent from firing its weapon constantly, you might try limiting its ammunition. The ammo could restore itself at regular intervals, perhaps adding a bullet per second until the ammo count is back at maximum. The agent would then need to observe the current ammo count as a fraction current_ammo_count / max_ammo_count. If you penalize missed shots on the other hand, you'll need to make sure penalties are small enough as to not discourage the agent from shooting at all.
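A minimal sketch of the regenerating ammo idea described above, written as plain C# with no Unity dependency. The class name `AmmoClip` and its members are hypothetical, not part of ML-Agents; the one-bullet-per-interval refill rule is taken from the post, the rest is an assumption.

```csharp
using System;

// Hypothetical helper: a clip that refills one round per interval.
public sealed class AmmoClip
{
    private readonly int maxAmmo;
    private readonly float secondsPerBullet;
    private float regenTimer;

    public int Current { get; private set; }

    public AmmoClip(int maxAmmo, float secondsPerBullet)
    {
        if (maxAmmo <= 0) throw new ArgumentOutOfRangeException(nameof(maxAmmo));
        if (secondsPerBullet <= 0f) throw new ArgumentOutOfRangeException(nameof(secondsPerBullet));
        this.maxAmmo = maxAmmo;
        this.secondsPerBullet = secondsPerBullet;
        Current = maxAmmo; // start with a full clip
    }

    // Returns false (and fires nothing) when the clip is empty.
    public bool TryFire()
    {
        if (Current == 0) return false;
        Current--;
        return true;
    }

    // Advance the regeneration timer; call once per physics step.
    public void Tick(float deltaTime)
    {
        if (Current < maxAmmo)
        {
            regenTimer += deltaTime;
            while (regenTimer >= secondsPerBullet && Current < maxAmmo)
            {
                regenTimer -= secondsPerBullet;
                Current++;
            }
        }
        // No time is banked while the clip is full.
        if (Current >= maxAmmo) regenTimer = 0f;
    }

    // Normalized observation in [0, 1]: current_ammo_count / max_ammo_count.
    public float Fraction => (float)Current / maxAmmo;
}
```

In an agent, `Tick` would presumably be driven from `FixedUpdate` with `Time.fixedDeltaTime`, and `Fraction` passed to `sensor.AddObservation(...)` inside `CollectObservations`.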
    Finally, make sure observations are in the agent's local space, rather than in world space.
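To illustrate what "local space" means here, the sketch below converts a world-space point into the agent's own frame using `System.Numerics`, so it runs outside Unity. The class and method names are hypothetical. Inside Unity, `transform.InverseTransformPoint(worldPoint)` performs the equivalent conversion (and also accounts for scale).

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not a Unity or ML-Agents API.
public static class LocalSpace
{
    // Translate by the agent position, then undo the agent rotation.
    // The result describes the point relative to the agent, so the same
    // relative situation produces the same observation anywhere in the world.
    public static Vector3 ToLocal(Vector3 agentPosition, Quaternion agentRotation, Vector3 worldPoint)
    {
        Vector3 offset = worldPoint - agentPosition;
        return Vector3.Transform(offset, Quaternion.Inverse(agentRotation));
    }
}
```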
     
  3. Clonus

    Clonus

    Joined:
    May 26, 2015
    Posts:
    64
    hey @mbaske appreciate the guidance here!! thank you so much...

    I'll give that a try; however, the process is unclear to me from the details above.

    Do you suggest I start with a simple task first, get that to work, then add tasks to the same model and resume?

    Or do you mean add tasks (goals) and start over again with each newly added goal, so as not to make it too complex initially, and adjust as I add new ones?


    Also, in terms of rewards, when you say -1 to 1, do you mean per reward or per call to the Action function?

    So after I receive the actions, I take action then calculate rewards. In the case where there may be multiple factors like direction, distance and firing a gun, to me, these are separate things. I reward for each one.

    As you said in your previous post, don't reward behaviour, I assume that most of those items above are behaviour. I can restrict my rewards to periodic distance calculations?

    thanks
     
  4. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    231
    No problem. I'm aware of three options:
    1) Start simple (e.g. target is easy to hit), and once TensorBoard indicates that agents have successfully learned to hit it, pause training, increase difficulty and resume training. Repeat the process if required ("manual" approach).
    2) Provide a curriculum as described here https://github.com/Unity-Technologi...ocs/Training-ML-Agents.md#curriculum-learning
    In your use-case, the "parameters" config field could contain distance ranges, in which targets can spawn.
    3) Implement an automatic curriculum that adjusts difficulty depending on agent skill. Let's say you've set a maximum episode length and count the number of hits your agent scores per episode. Again, you'd start with nearby targets, but the range in which they can spawn in each new episode is controlled by the automatic curriculum. There would have to be some proportional dependency between hit ratio and spawn range. You might want to average the hit ratios over several episodes, in order to prevent the difficulty from fluctuating too much.
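Option 3 could be sketched roughly as follows. This is plain C# with hypothetical names (`AutoCurriculum`, `RecordEpisode`, `SpawnRange`). It assumes "hit ratio" means hits divided by shots fired in an episode and uses a simple linear mapping from the averaged ratio to the spawn range; both are assumptions, not something the post specifies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: scales target spawn range with recent agent skill.
public sealed class AutoCurriculum
{
    private readonly Queue<float> recentRatios = new Queue<float>();
    private readonly int window;
    private readonly float minRange;
    private readonly float maxRange;
    private float sum;

    public AutoCurriculum(int window, float minRange, float maxRange)
    {
        if (window < 1) throw new ArgumentOutOfRangeException(nameof(window));
        if (maxRange < minRange) throw new ArgumentException("maxRange must be >= minRange");
        this.window = window;
        this.minRange = minRange;
        this.maxRange = maxRange;
    }

    // Call at the end of every episode.
    public void RecordEpisode(int hits, int shots)
    {
        float ratio = shots > 0 ? (float)hits / shots : 0f;
        recentRatios.Enqueue(ratio);
        sum += ratio;
        if (recentRatios.Count > window) sum -= recentRatios.Dequeue();
    }

    // Averaged over the last `window` episodes to damp fluctuations.
    public float AverageHitRatio => recentRatios.Count == 0 ? 0f : sum / recentRatios.Count;

    // Linear mapping: ratio 0 gives minRange, ratio 1 gives maxRange.
    public float SpawnRange => minRange + (maxRange - minRange) * AverageHitRatio;
}
```

The agent would read `SpawnRange` when placing the target at the start of each new episode. A larger window makes the difficulty change more slowly.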

    The sum of all rewards per agent decision step shouldn't exceed -1/+1. If your decision period is 5 and you add rewards at every OnActionReceived call, then they should be small enough each (-0.2/+0.2) as to not add up beyond the -1/+1 limit, once CollectObservations is called again.
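The arithmetic above can be written down as a small helper. The names are hypothetical; the only facts used are the ones stated in the post: with a decision period of N, there are N action steps per decision, so each step's reward should stay within 1/N in magnitude.

```csharp
using System;

// Hypothetical helper for keeping per-decision reward within -1/+1.
public static class RewardBudget
{
    // Largest per-step magnitude so that N steps sum to at most `limit`.
    public static float PerStepCap(int decisionPeriod, float limit = 1f)
    {
        if (decisionPeriod < 1) throw new ArgumentOutOfRangeException(nameof(decisionPeriod));
        return limit / decisionPeriod;
    }

    // Clamp a raw per-step reward into [-cap, +cap].
    public static float ClampStepReward(float raw, int decisionPeriod, float limit = 1f)
    {
        float cap = PerStepCap(decisionPeriod, limit);
        return Math.Max(-cap, Math.Min(cap, raw));
    }
}
```

With a decision period of 5, `PerStepCap(5)` gives 0.2, matching the -0.2/+0.2 figure above.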

    Yes, I would try to make it as simple as possible at first and only reward target hits, assuming that's your agent's goal. Let the agent find out for itself what angles and distances are optimal, rather than telling it how to achieve its goal by rewarding behaviour.
    Or, if you want to speed up training, you could set a reward that's inversely proportional to the distance between the target and the position the bullet ended up in. This way, the agent gets feedback for every shot fired and should learn targeting pretty quickly. That might even dispense with the need for a curriculum.
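One possible shape for such a per-shot reward is sketched below. The post only says "inversely proportional to the distance"; the specific formula `maxReward / (1 + d / falloff)`, the parameter names, and the defaults are assumptions chosen so the value stays bounded in (0, maxReward] and never divides by zero.

```csharp
using System;

// Hypothetical reward shaping helper, not an ML-Agents API.
public static class ShotReward
{
    // missDistance: distance between the target and where the bullet ended up.
    // falloff: distance at which the reward has dropped to half of maxReward.
    public static float FromMissDistance(float missDistance, float falloff = 1f, float maxReward = 1f)
    {
        if (falloff <= 0f) throw new ArgumentOutOfRangeException(nameof(falloff));
        float d = Math.Max(0f, missDistance); // guard against negative input
        return maxReward / (1f + d / falloff);
    }
}
```

A direct hit (distance 0) yields `maxReward`, and the reward shrinks smoothly as shots land farther away, so every shot provides some learning signal.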
     
  5. Clonus

    Clonus

    Joined:
    May 26, 2015
    Posts:
    64
    @mbaske Hey, following up: I tried what you suggested and it's working so much better. Enemies chase and discharge weapons at targets of opportunity only. They won't fire amongst themselves even when clustered up.

    The next problem I face is obstacle avoidance. I've added blocks here and there, and I also want them to not push each other around and run in a cluster.

    What do you feel is a recommended solution here?

    In OnCollisionEnter I am calling AddReward(x). I have already tried fixed values of .001, .01 and .1 for x, but I don't seem to be getting any great results from that.

    Any suggestions on that?

    Also, any recommendations on when to use EndEpisode properly?

    thanks a ton for all your help!
     
  6. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    231
    Glad to hear it!
    How are you detecting obstacles? Are you using raycasts?
    Just to make sure: you mean penalties, right? So that would be -.001 -.01 and -.1
    Penalizing collisions should work.
    Again, there's a possible shortcut for training, but it's not always applicable. If you're using raycasts, you could set penalties in relation to the measured distances to obstacles - the shorter the distance, the bigger the penalty. This way, the agent should learn quickly that it needs to stay away from obstacles without having to collide with them many times. A possible problem here is that you might want your agent moving close to obstacles - perhaps in order to hide, take cover or simply because it's the shortest route. In that case, penalizing collisions is probably better.
    As for EndEpisode: it depends on the situation, really. You'll need to be careful with penalties when you're letting the agent control EndEpisode() in open-ended episodes, that is, if you set Max Step to 0. If your rewards & penalties sum up to a value < 0, the agent might decide its best strategy is to end episodes as quickly as possible.
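The raycast shortcut could look something like the sketch below. It assumes each ray reports a normalized distance in [0, 1], where 1 means nothing was hit within the ray length, and it penalizes only the nearest obstacle with a linear ramp. The helper name, the linear ramp, and the nearest-only choice are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps ray distances to a small negative reward.
public static class ProximityPenalty
{
    // normalizedDistances: one value per ray, 0 = touching, 1 = clear.
    // maxPenalty: magnitude applied when an obstacle is at distance 0.
    public static float Compute(IReadOnlyList<float> normalizedDistances, float maxPenalty)
    {
        float nearest = 1f;
        foreach (float d in normalizedDistances)
        {
            float clamped = Math.Max(0f, Math.Min(1f, d));
            nearest = Math.Min(nearest, clamped);
        }
        // The shorter the distance, the bigger the penalty.
        return -maxPenalty * (1f - nearest);
    }
}
```

Keeping `maxPenalty` small relative to the goal reward matters here, for the same reason given earlier in the thread about not discouraging the agent from acting at all.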
     
    Last edited: Jul 1, 2020 at 2:36 PM
  7. Clonus

    Clonus

    Joined:
    May 26, 2015
    Posts:
    64
    "How are you detecting obstacles? Are you using raycasts?"

    Yes, I am using the Ray Perception Sensor 3D but I am decreasing rewards on the OnCollisionEnter method. I check to see what collision object it is and only decrease based on those objects.

    "Just to make sure: you mean penalties, right? So that would be -.001 -.01 and -.1"
    Yes negative.


    "Again, there's a possible shortcut for training, but it's not always applicable. If you're using raycasts, you could set penalties in relation to the measured distances to obstacles - the shorter the distance, the bigger the penalty. This way, the agent should learn quickly that it needs to stay away from obstacles without having to collide with them many times. A possible problem here is that you might want your agent moving close to obstacles - perhaps in order to hide, take cover or simply because it's the shortest route. In that case, penalizing collisions is probably better."

    I tried this with success on a different model. I used an AnimationCurve that went from -1 to 1 and created a bowed curve. This seemed to work well, and the agent would actually maintain distance since I drop the curve 20% near 1. So basically, the closer it got, the higher the reward; however, if it got too close there was a drop, so it learned to go to the location but keep its distance.

    However, wouldn't this be considered a reward based on behaviour?


    "Depends on the situation really. You'll need to be careful with penalties when you're letting the agent control EndEpisode() in open-ended episodes, that is if you set Max Step to 0. If your rewards & penalties sum up to a value < 0, the agent might decide its best strategy is to end episodes as quickly as possible."

    Got it thanks

    It's retraining right now with this

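    private void OnCollisionEnter(Collision collision)
    {
        // Only penalize contact with walls, blocks and other bots.
        GameObject other = collision.gameObject;
        if (!other.CompareTag("Wall") && !other.CompareTag("Block") && !other.CompareTag("Bot"))
            return;

        // One small penalty per contact point, as in the original loop.
        // The float cast matters: -1 / MaxStep is integer division and
        // evaluates to 0 for any MaxStep > 1, so no penalty would be applied.
        AddReward(-collision.contactCount / (float)MaxStep);
    }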

    Hopefully, this results in something useful...
     
  8. Clonus

    Clonus

    Joined:
    May 26, 2015
    Posts:
    64
    The above works well; it does a decent job of walking around objects to get to the target.

    so, now... @mbaske How do I train 2 models at the same time? :)
     