
Question Package ml-agents: how reward system works, observed parameters, statistics manipulation, network configuration

Discussion in 'ML-Agents' started by legitimatesd, Jun 10, 2023.

  1. legitimatesd

    legitimatesd

    Joined:
    Nov 5, 2020
    Posts:
    2
    Good day. I apologize in advance for my English. It is not my native language, so I may make mistakes. I am using the 'ml-agents' package as a learning tool. After several weeks of working with it, I have a few questions for which I unfortunately couldn't find answers.

    1. When do we assign rewards, and how does our model understand what it was rewarded for? For example, when DecisionPeriod = 1 it is clear: the reward was obtained at this step, and the observations recorded the values at which it was obtained, which makes sense. But say DecisionPeriod = 5 and the reward was obtained at the 3rd step. By the time we get back to CollectObservations, our parameters will be different from what they were at step 3. Looking at examples set up this way, I became confused about how this works.

    2. Should positive and negative rewards go hand in hand? Let me give an example. If the bot crashes into a wall and we punish it, then once it moves away and OnCollisionExit is triggered, should we reward it, or is it sufficient to just punish it and show that such behavior is not desirable?

    3. How do we determine which data to include in the observer and which data might be unnecessary? For example, if I want to hit a moving target, is it enough to pass the target's position relative to my character and its velocity through AddObservation, or do I also need the distance since the shooting range is limited? Or another example: if a player crashes into a wall, and we penalize them for it, should we track not only the position of the character but also the angle at which it was facing?

    4. In TensorBoard, we can visualize data from our training. Is it possible to add an event marker to the rewards to indicate the event for which they were obtained, so that it can also be seen on the graph?

    5. What factors should we consider when configuring the network? What are the criteria for increasing the number of layers and neurons?
    I would appreciate any assistance.
     
  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    The best case is to assign the reward immediately, at the moment the agent achieves the desired goal. The longer the delay between behavior and reward, the more difficult it is for the agent to correlate the action + observation space with the reward signal.

    The model doesn't ever 'understand' what it was rewarded for. Training in gradient-based policies only nudges probability distributions in a way that makes actions that (in specific observation contexts) give rewards more likely to happen. The less complex the reward signal is, the fewer nudges it takes to reach a good policy. The real 'gotcha' here is that it needs to be possible to reach some sort of reward signal with a totally random policy, otherwise training is probabilistically impossible.

    Now, knowing that the policy updates only provide small nudges in probability, how would the agent be able to correlate any delayed rewards at all? This is where things get complicated and a little intuition is required. The observed rewards, combined with the action probabilities, yield a reward signal. To determine the update direction that will improve the objective function, policy gradient methods rely on gradients ∇_θ (a vector of derivatives). With the reward signal and the update direction derived, we apply the learning rate to decide how much to update the policy.
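
    For reference, the textbook (REINFORCE-style) form of what is described above is

    ∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · R_t ],    θ ← θ + α ∇_θ J(θ)

    where π_θ(a_t | s_t) is the probability the policy assigned to the action it took, R_t is the (possibly delayed) return observed afterwards, and α is the learning rate that controls how much the policy is updated. PPO, which ML-Agents uses by default, adds clipping and an advantage estimate on top of this, but the idea of nudging the policy toward actions that led to reward is the same.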

    All of that is the mathematical equivalent of saying that the link between actions + observations and a delayed reward is learned by brute force and repetition. A policy without a delay in rewards learns somewhat more easily, because it conceptually has to sample less to find the distribution that favors the actions, in the contexts, that gave the reward; a policy with a delay will still find that distribution, it just needs more sample steps.
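
    Practically, in ML-Agents this means you can call AddReward at the moment the event happens even when DecisionPeriod > 1: rewards are accumulated between decisions and credited to the action taken at the last decision step. A minimal sketch (the class and tag names are just placeholders):

    Code (CSharp):

    using Unity.MLAgents;
    using UnityEngine;

    // Reward at the moment of the event, not at the next decision step.
    // With DecisionPeriod = 5, a reward earned at step 3 of the period is
    // accumulated and credited to the action chosen at the last decision.
    public class TargetHitAgent : Agent
    {
        private void OnCollisionEnter(Collision collision)
        {
            if (collision.collider.CompareTag("Target"))
            {
                AddReward(1.0f);
            }
        }
    }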
     
    legitimatesd likes this.
  3. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Both techniques work in practice.

    Which one will work better for your specific application depends on the task and how much 'control' you want over the end policy. Reward shaping is the term for giving rewards (positive or negative) periodically throughout an episode. Only punishing a bot for crashing into a wall may lead to a policy that never moves, so a positive reward given when it moves away will offset the negative signal and make training easier.

    The downside to reward shaping is that your bias becomes permanently ingrained in the environment and may prevent any agent from discovering more optimal policies you did not think of for completing the task. Also, agents tend to find ways to 'cheat' reward shaping: in your example, if the bot figured out a way to approach the wall without crashing and then quickly turn around to get the 'moving away from wall' reward, it would just optimize that repeated behavior instead of whatever task you had actually intended for it to optimize.
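
    A sketch of the two options discussed above (names and reward values are illustrative; the shaped reward is kept small relative to the penalty so it is harder to farm):

    Code (CSharp):

    using Unity.MLAgents;
    using UnityEngine;

    public class WallAvoidAgent : Agent
    {
        // Option 1: sparse signal - only punish the undesired behavior.
        private void OnCollisionEnter(Collision collision)
        {
            if (collision.collider.CompareTag("Wall"))
            {
                AddReward(-1.0f);
            }
        }

        // Option 2: reward shaping - also give a small positive reward for
        // moving away again. Watch training for the 'brush the wall and
        // back off' exploit described above.
        private void OnCollisionExit(Collision collision)
        {
            if (collision.collider.CompareTag("Wall"))
            {
                AddReward(0.05f);
            }
        }
    }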
     
    legitimatesd likes this.
  4. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    The agent needs to know everything you would need to know in order to perform the same task yourself. This depends entirely on the complexity of the environment and the complexity of the task, but my simple heuristic is to give it more information than I think it needs until I can prove it's learning a decent policy. Once I've proved that the task can be learned, I start removing observations that I think aren't very useful and watch the training results for a decline (adding them back in if the decline is too severe).

    Most commonly, the number of observations required is underestimated. So let's break down what an agent would need to 'know about' in your example. Say we have a gun and a moving target some distance away. We'll assume the agent is only the gun, for simplicity.
    • Agent
      • position (so it can derive the position of other objects)
      • orientation (so it can correlate with adjustment actions for aiming)
      • velocity (so it can derive which direction it's come from and which direction it will move in the future)
      • speed - (OPTIONAL - the agent can infer its speed based on position and velocity but providing it already calculated will speed up training)
    • Target
      • position (so the agent can derive the target's location relative to itself)
      • orientation (a sphere target is symmetrical from any orientation but a thin circular target is not and shots will miss if fired at the wrong orientation)
      • velocity (so the agent can derive which direction it's come from and which direction it will move in the future)
      • speed - (OPTIONAL - speeds up training)
      • distance to target (OPTIONAL - the agent can infer the distance based on other observations but just giving it the distance will speed up training)
    • The projectile
      • position
      • orientation
      • velocity
      • speed - (OPTIONAL)
      • rate of decay or drop (OPTIONAL)
    • We'll also need a raycast from the agent to the target to detect anything between (blocking) them, which requires observations for:
      • collision detected
      • collider tag (can this object be shot through?)
      • collider position
      • collider orientation
      • collider velocity
      • speed (OPTIONAL)
    I believe all of that would give the fastest possible training run, but that may not be the goal. As noted, some observations can be removed at the cost of additional training steps, which is necessary in situations where the policy will not have access to those measurements when deployed, as in robotics.
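
    For the first part of that list, a minimal CollectObservations sketch (field names like 'target' are placeholders; the projectile and raycast observations are omitted for brevity):

    Code (CSharp):

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    public class ShooterAgent : Agent
    {
        public Rigidbody target;     // assign the moving target in the Inspector
        private Rigidbody body;

        public override void Initialize()
        {
            body = GetComponent<Rigidbody>();
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            // Agent: position, orientation, velocity
            sensor.AddObservation(transform.localPosition);
            sensor.AddObservation(transform.localRotation);
            sensor.AddObservation(body.velocity);

            // Target: position, velocity, plus the optional pre-computed distance
            sensor.AddObservation(target.transform.localPosition);
            sensor.AddObservation(target.velocity);
            sensor.AddObservation(Vector3.Distance(transform.position, target.transform.position));
        }
    }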
     
    legitimatesd likes this.
  5. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Yes, but not through the rewards; just call the StatsRecorder when the reward is assigned - https://github.com/Unity-Technologi...sing-Tensorboard.md#custom-metrics-from-unity
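
    A minimal sketch of that (the metric name and event are illustrative); the recorded value shows up as its own plot in TensorBoard alongside the reward curves:

    Code (CSharp):

    using Unity.MLAgents;
    using UnityEngine;

    public class CrashTrackingAgent : Agent
    {
        private void OnCollisionEnter(Collision collision)
        {
            if (collision.collider.CompareTag("Wall"))
            {
                AddReward(-1.0f);
                // Record a custom metric at the moment the penalty is given.
                Academy.Instance.StatsRecorder.Add("Events/WallCrash", 1f);
            }
        }
    }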

    This is a decision based on intuition and experience; the only general guidance is that a larger network is required for more complex tasks. Deep RL doesn't really have the same problems with overfitting as supervised learning, though, so oversizing isn't usually a problem. I usually just start bigger than I think is necessary and reduce until testing shows a decline.
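
    In ML-Agents those sizes live in the trainer config YAML under network_settings; a sketch with common starting values (the behavior name and numbers are just a starting point, not a recommendation specific to your task):

    Code (YAML):

    behaviors:
      MyBehavior:              # must match the Behavior Parameters name on the agent
        trainer_type: ppo
        network_settings:
          hidden_units: 256    # width; increase for more complex observation/action spaces
          num_layers: 2        # depth; 2-3 is typical
          normalize: true      # helpful for continuous observations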
     
    legitimatesd likes this.