Multiple Environments, Time Horizon, Batch-Size and On-Policy-Training

Discussion in 'ML-Agents' started by unity_-DoCqyPS6-iU3A, Apr 5, 2021.

  1. unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    19
    Hello everyone,

    I could use some insight into how ml-agents deals with trajectories that include observations from a previous policy (in my case because "number of arenas" x "time_horizon" > "buffer_size").

    My environment has a sparse reward (sparse meaning roughly one major reward per 20000-timestep episode).
    I use reward shaping to give more immediate rewards to assist training.

    My current processor's sweet spot for the most actions/s is at num-envs=2 with ~32 arenas each.

    I set my time_horizon to 600 (because I think that this is the number of crucial actions that lead to the reward in most cases)
    and my buffer_size to 20480 (a compromise between the number of unique trajectories and the number of policy updates I can get out of the collected action/observation/reward triplets).

    So, with 64 arenas running simultaneously, my buffer fills after just 320 time-steps (20480 / 64), which is shorter than the time_horizon of 600.

    My agents are moving all the time (even when the policy is updating), so I'm assuming that the buffer is filled with actions generated from the previous policy at that time.

    Since PPO is on-policy, here is what I think ml-agents could be doing in the background:

    1) get the full 600-step trajectory, but include action/observation/reward triplets from the previous policy for the first 280 steps (thus doing the next policy update with mismatched data, slightly violating the on-policy assumption)
    2) only process the 320 timesteps that were collected with the current policy (throwing away the other 280 action/observation/reward triplets)
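
    The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch: it assumes one agent per arena (not stated explicitly above), and apart from time_horizon and buffer_size the variable names are illustrative, not ml-agents settings.

```python
# Values quoted in the post. Assumption: exactly one agent per arena.
num_envs = 2
arenas_per_env = 32
time_horizon = 600
buffer_size = 20480

# Number of agents producing one experience per time-step.
agents = num_envs * arenas_per_env

# Time-steps per agent until buffer_size experiences have been generated.
steps_to_fill = buffer_size / agents

# Experiences generated if every agent runs one full time_horizon.
experiences_per_horizon = agents * time_horizon

print(agents, steps_to_fill, experiences_per_horizon)  # 64 320.0 38400
```

    In other words, one full time_horizon across all arenas produces 38400 experiences, which is nearly twice the configured buffer_size.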

    My agent *does* train, so what I'm doing can't be *that* wrong, but I'm wondering if ml-agents can even use the additional action/observation/reward-triplets or if it's even counter-productive?


    And, yes, it would be easy for me to increase the buffer-size or decrease the number of arenas.
    But I will get a new processor soon, and I was hoping that I could just run more environments in parallel to speed up training.
    Depending on the inner workings of ppo/ml-agents that may not be possible.
     
  2. unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    19
    I skimmed over the examples and "WallJump" has the following default-settings:

    https://github.com/Unity-Technologi...7ff6cbb6ca5f1172a/config/ppo/WallJump.yaml#L6
    buffer_size: 2048
    time_horizon: 128

    and the number of arenas in the example-scene is 24.

    Which means that in this case
    "number of arenas" x "time horizon" is also larger than "buffer_size".

    At least that shows me that the question I have is not just due to my hyperparameters, but applies to a broader set of training-scenarios.
     
  3. ruoping_unity

    Unity Technologies

    Joined:
    Jul 10, 2020
    Posts:
    77
    The agent can run more than one time_horizon between updates.
    For PPO, a policy update is only triggered once the buffer has been filled with enough data to train on, and the buffer is cleared after every update. All the data therefore comes from the current policy, and the training is on-policy.

    You also mentioned "My agents are moving all the time (even when the policy is updating), so I'm assuming that the buffer is filled with actions generated from the previous policy at that time". I don't think this is exactly correct in ml-agents. While the policy is updating you can actually see the simulation freeze for a short period of time and resume after the update is completed.
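
    The collect / update / clear cycle described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the ml-agents implementation: all agents start in sync, no episode ends mid-horizon, each agent hands over its trajectory once it reaches time_horizon steps, and the simulation is paused while the policy updates. The function name `simulate` and its bookkeeping are hypothetical.

```python
def simulate(num_agents, time_horizon, buffer_size, total_steps):
    """Toy on-policy loop; returns (step, experiences used, all on-policy?) per update."""
    policy_version = 0
    buffer = []                                  # policy version behind each stored experience
    pending = [[] for _ in range(num_agents)]    # per-agent trajectories still being collected
    updates = []

    for step in range(1, total_steps + 1):
        for trajectory in pending:
            trajectory.append(policy_version)    # one experience from the current policy
            if len(trajectory) == time_horizon:  # assumption: hand over full trajectories only
                buffer.extend(trajectory)
                trajectory.clear()

        # buffer_size acts as a threshold: update once at least that much data is available.
        if len(buffer) >= buffer_size:
            on_policy = all(v == policy_version for v in buffer)
            updates.append((step, len(buffer), on_policy))
            policy_version += 1                  # simulation is paused here, so no stale steps
            buffer.clear()                       # buffer is emptied after every update

    return updates


print(simulate(num_agents=64, time_horizon=600, buffer_size=20480, total_steps=1800))
```

    With the numbers from the first post, this toy model updates every 600 steps using 38400 experiences each time, all generated by the policy that is about to be updated. Under these assumptions the buffer simply overshoots buffer_size rather than cutting trajectories short.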
     
  4. unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    19
    Hello ruoping, thank you for the answers.

    Yeah, I had a few misconceptions about ml-agents when I wrote that post, but I think I understand now.

    Maybe minBufferSize would be a more intuitive name for the buffer.

    My agents *were* actually moving during the policy update, but that was due to having threading enabled ("threaded: true") in my config file. There's already been a pull request disabling threading by default, so I disabled it for my tests as well.

    Oh, and about the WallJump example: it turns out that some of the agents train with the behavior "SmallWallJump" and some with "BigWallJump", so my numbers don't even make sense. Just wanted to put that out there in case somebody sees this thread...
     