
Difference in SAC training between ml-agents 0.13.0, 0.14.1 and 0.15.0

Discussion in 'ML-Agents' started by kiara_ottogalli, Mar 10, 2020.

  1. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    I've been working on a project in which I've used ml-agents from 0.8.0 through 0.13.0, and the agent trained well with PPO, and with SAC since it was released in 0.10.0. In 0.13.0 my agent learned quickly with SAC (in about 300k steps), as can be seen in the image:

    In this case, the reward stabilizes at around 12 by roughly 300k steps, and it is fine for the episode length to decrease, since the agent needs to learn to perform the action as fast as it can, down to about 25 steps.

    For training in 0.14.1 and 0.15.0 I followed the instructions in the migration guide: I multiplied max_steps by the number of agents in my environment and otherwise used the same parameters as in 0.13.0, but the training didn't converge:

    The reward never stabilizes and doesn't reach 4, while the episode length stays between 350 and 500, which is the maximum number of steps for my agent.

    The parameters I used in 0.13.0 were:
    RobotLearning:
        trainer: sac
        batch_size: 128
        buffer_size: 500000
        buffer_init_steps: 0
        hidden_units: 128
        init_entcoef: 1.0
        learning_rate: 0.0003
        learning_rate_schedule: constant
        max_steps: 500000
        memory_size: 256
        normalize: true
        num_update: 1
        train_interval: 1
        num_layers: 2
        time_horizon: 1000
        sequence_length: 64
        summary_freq: 3000
        tau: 0.005
        use_recurrent: false
        vis_encode_type: simple
        reward_signals:
            extrinsic:
                strength: 1.0
                gamma: 0.99

    And for 0.14.1 and 0.15.0 I used max_steps: 12500000 and summary_freq: 75000 (I have 25 agents).
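
    In other words, relative to the 0.13.0 config above, these were the only values I changed (a minimal sketch; everything else stayed the same):

    RobotLearning:
        trainer: sac
        max_steps: 12500000    # 500000 * 25 agents
        summary_freq: 75000    # 3000 * 25 agents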

    I'd like to know if there is another parameter I'm missing.

    Thanks in advance for any help.
     
  2. TreyK-47

    TreyK-47

    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,822
    We'll circulate this for the team to review - could you tell us which version of Python and C# you're using?
     
  3. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    I'm using Python 3.6.10 and .NET Framework 4.8.03752 (I assume C# 7.3).
    Thank you!
     
  4. andrewcoh_unity

    andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    It looks like your buffer_size is equal to your max_steps! You should set your buffer size back to what it was originally and only multiply your max steps by the number of agents in the scene. Let us know if it works!
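
    Concretely, something along these lines (just a sketch, assuming 25 agents; keep everything else as it was):

    RobotLearning:
        buffer_size: 500000     # keep the original buffer size; don't multiply this
        max_steps: 12500000     # original 500000 * 25 agents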
     
  5. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    Hi, thanks for your response.

    Yes, buffer_size is equal to max_steps, but that is the value I originally had for buffer_size in the training configuration that was working in 0.13.0. I'm using the same value in 0.15.0; I multiplied only max_steps and summary_freq.

    Anyway, I'm going to try a lower value for buffer_size to see if it makes it work. I'll let you know.
     
  6. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    I tried with a buffer_size of 50000 and the results were similar in 0.15.0:

    The pink line is the new training. I'm going to try changing other parameters to see if there's a change in convergence.
     
  7. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hi kiara_ottogalli, have you seen any difference in training behavior in PPO as well? It would also be useful to post (for SAC) the comparison of the two runs in Policy Loss, Entropy, and Entropy Coefficient. Thanks!
     
  8. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    Hi ervteng_unity,
    Here are the tensorboard graphics for SAC training with 0.13.0:

    and SAC training with 0.15.0:

    I'm going to do the same for PPO to see if there's any change.
    Thanks in advance for any help!
     
  9. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hi kiara_ottogalli, nothing seems too out of place with these plots - in fact, the 2nd one is likely to keep learning if you let it go longer. But the difference is very stark and a bit puzzling.

    Is the first run repeatable, i.e. on 0.13.0 if you re-train from scratch does it always learn quickly? Sometimes (especially in environments where the reward is fairly sparse/rare), the agent gets lucky in the beginning and can learn, whereas if it gets unlucky it will be stuck exploring for a long while.
     
    Hsgngr likes this.
  10. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    Hi ervteng_unity, sorry for the late response.

    The first run is repeatable on ml-agents 0.13.0. The environment gives rewards every step, so sparse rewards are not a problem. Here are the TensorBoard graphs for 3 repetitions of SAC in 0.13.0:

    As you can see here, it always learns quickly, but in 0.15.0 it doesn't.

    Also, I did some runs for PPO with ml-agents 0.10.1 (red), 0.13.0 (orange) and 0.15.0 (blue):

    In this case, the red and orange lines are similar in most of the graphs, but the blue one has some peaks, and the entropy graph is also very different for 0.15.0.

    If we compare the entropy of SAC on 0.13.0 and 0.15.0, they are different too:

    Maybe a change to beta for PPO or init_entcoef for SAC is needed with the version change?
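
    To be concrete, I mean these trainer-config hyperparameters (the values shown are only placeholders to indicate where the knobs live, not tuned suggestions):

        beta: 5.0e-3          # PPO: entropy regularization strength
        init_entcoef: 1.0     # SAC: initial entropy coefficient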
     
  11. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Thanks for the detailed analysis! We're looking into the differences. By any chance when you're upgrading from 0.13.0 to 0.15.0 are you also upgrading the C# package? There have been some changes to the raycast sensors if you're using those.

    Also, have you tried 0.14.0? If it's not too hard that would help us a lot in narrowing down the problem. Thanks!
     
  12. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    Thanks for the response!

    Yes, with every version I change the ML-Agents folder and install the package from ml-agents/com.unity.ml-agents with Unity's package manager. I also checked the steps to migrate (https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Migrating.md).

    I'm not using raycast sensors, only Vector3 observations.

    I'm gonna try with that version and let you know.
     
  13. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    If you're trying 0.14, I'd use 0.14.1 - I forgot that there were some bugfixes there. Thanks!
     
  14. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    Hi! I used ml-agents 0.14.1 for SAC (blue) and PPO (pink), and here are the results:

    The parameters were:


    PPOLearning:
        batch_size: 2024
        buffer_size: 20240
        max_steps: 12500000
        normalize: true
        num_epoch: 3
        time_horizon: 1000
        summary_freq: 75000
        reward_signals:
            extrinsic:
                strength: 1.0
                gamma: 0.995

    SACLearning:
        batch_size: 128
        buffer_size: 500000
        max_steps: 12500000
        normalize: true
        time_horizon: 1000
        summary_freq: 75000


    With the same parameters it works for me in versions 0.10.1, 0.13.0 and 0.14.1. The only one I can't get to work is 0.15.0 (SAC: orange, PPO: dark blue):
     
  15. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hi kiara_ottogalli, thanks so much for these plots! We did make a large change to the trainer internals in 0.15.0 so we'll take a look. There have also been smaller changes and fixes after that, so it's possible that the latest master version doesn't have these issues.

    Forgot to ask earlier, but what does your action space look like? (discrete/continuous, how many actions?)
     
    Last edited: Mar 24, 2020
  16. kiara_ottogalli

    kiara_ottogalli

    Joined:
    Jan 23, 2017
    Posts:
    9
    Hi ervteng_unity, here are my state and action spaces (for ml-agents 0.14.1):

    As you can see, the action space has 5 continuous actions. My Decision Requester has a decision period of 1, with repeat action enabled and no offset.
    I hope this information helps!
     
  17. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hi kiara_ottogalli, apologies for the delayed response. The info does help narrow it down quite a bit. TBH I haven't been able to replicate the regression with our example environments.

    It looks like your agent's entropy is super high in the failure case. I'm wondering if there is something different about how the agent is exploring. When the agent fails to learn, what is it doing?

    We also released 0.15.1 last week, which had a couple of bugfixes that should help with continuous actions (esp. for PPO) - it might help a bit.