
Question: Getting NaNs for multiple policy stats on TensorBoard

Discussion in 'ML-Agents' started by Luke-Houlihan, Sep 1, 2020.

  1. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Hello, I'm having a strange issue while trying to train a fairly complex simulated quadruped to walk (similar in many ways to the crawler example). Training with the PPO algorithm produces OK results at around 35 million steps, and I was hoping the SAC algorithm would be better or faster, but training collapses at around 1.3M steps and some of the policy stats seem to explode into NaN at 3.2M steps.

    Where can I begin to troubleshoot this?
    Is this a hyper parameter issue?

    Here is the config file I'm using -
    Code (YAML):
    behaviors:
      CanemTreadmillPPO:
        trainer_type: ppo
        hyperparameters:
          batch_size: 2024
          buffer_size: 20240
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 512
          num_layers: 3
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        max_steps: 40000000
        time_horizon: 64
        summary_freq: 20000
        keep_checkpoints: 5

      CanemTreadmillSAC_NOMEM:
        trainer_type: sac
        max_steps: 10000000
        time_horizon: 512
        summary_freq: 20000
        keep_checkpoints: 5
        checkpoint_interval: 500000
        hyperparameters:
          batch_size: 256
          buffer_size: 2000000
          learning_rate: 2e-4
          learning_rate_schedule: constant
          buffer_init_steps: 10000
          tau: 0.005
          steps_per_update: 15
          save_replay_buffer: false
          init_entcoef: 0.2
        network_settings:
          vis_encoder_type: simple
          normalize: true
          hidden_units: 512
          num_layers: 3
        reward_signals:
          extrinsic:
            strength: 1.0
            gamma: 0.99
          curiosity:
            strength: .05
            gamma: 0.995
            encoding_size: 128
            learning_rate: 3e-4
    Thanks
     
  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Some additional info -
    Vector observations - 169
    Vector actions - Discrete - 12 branches x 8 options each

    Version information:
    ml-agents: 0.19.0,
    ml-agents-envs: 0.19.0,
    Communicator API: 1.0.0,
    TensorFlow: 2.3.0
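
    For reference, a minimal sketch (class and helper names are my own placeholders, not the actual project code) of how a 12-branch discrete action space is read through the float[] API used by this ml-agents version:
    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;

    // Placeholder sketch: reading a 12-branch discrete action space through the
    // float[] API of ml-agents 0.19 / com.unity.ml-agents 1.x. Each entry holds
    // the chosen index for one branch.
    public class QuadrupedActionSketch : Agent
    {
        public override void OnActionReceived(float[] vectorAction)
        {
            for (var branch = 0; branch < vectorAction.Length; branch++)
            {
                var choice = Mathf.RoundToInt(vectorAction[branch]); // 0..7 for a branch of size 8
                ApplyJointCommand(branch, choice);
            }
        }

        // Hypothetical helper: map a branch's chosen index onto a joint command.
        void ApplyJointCommand(int joint, int command) { /* project-specific */ }
    }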
     
    Last edited: Sep 1, 2020
  3. christophergoy

    christophergoy

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @Luke-Houlihan,
    This sounds like it may be a bug. Could you try training without curiosity to see if the NaNs still show up? For now, I'm going to log this in our internal tracker.
     
  4. christophergoy

    christophergoy

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @Luke-Houlihan,
    I've logged this issue internally as MLA-1343 and have notified our team of this. Please let us know if you can share your project or more information that you discover. Cheers.
     
  5. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    @christophergoy Thank you for taking a look at this.

    I have run another training iteration with some config changes, and I removed the curiosity reward signal. Results are below.

    [TensorBoard screenshots of the affected policy stats]

    Some stats continue to break down and training stops at that point (still at around 3M steps).

    Here is the training config used in this round -
    Code (YAML):
    CanemTreadmillSAC_NOMEM:
        trainer_type: sac
        hyperparameters:
          batch_size: 1024
          buffer_size: 2000000
          learning_rate: 0.0003
          learning_rate_schedule: constant
          buffer_init_steps: 0
          tau: 0.005
          steps_per_update: 10
          save_replay_buffer: false
          init_entcoef: 0.2
        network_settings:
          vis_encoder_type: simple
          normalize: true
          hidden_units: 256
          num_layers: 3
        reward_signals:
          extrinsic:
            strength: 1.0
            gamma: 0.995
        max_steps: 10000000
        time_horizon: 1000
        summary_freq: 20000
        keep_checkpoints: 5
        checkpoint_interval: 500000
     
    Last edited: Sep 3, 2020
  6. christophergoy

    christophergoy

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @Luke-Houlihan,
    Would you be willing to share your project with us? Even adding us as collaborators on your GitHub repository would work. We don't have a way to reproduce this at the moment.
     
  7. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
  8. christophergoy

    christophergoy

    Joined:
    Sep 16, 2015
    Posts:
    735
  9. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Update: I managed to get the NaNs off the TensorBoard by moving all reward calls out of FixedUpdate() and into OnActionReceived(float[] vectorAction).

    Which narrows the problem down to these candidates -
    • AddReward() used in both FixedUpdate() and OnActionReceived(float[] vectorAction)
    • SetReward() used in both FixedUpdate() and OnActionReceived(float[] vectorAction)
    • SetReward() and/or EndEpisode() used in FixedUpdate()

    Rewarding in either method seems to be OK based on the examples, so my theory is that I'm handling rewards incorrectly, either by rewarding in both places or by doing something in FixedUpdate() that I'm not supposed to.
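
    To illustrate the split described above, here is a minimal sketch (class and helper names are placeholders, not the actual project code) with measurement in FixedUpdate() and all reward/termination calls in OnActionReceived():
    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;

    // Placeholder sketch: physics measurement stays in FixedUpdate(), while
    // AddReward/SetReward/EndEpisode are called only from OnActionReceived().
    public class QuadrupedRewardSketch : Agent
    {
        float forwardProgress; // hypothetical measurement updated by physics code

        void FixedUpdate()
        {
            // Measurements only - no AddReward/SetReward/EndEpisode here.
            forwardProgress = MeasureForwardProgress();
        }

        public override void OnActionReceived(float[] vectorAction)
        {
            // ... apply joint commands from vectorAction ...

            // Reward in the same callback that consumes actions.
            AddReward(forwardProgress);

            if (HasFallenOver())
            {
                SetReward(-1f);
                EndEpisode();
            }
        }

        // Hypothetical helpers standing in for the project's own logic.
        float MeasureForwardProgress() { return 0f; }
        bool HasFallenOver() { return false; }
    }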
     
  10. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    @christophergoy False alarm, it looks to me like the NaNs are (correctly) being caused by an overflowing observation float. The timer variable is incremented by delta time and used as an observation so the agent sees time passing; it needs to be reset back to 0 on every EndEpisode() or it will eventually overflow.

    I feel a little silly for having posted this now, but man, that was a tough nut to crack.
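
    In case it helps anyone else, here is a minimal sketch of the fix, assuming the timer is observed through CollectObservations (field names are placeholders, not the actual project code). Zeroing the accumulator in OnEpisodeBegin(), which runs right after EndEpisode() triggers a reset, keeps the observation bounded:
    Code (CSharp):
    using UnityEngine;
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    // Placeholder sketch of the fix: the timer observation is zeroed whenever a
    // new episode starts, so it can never grow without bound across episodes.
    public class QuadrupedTimerSketch : Agent
    {
        float episodeTimer;

        public override void OnEpisodeBegin()
        {
            episodeTimer = 0f; // called after EndEpisode(), before the next episode runs
        }

        void FixedUpdate()
        {
            episodeTimer += Time.fixedDeltaTime; // lets the agent "see" time passing
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            sensor.AddObservation(episodeTimer);
        }
    }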
     
  11. christophergoy

    christophergoy

    Joined:
    Sep 16, 2015
    Posts:
    735
    Glad you figured it out!