
Summaries cause pauses in training with multiple agents

Discussion in 'ML-Agents' started by Keemo, Apr 23, 2021.

  1. Keemo

    Keemo

    Joined:
    Apr 22, 2014
    Posts:
    31
    Hello everyone,

I'm facing strange behavior while training with multiple agents or multiple environments.
    This problem does not occur when training with a single agent:

    Whenever I start a training run, everything works fine at the beginning.
    However, after the second summary checkpoint is reached, the agents get stuck for a couple of seconds or minutes. Then the training suddenly continues and the output looks like this (in this example I train with 8 environments):
    [Screenshot: training console output with summary lines]

    Some information about the environment:
    I'm using a custom environment where everything is controlled by a counter (almost like in the GridWorld example, but I count the actions taken and set a threshold for the maximum allowed actions/steps).
    A decision is requested manually when a coroutine has finished a web request (the agents are async).
    When the counter reaches the threshold, the agent has failed and EpisodeInterrupted() gets called. If the agent reaches the goal, EndEpisode() gets called.

    Training with multiple agents still works and the result is a working model, but these pauses caused by the summaries waste a lot of processing time. When I disable the summaries everything works fine, but without summaries it's hard to check which training run was the most successful.
     


  2. unity_-DoCqyPS6-iU3A

    unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    26
    Can you please show us your config-file? Also, how many (combined) agents do you have across those 8 environments?
     
  3. celion_unity

    celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    The summary steps should be very fast; they just compute averages on a few arrays of numbers, and then pass the results off to things like TensorboardWriter and ConsoleWriter in stats.py.

    What is the batch size in your configuration? The model is updated after that many steps, so a "hitch" in the training is expected then.

    If you want to dig more into where the time is going, and don't mind modifying the source, you can add some @timed decorators to the write_stats methods of the classes in stats.py, and look at the resulting file in ./results/{$RUN_ID}/run_logs/timers.json
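    A minimal sketch of that change, assuming the timed decorator is importable from mlagents_envs.timers and that write_stats still takes the (category, values, step) arguments in your ML-Agents version:

    Code (Python):
    # In mlagents/trainers/stats.py: decorate the writer methods you want to profile.
    from typing import Dict

    from mlagents_envs.timers import timed  # hierarchical timer utilities shipped with ML-Agents


    class ConsoleWriter(StatsWriter):
        @timed  # time spent in this method is recorded as a node in timers.json
        def write_stats(
            self, category: str, values: Dict[str, StatsSummary], step: int
        ) -> None:
            ...  # existing body unchanged

    After the run, the write_stats entries in ./results/{$RUN_ID}/run_logs/timers.json should show the total time and call count for each decorated method, so you can tell whether the writers themselves are the bottleneck.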
     
  4. Keemo

    Keemo

    Joined:
    Apr 22, 2014
    Posts:
    31
    There is actually just one agent in each environment.

    This is my config file:
    Code (YAML):
    behaviors:
      Align:
        trainer_type: sac
        hyperparameters:
          learning_rate: 0.0003
          learning_rate_schedule: constant
          batch_size: 1024
          buffer_size: 1000000
          buffer_init_steps: 0
          tau: 0.005
          steps_per_update: 1.0
          save_replay_buffer: true
          init_entcoef: 3.0
          reward_signal_steps_per_update: 1.0
        network_settings:
          normalize: false
          hidden_units: 64
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0
            strength: 1.0
        keep_checkpoints: 100
        checkpoint_interval: 100000
        max_steps: 150000000
        time_horizon: 1000
        summary_freq: 10000
        threaded: true
    I think I explained the problem badly.
    As you can see in the picture, the first 1,500 steps take about 80 seconds.
    That timing is correct, since I'm doing a lot of web requests.

    However, once the second summary appears (again after about 80 seconds), everything gets stuck, and then suddenly I am at step 15.5k. This makes no sense, since 1,500 steps take about 80 seconds. Why are there now so many summaries within a second?
     
  5. unity_-DoCqyPS6-iU3A

    unity_-DoCqyPS6-iU3A

    Joined:
    Aug 18, 2018
    Posts:
    26
    Oh, SAC? I have no intuition about that algorithm, and don't have a clue what the problem could be, sorry.