
Bug mlagents-learn crashes at checkpoint interval when LSTM is enabled

Discussion in 'ML-Agents' started by ChillX, Jan 31, 2022.

  1. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    MLAgents version: com.unity.ml-agents@2.1.0-exp.1

    When training with multiple instances and with memory enabled, mlagents-learn crashes at random with the following error message. In this case the crash was at the 4th checkpoint.

    Note: training can continue using --resume and may or may not crash at the next checkpoint.

    Code (CSharp):
    C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\onnx\symbolic_opset9.py:1805: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
      "or define the initial states (h0/c0) as inputs of the model. ")
    [INFO] Exported results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy\Enemy-195100.onnx
    [INFO] Copied results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy\Enemy-195100.onnx to results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy.onnx.
    [INFO] Exported results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player\Player-195100.onnx
    [INFO] Copied results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player\Player-195100.onnx to results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player.onnx.
    Traceback (most recent call last):
      File "C:\Users\username\.conda\envs\UnityML\Scripts\mlagents-learn-script.py", line 33, in <module>
        sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\learn.py", line 250, in main
        run_cli(parse_command_line())
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\learn.py", line 246, in run_cli
        run_training(run_seed, options)
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\learn.py", line 125, in run_training
        tc.start_learning(env_manager)
      File "d:\game\mlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
        return func(*args, **kwargs)
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 198, in start_learning
        raise ex
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 176, in start_learning
        n_steps = self.advance(env_manager)
      File "d:\game\mlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
        return func(*args, **kwargs)
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 234, in advance
        new_step_infos = env_manager.get_steps()
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\env_manager.py", line 124, in get_steps
        new_step_infos = self._step()
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 310, in _step
        raise env_exception
    mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code 3221225477.
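
    For what it's worth, that return code looks like the unsigned form of a Windows NTSTATUS value, which would mean the environment executable itself is dying and the trainer is just reporting that one of the worker processes went away around the time of the export. The UserWarning at the top is emitted by torch's ONNX exporter on every LSTM export here, so it is probably not the crash itself. A quick check of the code:

    Code (Python):
    # The exit code from the last traceback line, interpreted as a Windows NTSTATUS
    # value (negative status codes show up as large unsigned integers).
    # 0xC0000005 is STATUS_ACCESS_VIOLATION, i.e. the environment process crashed
    # instead of shutting down cleanly.
    print(hex(3221225477))  # -> 0xc0000005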
    Hyperparameters used:

    Code (CSharp):
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 10000
      learning_rate: 0.0002
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 1024
      num_layers: 4
      vis_encode_type: simple
      memory:
        sequence_length: 64
        memory_size: 128
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 1
    max_steps: 25000000
    time_horizon: 110
    summary_freq: 10000
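
    Since the log shows the export and copy completing right before the crash, one thing worth doing before --resume is confirming that the last exported .onnx files are intact. A minimal sketch, assuming the onnx Python package is installed (ml-agents itself does not require it) and using the paths from the log above:

    Code (Python):
    import onnx

    # Paths taken from the export log above; adjust to your own results folder.
    for path in [
        r"results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy.onnx",
        r"results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player.onnx",
    ]:
        model = onnx.load(path)          # fails if the file is truncated/corrupted
        onnx.checker.check_model(model)  # validates the graph structure
        print(path, "looks OK")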
     
  2. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    Another note on this: I am running 10 instances of the environment (--num-envs=10).

    Also, it is quite random. Sometimes it will run for over 20 checkpoints without the error, and then on the next run it will happen every three to four checkpoints.

    Also, it is more frequent when max steps is a smaller number, like 100, than when max steps is 300 or 500.