
Bug mlagents-learn crashes at checkpoint interval when LSTM is enabled

Discussion in 'ML-Agents' started by ChillX, Jan 31, 2022.

  1. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    MLAgents version: com.unity.ml-agents@2.1.0-exp.1

    When training with multiple instances and with memory enabled, mlagents-learn crashes at random with the following error message. In this case the crash was at the 4th checkpoint.

    Note: training can continue using --resume and may or may not crash at the next checkpoint.

    Code (CSharp):
    C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\onnx\symbolic_opset9.py:1805: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
      "or define the initial states (h0/c0) as inputs of the model. ")
    [INFO] Exported results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy\Enemy-195100.onnx
    [INFO] Copied results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy\Enemy-195100.onnx to results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy.onnx.
    [INFO] Exported results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player\Player-195100.onnx
    [INFO] Copied results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player\Player-195100.onnx to results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player.onnx.
    Traceback (most recent call last):
      File "C:\Users\username\.conda\envs\UnityML\Scripts\mlagents-learn-script.py", line 33, in <module>
        sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\learn.py", line 250, in main
        run_cli(parse_command_line())
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\learn.py", line 246, in run_cli
        run_training(run_seed, options)
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\learn.py", line 125, in run_training
        tc.start_learning(env_manager)
      File "d:\game\mlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
        return func(*args, **kwargs)
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 198, in start_learning
        raise ex
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 176, in start_learning
        n_steps = self.advance(env_manager)
      File "d:\game\mlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
        return func(*args, **kwargs)
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 234, in advance
        new_step_infos = env_manager.get_steps()
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\env_manager.py", line 124, in get_steps
        new_step_infos = self._step()
      File "d:\game\mlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 310, in _step
        raise env_exception
    mlagents_envs.exception.UnityEnvironmentException: Environment shut down with return code 3221225477.
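
    For what it's worth, that return code looks like the unsigned form of a Windows NTSTATUS value, which would mean the environment executable itself is dying and the trainer is just reporting that one of the worker processes went away around the time of the export. The UserWarning at the top is emitted by torch's ONNX exporter on every LSTM export here, so it is probably not the crash itself. A quick check of the code:

    Code (Python):
    # The exit code from the last traceback line, interpreted as a Windows NTSTATUS
    # value (negative status codes show up as large unsigned integers).
    # 0xC0000005 is STATUS_ACCESS_VIOLATION, i.e. the environment process crashed
    # instead of shutting down cleanly.
    print(hex(3221225477))  # -> 0xc0000005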
    Hyperparameters used:

    Code (CSharp):
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 10000
      learning_rate: 0.0002
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 1024
      num_layers: 4
      vis_encode_type: simple
      memory:
        sequence_length: 64
        memory_size: 128
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 1
    max_steps: 25000000
    time_horizon: 110
    summary_freq: 10000
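
    Since the log shows the export and copy completing right before the crash, one thing worth doing before --resume is confirming that the last exported .onnx files are intact. A minimal sketch, assuming the onnx Python package is installed (ml-agents itself does not require it) and using the paths from the log above:

    Code (Python):
    import onnx

    # Paths taken from the export log above; adjust to your own results folder.
    for path in [
        r"results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Enemy.onnx",
        r"results\Cube_01_P1-2_E1-2_O1_C20_B10K_Sigmoid\Player.onnx",
    ]:
        model = onnx.load(path)          # fails if the file is truncated/corrupted
        onnx.checker.check_model(model)  # validates the graph structure
        print(path, "looks OK")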
     
  2. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    Another note on this: I am running 10 instances of the environment (--num-envs=10).

    Also, it is quite random. Sometimes it will run for over 20 checkpoints without the error, and then on the next run it will happen every three to four checkpoints.

    Also, it is more frequent when max steps is a smaller number, like 100, than when max steps is 300 or 500.