
Bug: Training crashes after multiple days of running, unable to resume or recover.

Discussion in 'ML-Agents' started by al3xj3ns3n, Mar 31, 2023.

  1. al3xj3ns3n

    Joined:
    Sep 10, 2018
    Posts:
    21
    I've been doing some long-term training runs with my AI project to see what sort of behaviors develop over time, but I've been hitting a roadblock that prevents me from running the training for more than a few days. The training progresses fine for a long time, but after multiple days this traceback suddenly shows up in the console.


    Traceback (most recent call last):
    File "D:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
    File "D:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 250, in main
    run_cli(parse_command_line())
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 246, in run_cli
    run_training(run_seed, options)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 125, in run_training
    tc.start_learning(env_manager)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\env_manager.py", line 124, in get_steps
    new_step_infos = self._step()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 298, in _step
    self._queue_steps()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 291, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 430, in _take_step
    step_tuple[0], last_step.worker_id
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 214, in get_action
    self.check_nan_action(run_out.get("action"))
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\policy\policy.py", line 147, in check_nan_action
    raise RuntimeError("Continuous NaN action detected.")
    RuntimeError: Continuous NaN action detected.

    After this error appears, the training cannot be resumed and the .onnx file is no longer usable, preventing any further training. What causes this sort of problem? Do I need to adjust my training parameters?
     
  2. Lion_C

    Joined:
    Oct 13, 2021
    Posts:
    8
    I encountered the same error:

    - The training had run around 160M steps when it crashed.
    - The resulting .onnx file was unusable in inference.
    - Resuming the training would trigger the same error immediately.

     
    Last edited: Jun 1, 2023
  3. Lion_C

    Joined:
    Oct 13, 2021
    Posts:
    8
  4. GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
  5. Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    I agree with @GamerLordMat here, NaNs in the action space are usually caused by non-normalized or incorrectly normalized values in the observation space. If you share your observation code and whether you set
    normalization: true
    in your agent config file, we can see if that's part of the problem.
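
    For reference, here is a rough sketch of what manually scaled observations can look like on the C# side. The class name, arena size, and max speed below are made up for illustration; use bounds that match your own scene. (In recent ML-Agents releases the trainer-config key for automatic normalization is normalize under network_settings, so double-check the exact spelling for your version.)

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Hypothetical agent showing manual observation scaling; not part of the ML-Agents package.
    public class NormalizedObservationAgent : Agent
    {
        // Illustrative bounds -- replace with values that match your scene.
        const float ArenaHalfSize = 20f;
        const float MaxSpeed = 10f;

        Rigidbody rb;

        public override void Initialize()
        {
            rb = GetComponent<Rigidbody>();
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            // Scale positions and velocities into roughly [-1, 1] before observing them,
            // so the policy never sees raw world-space magnitudes.
            // Assumes the agent's parent transform sits at the arena center.
            Vector3 localPos = transform.localPosition / ArenaHalfSize;
            Vector3 vel = rb.velocity / MaxSpeed;

            sensor.AddObservation(Vector3.ClampMagnitude(localPos, 1f));
            sensor.AddObservation(Vector3.ClampMagnitude(vel, 1f));
        }
    }

    If any observation can leave those bounds (teleporting objects, huge physics impulses), clamping like this keeps individual values from blowing up the policy.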
     
  6. EternalMe

    Joined:
    Sep 12, 2014
    Posts:
    183
    I still think it's also a problem on the ML-Agents side, because once the NaNs are generated I cannot recover the learning process. There should be protection against this. For example, physics can glitch and, as GamerLordMat said, an observed object can fall into an abyss, but that should not break everything.
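
    Until something like that exists in the package, one workaround is to sanitize values on the way into the sensor. A rough sketch (the helper below is hypothetical, not an ML-Agents API):

    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Hypothetical helper: replaces NaN/Infinity with 0 and clamps spikes
    // before they ever reach the trainer. Not part of the ML-Agents package.
    public static class SafeObservation
    {
        public static float Sanitize(float value, float limit = 10f)
        {
            if (float.IsNaN(value) || float.IsInfinity(value))
                return 0f;
            return Mathf.Clamp(value, -limit, limit);
        }

        public static void AddSafeObservation(this VectorSensor sensor, Vector3 v, float limit = 10f)
        {
            sensor.AddObservation(new Vector3(
                Sanitize(v.x, limit),
                Sanitize(v.y, limit),
                Sanitize(v.z, limit)));
        }
    }

    Calling sensor.AddSafeObservation(rb.velocity) instead of sensor.AddObservation(rb.velocity) means a glitched physics step feeds a clamped value to the network instead of a NaN.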
     
  7. GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
    Right? Why do I have to write a script that limits the max velocity for all rigidbodies? I use a compute shader to do it (otherwise it can get slow; either recalculate every x seconds, or handle about 100 bodies each frame, modulo the array size), but it still should be standard. NaNs can be detected easily enough, just don't divide by zero; there are not that many pitfalls.
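
    For anyone who wants the same safeguard without the compute shader, a plain CPU version of the idea might look like this (class name and batch size are illustrative; it processes a slice of the rigidbodies each physics step, as described above):

    using UnityEngine;

    // CPU-side sketch: clamp every Rigidbody's velocity in small batches
    // so a physics glitch can't feed huge values into the observations.
    // Names and the batch size are illustrative, not from the ML-Agents package.
    public class RigidbodyVelocityLimiter : MonoBehaviour
    {
        public float maxSpeed = 20f;
        public int batchSize = 100;

        Rigidbody[] bodies;
        int cursor;

        void Start()
        {
            // Cache once; refresh this if rigidbodies are spawned or destroyed at runtime.
            bodies = FindObjectsOfType<Rigidbody>();
        }

        void FixedUpdate()
        {
            if (bodies == null || bodies.Length == 0) return;

            // Process only a slice of the array each physics step ("100 each frame, modulo size").
            for (int i = 0; i < batchSize; i++)
            {
                var rb = bodies[(cursor + i) % bodies.Length];
                if (rb != null && rb.velocity.sqrMagnitude > maxSpeed * maxSpeed)
                    rb.velocity = rb.velocity.normalized * maxSpeed;
            }
            cursor = (cursor + batchSize) % bodies.Length;
        }
    }

    The compute-shader variant is mainly worth it when a scene has thousands of rigidbodies; for a few hundred, a per-frame slice like this is usually cheap enough.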