
Bug: Training crashes after multiple days of running, unable to resume or recover.

Discussion in 'ML-Agents' started by al3xj3ns3n, Mar 31, 2023.

  1. al3xj3ns3n

    Joined:
    Sep 10, 2018
    Posts:
    21
    I've been doing some long-term training runs with my AI project to see what sort of behaviors develop over time, but I've been hitting a roadblock that prevents me from running the training for more than a few days. The training progresses fine for a long time, but after multiple days this traceback suddenly shows up in the console.


    Traceback (most recent call last):
    File "D:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
    File "D:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 250, in main
    run_cli(parse_command_line())
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 246, in run_cli
    run_training(run_seed, options)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 125, in run_training
    tc.start_learning(env_manager)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\env_manager.py", line 124, in get_steps
    new_step_infos = self._step()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 298, in _step
    self._queue_steps()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 291, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 430, in _take_step
    step_tuple[0], last_step.worker_id
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 214, in get_action
    self.check_nan_action(run_out.get("action"))
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\policy\policy.py", line 147, in check_nan_action
    raise RuntimeError("Continuous NaN action detected.")
    RuntimeError: Continuous NaN action detected.

    After this error appears, the training cannot be resumed and the .onnx file is no longer usable, preventing any further training. What causes this sort of problem? Do I need to adjust my training parameters?
     
  2. Lion_C

    Joined:
    Oct 13, 2021
    Posts:
    8
    I encountered the same error:

    - The training had run around 160M steps when it crashed.
    - The resulting .onnx file was unusable in inference.
    - Resuming the training would trigger the same error immediately.

     
    Last edited: Jun 1, 2023
  3. Lion_C

    Joined:
    Oct 13, 2021
    Posts:
    8
  4. GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
  5. Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    I agree with @GamerLordMat here, NaNs in the action space are usually caused by non-normalized or incorrectly normalized values in the observation space. If you share your observation code and whether you set
    normalization: true
    in your agent config file, we can see if that's part of the problem.
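
    For reference, here is a rough sketch of what manually scaled observations can look like on the C# side. The class name, arena size, and max speed below are made up for illustration; use bounds that match your own scene. (In recent ML-Agents releases the trainer-config key for automatic normalization is normalize under network_settings, so double-check the exact spelling for your version.)

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Hypothetical agent showing manual observation scaling; not part of the ML-Agents package.
    public class NormalizedObservationAgent : Agent
    {
        // Illustrative bounds -- replace with values that match your scene.
        const float ArenaHalfSize = 20f;
        const float MaxSpeed = 10f;

        Rigidbody rb;

        public override void Initialize()
        {
            rb = GetComponent<Rigidbody>();
        }

        public override void CollectObservations(VectorSensor sensor)
        {
            // Scale positions and velocities into roughly [-1, 1] before observing them,
            // so the policy never sees raw world-space magnitudes.
            // Assumes the agent's parent transform sits at the arena center.
            Vector3 localPos = transform.localPosition / ArenaHalfSize;
            Vector3 vel = rb.velocity / MaxSpeed;

            sensor.AddObservation(Vector3.ClampMagnitude(localPos, 1f));
            sensor.AddObservation(Vector3.ClampMagnitude(vel, 1f));
        }
    }

    If any observation can leave those bounds (teleporting objects, huge physics impulses), clamping like this keeps individual values from blowing up the policy.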
     
  6. EternalMe

    Joined:
    Sep 12, 2014
    Posts:
    183
    I still think it's also a problem on the ML-Agents side, because once the NaNs are generated I cannot recover the learning process. There should be protection against this. For example, physics can glitch and, as GamerLordMat said, an observed object can fall into an abyss, but that should not break everything.
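
    Until something like that exists in the package, one workaround is to sanitize values on the way into the sensor. A rough sketch (the helper below is hypothetical, not an ML-Agents API):

    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Hypothetical helper: replaces NaN/Infinity with 0 and clamps spikes
    // before they ever reach the trainer. Not part of the ML-Agents package.
    public static class SafeObservation
    {
        public static float Sanitize(float value, float limit = 10f)
        {
            if (float.IsNaN(value) || float.IsInfinity(value))
                return 0f;
            return Mathf.Clamp(value, -limit, limit);
        }

        public static void AddSafeObservation(this VectorSensor sensor, Vector3 v, float limit = 10f)
        {
            sensor.AddObservation(new Vector3(
                Sanitize(v.x, limit),
                Sanitize(v.y, limit),
                Sanitize(v.z, limit)));
        }
    }

    Calling sensor.AddSafeObservation(rb.velocity) instead of sensor.AddObservation(rb.velocity) means a glitched physics step feeds a clamped value to the network instead of a NaN.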
     
  7. GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
    Right? Why do I have to write a script that limits the max velocity for all rigidbodies? I use a compute shader to do it (otherwise it can get slow; either recalculate every x seconds, or handle about 100 bodies each frame, modulo the array size), but it still should be standard. NaNs can be detected easily enough, just don't divide by zero; there are not that many pitfalls.
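
    For anyone who wants the same safeguard without the compute shader, a plain CPU version of the idea might look like this (class name and batch size are illustrative; it processes a slice of the rigidbodies each physics step, as described above):

    using UnityEngine;

    // CPU-side sketch: clamp every Rigidbody's velocity in small batches
    // so a physics glitch can't feed huge values into the observations.
    // Names and the batch size are illustrative, not from the ML-Agents package.
    public class RigidbodyVelocityLimiter : MonoBehaviour
    {
        public float maxSpeed = 20f;
        public int batchSize = 100;

        Rigidbody[] bodies;
        int cursor;

        void Start()
        {
            // Cache once; refresh this if rigidbodies are spawned or destroyed at runtime.
            bodies = FindObjectsOfType<Rigidbody>();
        }

        void FixedUpdate()
        {
            if (bodies == null || bodies.Length == 0) return;

            // Process only a slice of the array each physics step ("100 each frame, modulo size").
            for (int i = 0; i < batchSize; i++)
            {
                var rb = bodies[(cursor + i) % bodies.Length];
                if (rb != null && rb.velocity.sqrMagnitude > maxSpeed * maxSpeed)
                    rb.velocity = rb.velocity.normalized * maxSpeed;
            }
            cursor = (cursor + batchSize) % bodies.Length;
        }
    }

    The compute-shader variant is mainly worth it when a scene has thousands of rigidbodies; for a few hundred, a per-frame slice like this is usually cheap enough.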