
Bug: Training crashes after multiple days of running, unable to resume or recover.

Discussion in 'ML-Agents' started by al3xj3ns3n, Mar 31, 2023.

  1. al3xj3ns3n

     Joined: Sep 10, 2018
     Posts: 21
     I've been doing some long-term training runs with my AI project to see what sort of behaviors develop over time, but I've hit a roadblock that prevents me from running the training for more than a few days. The training progresses fine, but after multiple days this traceback suddenly shows up in the console.


    Traceback (most recent call last):
    File "D:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
    File "D:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 250, in main
    run_cli(parse_command_line())
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 246, in run_cli
    run_training(run_seed, options)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\learn.py", line 125, in run_training
    tc.start_learning(env_manager)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\env_manager.py", line 124, in get_steps
    new_step_infos = self._step()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 298, in _step
    self._queue_steps()
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 291, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 430, in _take_step
    step_tuple[0], last_step.worker_id
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\policy\torch_policy.py", line 214, in get_action
    self.check_nan_action(run_out.get("action"))
    File "D:\Users\Jax\Documents\Work\Projects\Machine Learning\Machine Learning\python-envs\python-envs\ml-env\lib\site-packages\mlagents\trainers\policy\policy.py", line 147, in check_nan_action
    raise RuntimeError("Continuous NaN action detected.")
    RuntimeError: Continuous NaN action detected.

     After this error appears, the training cannot be resumed and the .onnx file is no longer usable, preventing any further training. What causes this sort of problem? Do I need to adjust my training parameters?
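
     (To clarify what I mean by "resumed": relaunching the trainer against the same run ID with the --resume flag; the config path and run ID below are just placeholders.)

     mlagents-learn config/trainer_config.yaml --run-id=LongTrainingRun --resume

     Every attempt like this fails instead of continuing from the last checkpoint.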
     
  2. Lion_C

     Joined: Oct 13, 2021
     Posts: 7
     I encountered the same error:

    - The training had run around 160M steps when it crashed.
    - The resulting .onnx file was unusable in inference.
    - Resuming the training would trigger the same error immediately.

     
    Last edited: Jun 1, 2023
  3. Lion_C

     Joined: Oct 13, 2021
     Posts: 7
  4. GamerLordMat

     Joined: Oct 10, 2019
     Posts: 168
  5. Luke-Houlihan

     Joined: Jun 26, 2007
     Posts: 303
     I agree with @GamerLordMat here; NaNs in the action space are usually caused by non-normalized or incorrectly normalized values in the observation space. If you post your observation code and let us know whether you set normalize: true in your trainer config file, we can see whether that's part of the problem.
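
     If I recall the config schema correctly, that flag sits under network_settings in the trainer configuration YAML; a minimal sketch for illustration, with the behavior name and hyperparameter values as placeholders:

     behaviors:
       MyAgentBehavior:
         trainer_type: ppo
         hyperparameters:
           batch_size: 1024
           buffer_size: 10240
           learning_rate: 3.0e-4
         network_settings:
           normalize: true
           hidden_units: 128
           num_layers: 2
         max_steps: 5.0e8

     normalize defaults to false, so if your observations include unbounded values (world positions, velocities, timers), the network inputs can drift far outside their earlier range over a multi-day run, which is one way to end up with NaN actions like the trace above.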