Recovery of NN File?

Discussion in 'ML-Agents' started by James_Initus, Jun 22, 2020.

  1. James_Initus

    Joined:
    May 26, 2015
    Posts:
    75
    I had an agent training for over 10 hours, then an error occurred, and I do not see an .nn file. Is there a way to recover it?


    Version information:
    ml-agents: 0.17.0,
    ml-agents-envs: 0.17.0,
    Communicator API: 1.0.0,
    TensorFlow: 2.0.0


    Error Information

    2020-06-22 08:16:26 INFO [stats.py:111] EnemyBehavior: Step: 29070000. Time Elapsed: 38175.311 s Mean Reward: 1634.786. Std of Reward: 399.844. Training.
    Process Process-1:

    Traceback (most recent call last):
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\queues.py", line 243, in _feed
    send_bytes(obj)
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 290, in _send_bytes
    nwritten, err = ov.GetOverlappedResult(True)
    BrokenPipeError: [WinError 109] The pipe has been ended
    Traceback (most recent call last):
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
    BrokenPipeError: [WinError 109] The pipe has been ended

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    File "c:\users\james\appdata\local\programs\python\python36\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 151, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
    File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
    EOFError
     
  2. BotAcademy

    Joined:
    May 15, 2020
    Posts:
    32
    You could try starting training again with the --resume argument and the same run-id you used before. Training should resume from the latest checkpoint that was created during your 10-hour run. Let it train for a few seconds and then stop it with Ctrl+C; stopping the run cleanly should make the trainer write out the model file.
     
  3. celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    Correct - the .nn file is only generated at the end of training, but using --resume should allow training to continue from a TensorFlow checkpoint.
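
Before resuming, it can help to confirm that a checkpoint from the crashed run actually exists. The following is a minimal sketch in Python (standard library only), not part of ML-Agents. The run directory `results/my_run_id`, the config path `config.yaml`, and the checkpoint file naming (`...-<step>.ckpt.index`) are assumptions for illustration; check them against your own installation and adjust as needed. Only `--run-id` and `--resume` come from the replies above.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# ASSUMPTION: checkpoints are TensorFlow-style files whose names end in
# "-<step>.ckpt.index". Adjust the pattern if your files are named differently.
CKPT_PATTERN = re.compile(r"-(\d+)\.ckpt\.index$")


def find_latest_checkpoint(run_dir: Path) -> Optional[Tuple[int, Path]]:
    """Return (step, path) for the highest-step checkpoint under run_dir.

    Returns None if the directory is missing or holds no matching files.
    """
    if not run_dir.is_dir():
        return None
    best: Optional[Tuple[int, Path]] = None
    for path in run_dir.rglob("*.ckpt.index"):
        match = CKPT_PATTERN.search(path.name)
        if match is None:
            continue
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best


def resume_command(config_path: str, run_id: str) -> str:
    """Build the resume command line suggested in the thread."""
    return f"mlagents-learn {config_path} --run-id={run_id} --resume"


if __name__ == "__main__":
    # Hypothetical values: replace with your own run-id and config file.
    run_id = "my_run_id"
    found = find_latest_checkpoint(Path("results") / run_id)
    if found is None:
        print("No checkpoint found; there may be nothing to resume from.")
    else:
        step, path = found
        print(f"Latest checkpoint: step {step} at {path}")
        print("Resume with:", resume_command("config.yaml", run_id))
```

If a checkpoint is found, run the printed command, let training continue briefly, and stop it with Ctrl+C as described above.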