Had an agent run for over 10 hours, then an error occurred and I do not see an .nn file. Is there a way to recover this?

Version information:
- ml-agents: 0.17.0
- ml-agents-envs: 0.17.0
- Communicator API: 1.0.0
- TensorFlow: 2.0.0

Error information:

```
2020-06-22 08:16:26 INFO [stats.py:111] EnemyBehavior: Step: 29070000. Time Elapsed: 38175.311 s Mean Reward: 1634.786. Std of Reward: 399.844. Training.
Process Process-1:
Traceback (most recent call last):
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\queues.py", line 243, in _feed
    send_bytes(obj)
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 290, in _send_bytes
    nwritten, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

Traceback (most recent call last):
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\process.py", line 249, in _bootstrap
    self.run()
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "c:\users\james\appdata\local\programs\python\python36\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 151, in worker
    req: EnvironmentRequest = parent_conn.recv()
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "c:\users\james\appdata\local\programs\python\python36\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
EOFError
```
You could try starting training again with the `--resume` argument and the same run-id you used before. Let it train for a few seconds, then stop training with Ctrl+C. Training should resume from the latest checkpoint that was created during your 10-hour run.
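The steps above can be sketched as a pair of shell commands. Note that `config/trainer_config.yaml` and `MyRun` are placeholders: substitute the trainer config file and `--run-id` you used for the original 10-hour run.

```shell
# Restart training from the latest saved checkpoint of the same run.
# "config/trainer_config.yaml" and "MyRun" are placeholders; use the
# config file and --run-id from your original run.
mlagents-learn config/trainer_config.yaml --run-id=MyRun --resume

# Let it train for a few seconds, then press Ctrl+C.
# On a clean shutdown, ML-Agents exports the trained model as an .nn
# file under the models directory for that run-id (e.g. models/MyRun/).
```

Omitting `--resume` (or reusing the run-id with `--force`) would start training from scratch and overwrite the checkpoints, so be sure to include the flag.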
Correct: the .nn file is only generated when training ends, but using `--resume` should allow training to continue from a TensorFlow checkpoint, and stopping cleanly afterwards will export the .nn file.