Question Training fails after a short time

Discussion in 'ML-Agents' started by Jager96, Feb 5, 2022.

  1. Jager96

    Jager96

    Joined:
    Feb 26, 2020
    Posts:
    4
    I am trying to train an AI for a snake-like game, but the training keeps failing after running fine for a few minutes with the following errors:



    C:\ProgramData\Anaconda3\lib\site-packages\mlagents\trainers\torch\networks.py:91: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ..\torch\csrc\utils\tensor_new.cpp:201.)
    enc.update_normalization(torch.as_tensor(vec_input))
    OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
    OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
    [ERROR] UnityEnvironment worker 0: environment raised an unexpected exception.
    Traceback (most recent call last):
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 317, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
    BrokenPipeError: [WinError 109] The pipe has been ended

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\ProgramData\Anaconda3\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 255, in recv
    buf = self._recv_bytes()
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 326, in _recv_bytes
    raise EOFError
    EOFError
    Process Process-1:
    Traceback (most recent call last):
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 317, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
    BrokenPipeError: [WinError 109] The pipe has been ended

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\ProgramData\Anaconda3\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 255, in recv
    buf = self._recv_bytes()
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 326, in _recv_bytes
    raise EOFError
    EOFError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "C:\ProgramData\Anaconda3\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 235, in worker
    _send_response(EnvironmentCommand.ENV_EXITED, ex)
    File "C:\ProgramData\Anaconda3\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 150, in _send_response
    parent_conn.send(EnvironmentResponse(cmd_name, worker_id, payload))
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
    File "C:\ProgramData\Anaconda3\lib\multiprocessing\connection.py", line 285, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
    BrokenPipeError: [WinError 232] The pipe is being closed


    Does anyone have any idea what this means or why it is happening?
     
  2. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    Hope this helps; I'm just taking a few guesses here.

    Python uses multiprocessing as opposed to multithreading for parallelism. Judging by the traceback, subprocess_env_manager in ML-Agents creates a Python wrapper process around each environment instance, and the OpenMP error shows up alongside those worker processes.

    Chances are that if you run without parallel instances and just connect to the Editor, it will work fine.

    "OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program."
    If ML Agents had that bug it probably would not work for anyone.
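One way to look into the OMP hint quoted above is to check whether the Python environment contains more than one copy of an OpenMP runtime. The sketch below is only a diagnostic under assumptions: the list of file-name prefixes is a guess at common runtimes, and the helper name `find_openmp_runtimes` is made up for this example. Under Anaconda it is often the case that two packages each ship their own `libiomp5md.dll`, but the thread does not confirm that this is what happened here.

```python
import os
import sys
from collections import defaultdict

# Assumed, non-exhaustive list of file-name prefixes used by OpenMP runtimes.
OPENMP_PREFIXES = ("libiomp5", "libomp", "libgomp", "vcomp")
# Shared-library extensions on Windows, Linux and macOS.
LIBRARY_MARKERS = (".dll", ".so", ".dylib")


def find_openmp_runtimes(root):
    """Map each OpenMP-looking file name under `root` to every path where it occurs."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()
            is_library = any(marker in lower for marker in LIBRARY_MARKERS)
            if is_library and lower.startswith(OPENMP_PREFIXES):
                found[lower].append(os.path.join(dirpath, name))
    return dict(found)


if __name__ == "__main__":
    # sys.prefix is the root of the active Python (or conda) environment.
    for name, paths in find_openmp_runtimes(sys.prefix).items():
        status = "DUPLICATE" if len(paths) > 1 else "single copy"
        print(f"{name}: {status}")
        for path in paths:
            print("   ", path)
```

If a runtime is reported more than once, a fresh environment with a single consistent install is a safer fix than setting KMP_DUPLICATE_LIB_OK=TRUE, which the hint itself describes as unsafe and unsupported.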

    Maybe something is wrong with the installation. You could start by reinstalling ML-Agents from scratch: create a new environment in Anaconda, install a fresh copy there, and try again.

    Another thing to check is whether the Python-side package versions match the Unity-side package version.
    https://github.com/Unity-Technologies/ml-agents/releases/tag/release_19
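To compare the Python side against the release notes linked above, it helps to print what is actually installed in the active environment. This is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example, and the package list is an assumption about which distributions matter here.

```python
from importlib import metadata

# Distributions assumed to be relevant for an ML-Agents training setup.
PACKAGES = ("mlagents", "mlagents-envs", "torch")


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be checked by hand against the package versions listed for the release that matches the Unity-side ML-Agents package.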
     
  3. Jager96

    Jager96

    Joined:
    Feb 26, 2020
    Posts:
    4

    I think you are right about the multiple instances causing the issues. I have reduced it back down to one and it seems to be working swimmingly. Thanks for your input.
     
  4. ChillX

    ChillX

    Joined:
    Jun 16, 2016
    Posts:
    145
    Having said that, you probably do want multiple instances. Training with a single instance will take a lot longer than training with several, unless you can place multiple copies of the simulation within the same Unity scene, as the examples do.
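For reference, the number of parallel environment instances is chosen on the `mlagents-learn` command line with `--num-envs`, which needs a built executable passed via `--env` (training against the Editor uses a single instance). The sketch below only assembles the argument list so that single and multiple instance runs can be compared side by side. The helper name `build_learn_command` and the paths in the usage part are hypothetical.

```python
import shlex


def build_learn_command(config, run_id, env_path=None, num_envs=1):
    """Assemble an mlagents-learn argument list for one or more environment instances."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    if num_envs > 1 and env_path is None:
        # Several instances need a built executable; the Editor provides only one.
        raise ValueError("num_envs > 1 requires env_path (a built executable)")
    command = ["mlagents-learn", config, f"--run-id={run_id}"]
    if env_path is not None:
        command.append(f"--env={env_path}")
        command.append(f"--num-envs={num_envs}")
    return command


if __name__ == "__main__":
    # Hypothetical config file and build path, for illustration only.
    single = build_learn_command("config/snake.yaml", "snake_single")
    parallel = build_learn_command(
        "config/snake.yaml", "snake_parallel", env_path="Builds/Snake", num_envs=4
    )
    print(shlex.join(single))
    print(shlex.join(parallel))
```

Starting from one instance and raising the count step by step makes it easier to see at which point the crash reappears.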
     
  5. Sab_Rango

    Sab_Rango

    Joined:
    Aug 30, 2019
    Posts:
    121
    In my case the issue came from the Discrete Branches setting under Actions. I set it to 0 and then everything was fixed.
     
  6. unity_D58FE44BC6B0A8650AF2

    unity_D58FE44BC6B0A8650AF2

    Joined:
    Jan 3, 2023
    Posts:
    1
    I'm trying to run mlagents-learn on an Apple MacBook Pro with an M1 Pro chip and I'm facing the issue below. Any ideas?

    (rl) robinroypeter@robinroys-MacBook-Pro config % mlagents-learn ppo/3DBall.yaml --resume --run-id=3


    [Unity logo ASCII-art banner printed by mlagents-learn]

    Version information:
    ml-agents: 0.30.0,
    ml-agents-envs: 0.30.0,
    Communicator API: 1.5.0,
    PyTorch: 1.8.1
    [WARNING] PyTorch checkpoint was saved with a different version of PyTorch. Model may not resume properly.
    [INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    [INFO] Connected to Unity environment with package version 2.3.0-exp.4 and communication version 1.5.0
    [INFO] Connected new brain: 3DBall?team=0
    zsh: illegal hardware instruction mlagents-learn ppo/3DBall.yaml --resume --run-id=3
    (rl) robinroypeter@robinroys-MacBook-Pro config % [ERROR] UnityEnvironment worker 0: environment raised an unexpected exception.
    Traceback (most recent call last):
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/site-packages/mlagents/trainers/subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
    EOFError
    Process Process-1:
    Traceback (most recent call last):
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/site-packages/mlagents/trainers/subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
    EOFError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/site-packages/mlagents/trainers/subprocess_env_manager.py", line 235, in worker
    _send_response(EnvironmentCommand.ENV_EXITED, ex)
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/site-packages/mlagents/trainers/subprocess_env_manager.py", line 150, in _send_response
    parent_conn.send(EnvironmentResponse(cmd_name, worker_id, payload))
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
    File "/Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    /Users/robinroypeter/opt/anaconda3/envs/rl/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
     
  7. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    Have you tried using the same version of PyTorch that the model was saved with, so it doesn't complain that the checkpoint "was saved with a different version of PyTorch" and "may not resume properly"? Have you tried training from scratch, without --resume?
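Before comparing PyTorch versions, it can help to record exactly which interpreter and PyTorch build the environment uses. The sketch below reads the PyTorch version from package metadata rather than importing it, so it still runs if importing PyTorch is what crashes. The helper name `describe_python` is made up for this example. The architecture field is included on the assumption, not confirmed anywhere in this thread, that an "illegal hardware instruction" on Apple silicon can be related to an x86_64 Python running under Rosetta; a native build reports `arm64`.

```python
import platform
import sys
from importlib import metadata


def describe_python():
    """Collect interpreter details and the installed PyTorch version without importing torch."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None
    return {
        "python": platform.python_version(),
        # "arm64" for a native Apple silicon build, "x86_64" for an Intel build.
        "machine": platform.machine(),
        "executable": sys.executable,
        "torch": torch_version,
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```

Posting this output together with the traceback makes it easier for others to tell a version mismatch apart from an installation problem.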
     
  8. hughperkins

    hughperkins

    Joined:
    Dec 3, 2022
    Posts:
    191
    (3DBall trains in a few minutes, so there's no need to resume, I reckon.) (Edit: especially on an M1 Pro. I'm on a non-Pro M2, and it trains in minutes.)