Wall Jump Example Stops Training Abruptly

Discussion in 'ML-Agents' started by ee19b131, May 30, 2020.

  1. ee19b131

    Joined:
    Apr 11, 2020
    Posts:
    2
    Description
    When I run training for the Wall Jump Example in the ml-agents-release1 folder,

    mlagents-learn config/trainer_config.yaml --run-id=WallJump2 --force

    and press the Play button, training starts as usual, but everything stops after about 30 seconds. The agent hangs in midair, the Unity window stops responding, and the Command Prompt prints no further output. During this time a Python process uses about 40% CPU. Pressing Ctrl-C in the Command Prompt unfreezes the Unity window, but the Python process keeps running in the Command Prompt (still at about 40% CPU), and I have to end it from Task Manager. The last line of the CMD output below was printed only after I pressed Ctrl-C.
    Any idea what could be going wrong? I am using ML-Agents Release 1 as downloaded from the GitHub page.

    Versions
    Unity: 2019.3.13f1
    Python: 3.7.7
    ml-agents: 0.16.0,
    ml-agents-envs: 0.16.0,
    Communicator API: 1.0.0,
    TensorFlow: 2.1.0

    CMD output
    mlagents-learn config/trainer_config.yaml --run-id=WallJump2 --force
    2020-05-30 21:29:10.318307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    WARNING:tensorflow:From C:\Users\nihal\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term
    [Unity ASCII-art logo printed by mlagents-learn]
    Version information:
    ml-agents: 0.16.0,
    ml-agents-envs: 0.16.0,
    Communicator API: 1.0.0,
    TensorFlow: 2.1.0
    2020-05-30 21:29:13.318042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    WARNING:tensorflow:From C:\Users\nihal\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term
    2020-05-30 21:29:15 INFO [environment.py:201] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    2020-05-30 21:29:20 INFO [environment.py:111] Connected to Unity environment with package version 1.0.0-preview and communication version 1.0.0
    2020-05-30 21:29:20 INFO [environment.py:342] Connected new brain:
    SmallWallJump?team=0
    2020-05-30 21:29:20.729678: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2020-05-30 21:29:20.739893: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
    2020-05-30 21:29:20.776772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:20.786215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:20.795884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:20.805575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:20.812470: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:20.825955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:20.834143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:20.846617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:20.853374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:21.471889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:21.477498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:21.481310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:21.485303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:29:21 WARNING [stats.py:197] events.out.tfevents.1590851932.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:21 WARNING [stats.py:197] events.out.tfevents.1590851971.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:21 INFO [stats.py:130] Hyperparameters for behavior name WallJump2_SmallWallJump:
    trainer: ppo
    batch_size: 128
    beta: 0.005
    buffer_size: 2048
    epsilon: 0.2
    hidden_units: 256
    lambd: 0.95
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 5e6
    memory_size: 128
    normalize: False
    num_epoch: 3
    num_layers: 2
    time_horizon: 128
    sequence_length: 64
    summary_freq: 20000
    use_recurrent: False
    vis_encode_type: simple
    reward_signals:
    extrinsic:
    strength: 1.0
    gamma: 0.99
    summary_path: WallJump2_SmallWallJump
    model_path: ./models/WallJump2/SmallWallJump
    keep_checkpoints: 5
    2020-05-30 21:29:21.522400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:21.533581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:21.538149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:21.542434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:21.547138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:21.551515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:21.556379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:21.561277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:21.566907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:21.569910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:21.574760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:21.577636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:21.580530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:29:23.056474: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:23 INFO [environment.py:342] Connected new brain:
    BigWallJump?team=0
    2020-05-30 21:29:23 WARNING [env_manager.py:109] Agent manager was not created for behavior id BigWallJump?team=0.
    2020-05-30 21:29:23.572885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:23.582329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:23.587127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:23.591483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:23.596612: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:23.601308: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:23.606158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:23.610582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:23.616124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:23.619079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:23.623984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:23.626724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:23.630220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:29:23 WARNING [stats.py:197] events.out.tfevents.1590851934.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:23 WARNING [stats.py:197] events.out.tfevents.1590851973.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:23 INFO [stats.py:130] Hyperparameters for behavior name WallJump2_BigWallJump:
    trainer: ppo
    batch_size: 128
    beta: 0.005
    buffer_size: 2048
    epsilon: 0.2
    hidden_units: 256
    lambd: 0.95
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 2e7
    memory_size: 128
    normalize: False
    num_epoch: 3
    num_layers: 2
    time_horizon: 128
    sequence_length: 64
    summary_freq: 20000
    use_recurrent: False
    vis_encode_type: simple
    reward_signals:
    extrinsic:
    strength: 1.0
    gamma: 0.99
    summary_path: WallJump2_BigWallJump
    model_path: ./models/WallJump2/BigWallJump
    keep_checkpoints: 5
    2020-05-30 21:29:23.658786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:23.668717: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:23.672988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:23.678131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:23.682779: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:23.687796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:23.692050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:23.697388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:23.702211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:23.705757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:23.710373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:23.713525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:23.717303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:30:12 INFO [subprocess_env_manager.py:191] UnityEnvironment worker 0: environment stopping.
     
  2. TreyK-47

    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,820
    I'll flag this for the team to take a look.
     
  3. ee19b131

    Joined:
    Apr 11, 2020
    Posts:
    2
    So I somehow managed to solve this issue by reinstalling everything. Previously, I had installed CUDA, cuDNN and tensorflow-gpu through

    conda install tensorflow-gpu

    which automatically pulls in matching versions of TensorFlow, CUDA and cuDNN. This setup has worked well for my other deep learning code (such as MNIST digit recognition).
    This time, I first uninstalled Anaconda (which removed all conda-installed packages, including CUDA and cuDNN). Then I installed CUDA and cuDNN manually, following the instructions on the Nvidia website. Then I ran

    conda install tensorflow-gpu
    which downloaded CUDA and cuDNN again, for reasons I don't know, but in exactly the same versions as my manual install. After I then installed the correct mlagents Python package and Unity package, training finally worked.
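    For anyone who wants to compare which CUDA and cuDNN libraries TensorFlow actually loads before and after a reinstall, here is a minimal sketch that pulls the library names out of the "Successfully opened dynamic library" lines in the mlagents-learn output. The function name loaded_cuda_libraries and the sample_log string are only illustrations (the sample lines are copied from the CMD output in the first post), and the sketch assumes the log keeps that exact wording.

```python
import re

# Matches the dso_loader lines TensorFlow prints when it loads a CUDA library.
# Assumption: the message wording is the same as in the CMD output of this thread.
_DLL_LINE = re.compile(r"Successfully opened dynamic library (\S+)")


def loaded_cuda_libraries(log_text):
    """Return the distinct library names from dso_loader lines, in first-seen order."""
    seen = []
    for match in _DLL_LINE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# A few lines copied from the log in the first post (hypothetical sample input).
sample_log = """\
2020-05-30 21:29:20.786215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-30 21:29:20.795884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:20.846617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-30 21:29:21.533581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
"""

libraries = loaded_cuda_libraries(sample_log)
print(libraries)  # ['cudart64_101.dll', 'cublas64_10.dll', 'cudnn64_7.dll']
```

    The numeric suffixes in these file names appear to encode the library versions (for example, cudart64_101.dll for the CUDA 10.1 runtime and cudnn64_7.dll for cuDNN 7), so the same list from two installs gives a quick way to check whether they really load the same versions. It does not show which directory each DLL was loaded from.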
     
  4. TreyK-47

    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,820
    Thanks for the update! Happy to hear you got it working!
     