Wall Jump Example Stops Training Abruptly

Discussion in 'ML-Agents' started by ee19b131, May 30, 2020.

  1. ee19b131

    Joined: Apr 11, 2020
    Posts: 2
    Description
    When I run training for the Wall Jump Example in the ml-agents-release1 folder,

    mlagents-learn config/trainer_config.yaml --run-id=WallJump2 --force

    and press the Play button, training starts as usual, but everything stops after about 30 seconds. The agent hangs in midair, the Unity window stops responding, and the Command Prompt prints no further output. During this time a Python process uses about 40% CPU. Pressing Ctrl-C in the Command Prompt unfreezes the Unity window, but the Python process keeps running (still at about 40% CPU). The last line of the CMD output below appeared only after I pressed Ctrl-C. I have to end the process from Task Manager to stop it.
    Any idea what could be going wrong? I am using ML-Agents Release 1, downloaded from the GitHub page.

    Versions
    Unity: 2019.3.13f1
    Python: 3.7.7
    ml-agents: 0.16.0,
    ml-agents-envs: 0.16.0,
    Communicator API: 1.0.0,
    TensorFlow: 2.1.0

    CMD output
    mlagents-learn config/trainer_config.yaml --run-id=WallJump2 --force
    2020-05-30 21:29:10.318307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    WARNING:tensorflow:From C:\Users\nihal\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term
    [Unity logo ASCII art]
    Version information:
    ml-agents: 0.16.0,
    ml-agents-envs: 0.16.0,
    Communicator API: 1.0.0,
    TensorFlow: 2.1.0
    2020-05-30 21:29:13.318042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    WARNING:tensorflow:From C:\Users\nihal\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    Instructions for updating:
    non-resource variables are not supported in the long term
    2020-05-30 21:29:15 INFO [environment.py:201] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    2020-05-30 21:29:20 INFO [environment.py:111] Connected to Unity environment with package version 1.0.0-preview and communication version 1.0.0
    2020-05-30 21:29:20 INFO [environment.py:342] Connected new brain:
    SmallWallJump?team=0
    2020-05-30 21:29:20.729678: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2020-05-30 21:29:20.739893: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
    2020-05-30 21:29:20.776772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:20.786215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:20.795884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:20.805575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:20.812470: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:20.825955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:20.834143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:20.846617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:20.853374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:21.471889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:21.477498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:21.481310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:21.485303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:29:21 WARNING [stats.py:197] events.out.tfevents.1590851932.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:21 WARNING [stats.py:197] events.out.tfevents.1590851971.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:21 INFO [stats.py:130] Hyperparameters for behavior name WallJump2_SmallWallJump:
    trainer: ppo
    batch_size: 128
    beta: 0.005
    buffer_size: 2048
    epsilon: 0.2
    hidden_units: 256
    lambd: 0.95
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 5e6
    memory_size: 128
    normalize: False
    num_epoch: 3
    num_layers: 2
    time_horizon: 128
    sequence_length: 64
    summary_freq: 20000
    use_recurrent: False
    vis_encode_type: simple
    reward_signals:
    extrinsic:
    strength: 1.0
    gamma: 0.99
    summary_path: WallJump2_SmallWallJump
    model_path: ./models/WallJump2/SmallWallJump
    keep_checkpoints: 5
    2020-05-30 21:29:21.522400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:21.533581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:21.538149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:21.542434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:21.547138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:21.551515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:21.556379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:21.561277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:21.566907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:21.569910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:21.574760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:21.577636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:21.580530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:29:23.056474: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:23 INFO [environment.py:342] Connected new brain:
    BigWallJump?team=0
    2020-05-30 21:29:23 WARNING [env_manager.py:109] Agent manager was not created for behavior id BigWallJump?team=0.
    2020-05-30 21:29:23.572885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:23.582329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:23.587127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:23.591483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:23.596612: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:23.601308: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:23.606158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:23.610582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:23.616124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:23.619079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:23.623984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:23.626724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:23.630220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:29:23 WARNING [stats.py:197] events.out.tfevents.1590851934.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:23 WARNING [stats.py:197] events.out.tfevents.1590851973.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
    2020-05-30 21:29:23 INFO [stats.py:130] Hyperparameters for behavior name WallJump2_BigWallJump:
    trainer: ppo
    batch_size: 128
    beta: 0.005
    buffer_size: 2048
    epsilon: 0.2
    hidden_units: 256
    lambd: 0.95
    learning_rate: 0.0003
    learning_rate_schedule: linear
    max_steps: 2e7
    memory_size: 128
    normalize: False
    num_epoch: 3
    num_layers: 2
    time_horizon: 128
    sequence_length: 64
    summary_freq: 20000
    use_recurrent: False
    vis_encode_type: simple
    reward_signals:
    extrinsic:
    strength: 1.0
    gamma: 0.99
    summary_path: WallJump2_BigWallJump
    model_path: ./models/WallJump2/BigWallJump
    keep_checkpoints: 5
    2020-05-30 21:29:23.658786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
    coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
    2020-05-30 21:29:23.668717: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    2020-05-30 21:29:23.672988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
    2020-05-30 21:29:23.678131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
    2020-05-30 21:29:23.682779: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
    2020-05-30 21:29:23.687796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
    2020-05-30 21:29:23.692050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
    2020-05-30 21:29:23.697388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
    2020-05-30 21:29:23.702211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
    2020-05-30 21:29:23.705757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-05-30 21:29:23.710373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
    2020-05-30 21:29:23.713525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
    2020-05-30 21:29:23.717303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    2020-05-30 21:30:12 INFO [subprocess_env_manager.py:191] UnityEnvironment worker 0: environment stopping.
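One way to see where a log like the one above went quiet is to look for the longest gap between consecutive timestamps. The following is only a minimal sketch: the function name `largest_gap` is hypothetical, the sample lines are copied from the output above, and fractional seconds are ignored. Note that the final line was printed only after Ctrl-C, so the gap bounds how long the run sat silent rather than pinpointing the cause.

```python
import re
from datetime import datetime

# Timestamp at the start of a log line, e.g. "2020-05-30 21:29:23".
# Fractional seconds (present on TensorFlow lines) are deliberately ignored.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def largest_gap(log_lines):
    """Return (gap_seconds, line_before, line_after) for the longest silence."""
    stamped = []
    for line in log_lines:
        text = line.strip()
        match = TIMESTAMP.match(text)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamped.append((when, text))

    best = (0.0, None, None)
    # Compare each timestamped line with the next timestamped line.
    for (t0, line0), (t1, line1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, line0, line1)
    return best


# Lines copied verbatim from the CMD output above.
sample_lines = [
    "2020-05-30 21:29:15 INFO [environment.py:201] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.",
    "2020-05-30 21:29:20 INFO [environment.py:111] Connected to Unity environment with package version 1.0.0-preview and communication version 1.0.0",
    "2020-05-30 21:29:23.713525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N",
    "2020-05-30 21:30:12 INFO [subprocess_env_manager.py:191] UnityEnvironment worker 0: environment stopping.",
]

gap_seconds, before, after = largest_gap(sample_lines)
print(f"Longest silence: {gap_seconds:.0f} s")
print(f"  before: {before}")
print(f"  after:  {after}")
```

Run against the full log, this would point at the stretch between the last TensorFlow device message and the "environment stopping" line, which is where the hang described above occurred.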
     
  2. TreyK-47

    Unity Technologies

    Joined: Oct 22, 2019
    Posts: 508
    I'll flag this for the team to take a look.
     
  3. ee19b131

    Joined: Apr 11, 2020
    Posts: 2
    I somehow managed to solve this issue by reinstalling everything. Previously, I had installed CUDA, cuDNN and tensorflow-gpu with

    conda install tensorflow-gpu

    which automatically pulls in matching versions of TensorFlow, CUDA and cuDNN. This setup has worked well for my other deep learning code (such as MNIST digit recognition).
    This time, I first uninstalled Anaconda (which removed all conda-installed packages, including CUDA and cuDNN). Then I installed CUDA and cuDNN manually, following the instructions on the Nvidia website. Then I ran

    conda install tensorflow-gpu
    which again downloads CUDA and cuDNN, for a reason I don't know, but the versions are exactly the same as in my manual CUDA and cuDNN install. After I then installed the correct mlagents Python package and Unity package, training finally worked.
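When comparing installs like this, it can help to list which CUDA libraries TensorFlow actually loaded, since the `dso_loader` lines in the training log name each DLL. The sketch below is an illustration only: the helper `loaded_cuda_libraries` is hypothetical, the sample lines are copied from the log in the first post, and reading the numeric suffix as a version (for example `101` as CUDA 10.1) is an assumption about the DLL naming rather than something the log states.

```python
import re

# Matches TensorFlow dso_loader lines such as:
#   "... Successfully opened dynamic library cudart64_101.dll"
# and captures the library name and its numeric suffix.
DLL_PATTERN = re.compile(
    r"Successfully opened dynamic library (\w+?)64_(\d+)\.dll"
)


def loaded_cuda_libraries(log_text):
    """Return {library_name: numeric_suffix} for each matching DLL in the log."""
    return {name: suffix for name, suffix in DLL_PATTERN.findall(log_text)}


# Lines copied verbatim from the CMD output in the first post.
sample_log = """\
2020-05-30 21:29:20.786215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-30 21:29:20.795884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:20.846617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
"""

libraries = loaded_cuda_libraries(sample_log)
for name, suffix in sorted(libraries.items()):
    print(f"{name}: {suffix}")
```

Running the same function on a log from before and after a reinstall would show whether the loaded DLL names changed, even when the nominal package versions look identical.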
     
  4. TreyK-47

    Unity Technologies

    Joined: Oct 22, 2019
    Posts: 508
    Thanks for the update! Happy to hear you got it working!
     