Question Problems with versions

Discussion in 'ML-Agents' started by eseller, Apr 23, 2021.

  1. eseller

    Joined: Nov 30, 2019
    Posts: 3

    Hi, I need help. I am developing a shoot 'em up game AI with ML-Agents. I carefully followed the Unity Learn tutorial (https://learn.unity.com/course/ml-agents-hummingbirds?uv=2019.3) and am now developing my own agents. The Unity ML-Agents package is version 1.0.7, while the Python package installed automatically (pip install mlagents) is version 0.25.0.

    I tried to use the self-play feature, but it doesn't work: I set up the two teams and adjusted my trainer_config.yaml, but when I start training I get no self-play information and no ELO details in TensorBoard. Also, when I try to use the ONNX file, Unity reports the error "onnx import exception: unexpected error while parsing layer 38 of type Gemm".

    I suspect the problem is a version incompatibility, because I get a warning when I start training. Following this post (https://github.com/Unity-Technologies/ml-agents/issues/4710), I tried to downgrade mlagents on the Python side (pip install mlagents==0.20.0), but now I get the error "mlagents.trainers.exception.TrainerConfigError: The option network_settings was specified in your YAML file for RewardSignalSettings, but is invalid."

    Now I suspect the problem is that my YAML file uses the new format (I read something about this in the lesson comments here: https://learn.unity.com/tutorial/trainer-config-yaml?uv=2019.3), but I have no idea how to go back to the old format while keeping self-play, or which preview version of ML-Agents to install in Unity to support the new format without running into other problems. My trainer_config.yaml file is attached below:

    Code (YAML):
    1. behaviors:
    2.   EnemyWeapon:
    3.     trainer_type: ppo
    4.     hyperparameters:
    5.       batch_size: 2048
    6.       buffer_size: 20480
    7.       learning_rate: 0.0003
    8.       beta: 0.005
    9.       epsilon: 0.2
    10.       lambd: 0.95
    11.       num_epoch: 3
    12.       learning_rate_schedule: linear
    13.     network_settings:
    14.       normalize: false
    15.       hidden_units: 1024
    16.       num_layers: 4
    17.       vis_encode_type: simple
    18.     reward_signals:
    19.       extrinsic:
    20.         gamma: 0.99
    21.         strength: 1.0
    22.         network_settings:
    23.           normalize: false
    24.           hidden_units: 128
    25.           num_layers: 2
    26.           vis_encode_type: simple
    27.     keep_checkpoints: 5
    28.     checkpoint_interval: 500000
    29.     max_steps: 50000000
    30.     time_horizon: 128
    31.     summary_freq: 10000
    32.     threaded: true
    33.     self_play:
    34.       save_steps: 10000
    35.       team_change: 10000
    36.       swap_steps: 10000
    37.       window: 20
    38.       play_against_latest_model_ratio: 0.5
    39.       initial_elo: 1200.0
    Does anyone have any suggestions for me on this?
    (Sorry for my English)
     
  2. celion_unity

    Joined: Jun 12, 2019
    Posts: 289

    Hi @eseller -
    No problem, your English is great :)

    You should just need to remove the "network_settings:" section underneath the "reward_signals:" section (lines 22-26 in the file you pasted above). I tried this with version 0.20.0 and it can at least start training.

    In case you're wondering what's going on: in a recent version, we added the option to customize the network for the reward signals separately from the "main" network. Older versions don't have this option, so the new config isn't valid for them.
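
For anyone who wants to apply the same fix programmatically, here is a minimal sketch that removes the per-reward-signal "network_settings" entry from a trainer config held as a plain Python dict. The function name strip_reward_network_settings and the abbreviated config are illustrative assumptions, not part of ML-Agents; in practice the dict would come from loading trainer_config.yaml with a YAML library.

```python
import copy


def strip_reward_network_settings(config):
    """Return a copy of a trainer config with `network_settings` removed
    from every reward signal. The behavior-level `network_settings`
    (the "main" network) is left untouched."""
    cleaned = copy.deepcopy(config)
    for behavior in cleaned.get("behaviors", {}).values():
        for signal in (behavior.get("reward_signals") or {}).values():
            if isinstance(signal, dict):
                signal.pop("network_settings", None)
    return cleaned


# Abbreviated, hypothetical stand-in for the config posted above.
config = {
    "behaviors": {
        "EnemyWeapon": {
            "trainer_type": "ppo",
            "network_settings": {"hidden_units": 1024, "num_layers": 4},
            "reward_signals": {
                "extrinsic": {
                    "gamma": 0.99,
                    "strength": 1.0,
                    "network_settings": {"hidden_units": 128, "num_layers": 2},
                }
            },
        }
    }
}

cleaned = strip_reward_network_settings(config)
print(cleaned["behaviors"]["EnemyWeapon"]["reward_signals"]["extrinsic"])
# prints {'gamma': 0.99, 'strength': 1.0}
```

The original dict is not modified, so the newer-format config can be kept around for use with a newer Python package.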
     
  3. eseller

    Joined: Nov 30, 2019
    Posts: 3

    Thank you very much, it seems to work.
    However, when I try to resume a stopped training run (mlagents-learn ./trainer_config.yaml --run-id mibo_05 --resume), I get this error:


    Code (CSharp):
    1. 2021-04-28 23:07:36.593726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
    2. 2021-04-28 23:07:36.594232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
    3. 2021-04-28 23:07:38 INFO [tf_model_saver.py:105] Loading model from results\mibo_05\EnemyWeapon.
    4. 2021-04-28 23:07:48 INFO [model_serialization.py:205] List of nodes to export for behavior :EnemyWeapon
    5. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   is_continuous_control
    6. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   trainer_major_version
    7. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   trainer_minor_version
    8. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   trainer_patch_version
    9. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   version_number
    10. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   memory_size
    11. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   action_output_shape
    12. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   action
    13. 2021-04-28 23:07:48 INFO [model_serialization.py:207]   action_probs
    14. Converting results\mibo_05\EnemyWeapon/frozen_graph_def.pb to results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn
    15. IGNORED: Shape unknown layer
    16. IGNORED: StopGradient unknown layer
    17. GLOBALS: 'is_continuous_control', 'trainer_major_version', 'trainer_minor_version', 'trainer_patch_version', 'version_number', 'memory_size', 'action_output_shape'
    18. IN: 'visual_observation_0': [-1, 84, 84, 3] => 'policy/main_graph_0_encoder0/conv_1/BiasAdd'
    19. IN: 'vector_observation': [-1, 1, 1, 59] => 'policy/main_graph_0/hidden_0/BiasAdd'
    20. OUT: 'policy/concat/concat', 'action', 'action_probs'
    21. DONE: wrote results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn file.
    22. 2021-04-28 23:07:54 INFO [model_serialization.py:87] Exported results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn
    23. 2021-04-28 23:07:54 INFO [tf_model_saver.py:163] Copied results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn to results\mibo_05\EnemyWeapon.nn.
    24. 2021-04-28 23:07:54 INFO [trainer_controller.py:84] Saved Model
    25. Traceback (most recent call last):
    26.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run
    27.     subfeed, allow_tensor=True, allow_operation=False)
    28.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\framework\ops.py", line 3670, in as_graph_element
    29.     return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
    30.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\framework\ops.py", line 3712, in _as_graph_element_locked
    31.     "graph." % (repr(name), repr(op_name)))
    32. KeyError: "The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph."
    33.  
    34. During handling of the above exception, another exception occurred:
    35.  
    36. Traceback (most recent call last):
    37.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\runpy.py", line 193, in _run_module_as_main
    38.     "__main__", mod_spec)
    39.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\runpy.py", line 85, in _run_code
    40.     exec(code, run_globals)
    41.   File "D:\ProgramData\Anaconda3\envs\ml-agents-1.0.7\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
    42.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\learn.py", line 277, in main
    43.     run_cli(parse_command_line())
    44.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\learn.py", line 273, in run_cli
    45.     run_training(run_seed, options)
    46.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\learn.py", line 150, in run_training
    47.     tc.start_learning(env_manager)
    48.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    49.     return func(*args, **kwargs)
    50.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 172, in start_learning
    51.     self._reset_env(env_manager)
    52.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    53.     return func(*args, **kwargs)
    54.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 110, in _reset_env
    55.     self._register_new_behaviors(env_manager, env_manager.first_step_infos)
    56.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 266, in _register_new_behaviors
    57.     self._create_trainers_and_managers(env_manager, new_behavior_ids)
    58.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 164, in _create_trainers_and_managers
    59.     self._create_trainer_and_manager(env_manager, behavior_id)
    60.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 138, in _create_trainer_and_manager
    61.     parsed_behavior_id, env_manager.training_behaviors[name_behavior_id]
    62.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\ghost\trainer.py", line 322, in create_policy
    63.     self.trainer.model_saver.initialize_or_load(policy)
    64.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\model_saver\tf_model_saver.py", line 94, in initialize_or_load
    65.     self._load_graph(policy, self.model_path, reset_global_steps=reset_steps)
    66.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\model_saver\tf_model_saver.py", line 116, in _load_graph
    67.     self.tf_saver.restore(policy.sess, ckpt.model_checkpoint_path)
    68.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\training\saver.py", line 1299, in restore
    69.     {self.saver_def.filename_tensor_name: save_path})
    70.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\client\session.py", line 958, in run
    71.     run_metadata_ptr)
    72.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
    73.     e.args[0])
    74. TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph.
    Do you have any suggestions?
     
  4. celion_unity

    Joined: Jun 12, 2019
    Posts: 289

    Which version of the Python library were you using? Was the checkpoint generated with the same version? (There's a check for this, but I don't see the warning in your logs.) Did you change the config file between the first and second training runs? Does training work without --resume?
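
One way to answer the version question from inside the active environment is sketched below. It uses only the standard library (importlib.metadata, Python 3.8 or later); the helper names installed_version and mismatch_level are hypothetical, and the comparison assumes purely numeric dotted versions such as "0.20.0". This is not the check built into ML-Agents, only an illustration of comparing two version strings.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it
    is not installed in the current environment."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mismatch_level(version_a, version_b):
    """Name the first level ("major", "minor" or "patch") at which two
    numeric dotted versions differ, or return None if they match.
    Missing components are treated as 0, so "1.0" equals "1.0.0"."""

    def parts(version):
        numbers = [int(p) for p in version.split(".")]
        return numbers + [0] * (3 - len(numbers))

    for name, a, b in zip(("major", "minor", "patch"),
                          parts(version_a), parts(version_b)):
        if a != b:
            return name
    return None


# Report what is installed in the active environment.
for package in ("mlagents", "mlagents-envs"):
    print(package, installed_version(package))

# Compare the two Python package versions mentioned in this thread.
print(mismatch_level("0.25.0", "0.20.0"))  # prints minor
```

If the version that wrote the checkpoint and the version now installed differ at any level, retraining or reinstalling the original version is the safer assumption until compatibility is confirmed.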
     
  5. eseller

    Joined: Nov 30, 2019
    Posts: 3

    I'm using 0.20.0. I trained the model for hours, then stopped.
    Then I tried to resume and it gave me this error.
    The first training run went well, I think. I didn't see any errors other than "cudart64_101.dll not found".

    I did not change the trainer config between the first run and the "resume" run:
    Code (YAML):
    behaviors:
      EnemyWeapon:
        trainer_type: ppo
        hyperparameters:
          batch_size: 2048
          buffer_size: 20480
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: false
          hidden_units: 1024
          num_layers: 4
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
        keep_checkpoints: 5
        checkpoint_interval: 500000
        max_steps: 50000000
        time_horizon: 128
        summary_freq: 10000
        threaded: true
        self_play:
          save_steps: 40000
          team_change: 40000
          swap_steps: 10000
          window: 10
          play_against_latest_model_ratio: 0.5
          initial_elo: 1200.0

    This is the full console log (from the very beginning) of the --resume run:
    Code (CSharp):
    1. (ml-agents-1.0.7) D:\Users\emanu\Workspace\GAMING_MIBO\TrainerConfig>mlagents-learn ./trainer_config.yaml --run-id mibo_05 --resume
    2. 2021-04-30 14:53:29.774478: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
    3. 2021-04-30 14:53:29.774603: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    4. WARNING:tensorflow:From d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    5. Instructions for updating:
    6. non-resource variables are not supported in the long term
    7.  
    8.  
    9.                         ▄▄▄▓▓▓▓
    10.                    ╓▓▓▓▓▓▓█▓▓▓▓▓
    11.               ,▄▄▄m▀▀▀'  ,▓▓▓▀▓▓▄                           ▓▓▓  ▓▓▌
    12.            ▄▓▓▓▀'      ▄▓▓▀  ▓▓▓      ▄▄     ▄▄ ,▄▄ ▄▄▄▄   ,▄▄ ▄▓▓▌▄ ▄▄▄    ,▄▄
    13.           ▄▓▓▓▀        ▄▓▓▀   ▐▓▓▌     ▓▓▌   ▐▓▓ ▐▓▓▓▀▀▀▓▓▌ ▓▓▓ ▀▓▓▌▀ ^▓▓▌  ╒▓▓▌
    14.         ▄▓▓▓▓▓▄▄▄▄▄▄▄▄▓▓▓      ▓▀      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌   ▐▓▓▄ ▓▓▌
    15.         ▀▓▓▓▓▀▀▀▀▀▀▀▀▀▀▓▓▄     ▓▓      ▓▓▌   ▐▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▌    ▐▓▓▐▓▓
    16.           ^█▓▓▓        ▀▓▓▄   ▐▓▓▌     ▓▓▓▓▄▓▓▓▓ ▐▓▓    ▓▓▓ ▓▓▓  ▓▓▓▄    ▓▓▓▓`
    17.             '▀▓▓▓▄      ^▓▓▓  ▓▓▓       └▀▀▀▀ ▀▀ ^▀▀    `▀▀ `▀▀   '▀▀    ▐▓▓▌
    18.                ▀▀▀▀▓▄▄▄   ▓▓▓▓▓▓,                                      ▓▓▓▓▀
    19.                    `▀█▓▓▓▓▓▓▓▓▓▌
    20.                         ¬`▀▀▀█▓
    21.  
    22.  
    23. Version information:
    24.   ml-agents: 0.20.0,
    25.   ml-agents-envs: 0.20.0,
    26.   Communicator API: 1.1.0,
    27.   TensorFlow: 2.3.2
    28. 2021-04-30 14:53:32 INFO [learn.py:272] run_seed set to 302
    29. 2021-04-30 14:53:33.403792: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
    30. 2021-04-30 14:53:33.404056: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    31. WARNING:tensorflow:From d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
    32. Instructions for updating:
    33. non-resource variables are not supported in the long term
    34. 2021-04-30 14:53:35 INFO [environment.py:203] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    35. 2021-04-30 14:53:44 WARNING [environment.py:103] WARNING: The communication API versions between Unity and python differ at the minor version level. Python API: 1.1.0, Unity API: 1.0.
    36. This means that some features may not work unless you upgrade the package with the lower version.Please find the versions that work best together from our release page.
    37. https://github.com/Unity-Technologies/ml-agents/releases
    38. 2021-04-30 14:53:45 INFO [environment.py:269] Connected new brain:
    39. EnemyWeapon?team=1
    40. 2021-04-30 14:53:45 INFO [environment.py:269] Connected new brain:
    41. EnemyWeapon?team=0
    42. 2021-04-30 14:53:45.348038: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
    43. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    44. 2021-04-30 14:53:45.742543: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15b23531030 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    45. 2021-04-30 14:53:45.742901: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    46. 2021-04-30 14:53:45.855084: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
    47. 2021-04-30 14:53:45.963813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
    48. pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
    49. coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
    50. 2021-04-30 14:53:45.965270: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
    51. 2021-04-30 14:53:45.966482: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
    52. 2021-04-30 14:53:45.967527: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
    53. 2021-04-30 14:53:45.968672: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
    54. 2021-04-30 14:53:45.969737: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
    55. 2021-04-30 14:53:45.970774: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
    56. 2021-04-30 14:53:45.971778: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
    57. 2021-04-30 14:53:45.971903: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    58. Skipping registering GPU devices...
    59. 2021-04-30 14:53:46.171540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
    60. 2021-04-30 14:53:46.171697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
    61. 2021-04-30 14:53:46.173253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
    62. 2021-04-30 14:53:46.232054: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15b2faacef0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
    63. 2021-04-30 14:53:46.232263: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2060, Compute Capability 7.5
    64. 2021-04-30 14:53:47 INFO [stats.py:126] Hyperparameters for behavior name EnemyWeapon:
    65.         trainer_type:   ppo
    66.         hyperparameters:
    67.           batch_size:   2048
    68.           buffer_size:  20480
    69.           learning_rate:        0.0003
    70.           beta: 0.005
    71.           epsilon:      0.2
    72.           lambd:        0.95
    73.           num_epoch:    3
    74.           learning_rate_schedule:       linear
    75.         network_settings:
    76.           normalize:    False
    77.           hidden_units: 1024
    78.           num_layers:   4
    79.           vis_encode_type:      simple
    80.           memory:       None
    81.         reward_signals:
    82.           extrinsic:
    83.             gamma:      0.99
    84.             strength:   1.0
    85.         init_path:      None
    86.         keep_checkpoints:       5
    87.         checkpoint_interval:    500000
    88.         max_steps:      50000000
    89.         time_horizon:   128
    90.         summary_freq:   10000
    91.         threaded:       True
    92.         self_play:
    93.           save_steps:   40000
    94.           team_change:  40000
    95.           swap_steps:   10000
    96.           window:       10
    97.           play_against_latest_model_ratio:      0.5
    98.           initial_elo:  1200.0
    99.         behavioral_cloning:     None
    100.         framework:      tensorflow
    101. 2021-04-30 14:54:07.259097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
    102. pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
    103. coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
    104. 2021-04-30 14:54:07.260736: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
    105. 2021-04-30 14:54:07.264512: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
    106. 2021-04-30 14:54:07.266975: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
    107. 2021-04-30 14:54:07.268201: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
    108. 2021-04-30 14:54:07.270578: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
    109. 2021-04-30 14:54:07.271695: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
    110. 2021-04-30 14:54:07.272759: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
    111. 2021-04-30 14:54:07.273206: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    112. Skipping registering GPU devices...
    113. 2021-04-30 14:54:07.274157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
    114. 2021-04-30 14:54:07.274707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
    115. 2021-04-30 14:54:07.275258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
    116. 2021-04-30 14:54:08 INFO [tf_model_saver.py:105] Loading model from results\mibo_05\EnemyWeapon.
    117. 2021-04-30 14:54:08 INFO [tf_model_saver.py:134] Resuming training from step 0.
    118. 2021-04-30 14:54:08.310675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
    119. pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
    120. coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
    121. 2021-04-30 14:54:08.312497: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
    122. 2021-04-30 14:54:08.313568: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
    123. 2021-04-30 14:54:08.314556: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
    124. 2021-04-30 14:54:08.315571: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
    125. 2021-04-30 14:54:08.316573: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
    126. 2021-04-30 14:54:08.317563: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
    127. 2021-04-30 14:54:08.318581: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
    128. 2021-04-30 14:54:08.318707: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    129. Skipping registering GPU devices...
    130. 2021-04-30 14:54:08.319455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
    131. 2021-04-30 14:54:08.319979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
    132. 2021-04-30 14:54:08.320590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
    133. 2021-04-30 14:54:10 INFO [tf_model_saver.py:105] Loading model from results\mibo_05\EnemyWeapon.
    134. 2021-04-30 14:54:14 INFO [tf_model_saver.py:134] Resuming training from step 738840.
    135. 2021-04-30 14:54:16.456659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
    136. pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
    137. coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
    138. 2021-04-30 14:54:16.459253: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
    139. 2021-04-30 14:54:16.460371: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cublas64_10.dll'; dlerror: cublas64_10.dll not found
    140. 2021-04-30 14:54:16.461375: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
    141. 2021-04-30 14:54:16.462377: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
    142. 2021-04-30 14:54:16.463718: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
    143. 2021-04-30 14:54:16.467230: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cusparse64_10.dll'; dlerror: cusparse64_10.dll not found
    144. 2021-04-30 14:54:16.468992: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudnn64_7.dll'; dlerror: cudnn64_7.dll not found
    145. 2021-04-30 14:54:16.469181: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
    146. Skipping registering GPU devices...
    147. 2021-04-30 14:54:16.470189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
    148. 2021-04-30 14:54:16.470780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
    149. 2021-04-30 14:54:16.471399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
    150. 2021-04-30 14:54:17 INFO [tf_model_saver.py:105] Loading model from results\mibo_05\EnemyWeapon.
    151. 2021-04-30 14:54:22 INFO [model_serialization.py:205] List of nodes to export for behavior :EnemyWeapon
    152. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   is_continuous_control
    153. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   trainer_major_version
    154. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   trainer_minor_version
    155. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   trainer_patch_version
    156. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   version_number
    157. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   memory_size
    158. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   action_output_shape
    159. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   action
    160. 2021-04-30 14:54:22 INFO [model_serialization.py:207]   action_probs
    161. Converting results\mibo_05\EnemyWeapon/frozen_graph_def.pb to results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn
    162. IGNORED: Shape unknown layer
    163. IGNORED: StopGradient unknown layer
    164. GLOBALS: 'is_continuous_control', 'trainer_major_version', 'trainer_minor_version', 'trainer_patch_version', 'version_number', 'memory_size', 'action_output_shape'
    165. IN: 'visual_observation_0': [-1, 84, 84, 3] => 'policy/main_graph_0_encoder0/conv_1/BiasAdd'
    166. IN: 'vector_observation': [-1, 1, 1, 59] => 'policy/main_graph_0/hidden_0/BiasAdd'
    167. OUT: 'policy/concat/concat', 'action', 'action_probs'
    168. DONE: wrote results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn file.
    169. 2021-04-30 14:54:25 INFO [model_serialization.py:87] Exported results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn
    170. 2021-04-30 14:54:25 INFO [tf_model_saver.py:163] Copied results\mibo_05\EnemyWeapon\EnemyWeapon-738840.nn to results\mibo_05\EnemyWeapon.nn.
    171. 2021-04-30 14:54:25 INFO [trainer_controller.py:84] Saved Model
    172. Traceback (most recent call last):
    173.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run
    174.     subfeed, allow_tensor=True, allow_operation=False)
    175.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\framework\ops.py", line 3670, in as_graph_element
    176.     return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
    177.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\framework\ops.py", line 3712, in _as_graph_element_locked
    178.     "graph." % (repr(name), repr(op_name)))
    179. KeyError: "The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph."
    180.  
    181. During handling of the above exception, another exception occurred:
    182.  
    183. Traceback (most recent call last):
    184.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\runpy.py", line 193, in _run_module_as_main
    185.     "__main__", mod_spec)
    186.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\runpy.py", line 85, in _run_code
    187.     exec(code, run_globals)
    188.   File "D:\ProgramData\Anaconda3\envs\ml-agents-1.0.7\Scripts\mlagents-learn.exe\__main__.py", line 7, in <module>
    189.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\learn.py", line 277, in main
    190.     run_cli(parse_command_line())
    191.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\learn.py", line 273, in run_cli
    192.     run_training(run_seed, options)
    193.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\learn.py", line 150, in run_training
    194.     tc.start_learning(env_manager)
    195.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    196.     return func(*args, **kwargs)
    197.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 172, in start_learning
    198.     self._reset_env(env_manager)
    199.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents_envs\timers.py", line 305, in wrapped
    200.     return func(*args, **kwargs)
    201.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 110, in _reset_env
    202.     self._register_new_behaviors(env_manager, env_manager.first_step_infos)
    203.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 266, in _register_new_behaviors
    204.     self._create_trainers_and_managers(env_manager, new_behavior_ids)
    205.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 164, in _create_trainers_and_managers
    206.     self._create_trainer_and_manager(env_manager, behavior_id)
    207.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\trainer_controller.py", line 138, in _create_trainer_and_manager
    208.     parsed_behavior_id, env_manager.training_behaviors[name_behavior_id]
    209.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\ghost\trainer.py", line 322, in create_policy
    210.     self.trainer.model_saver.initialize_or_load(policy)
    211.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\model_saver\tf_model_saver.py", line 94, in initialize_or_load
    212.     self._load_graph(policy, self.model_path, reset_global_steps=reset_steps)
    213.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\mlagents\trainers\model_saver\tf_model_saver.py", line 116, in _load_graph
    214.     self.tf_saver.restore(policy.sess, ckpt.model_checkpoint_path)
    215.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\training\saver.py", line 1299, in restore
    216.     {self.saver_def.filename_tensor_name: save_path})
    217.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\client\session.py", line 958, in run
    218.     run_metadata_ptr)
    219.   File "d:\programdata\anaconda3\envs\ml-agents-1.0.7\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
    220.     e.args[0])
    221. TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph.
     
  6. aveakrause

    aveakrause

    Joined:
    Oct 3, 2018
    Posts:
    70
    I'm running into the same error:
    Code (csharp):
    TypeError: Cannot interpret feed_dict key as Tensor: The name 'save/Const:0' refers to a Tensor which does not exist. The operation, 'save/Const', does not exist in the graph.
    I'm using self-play to train a chess AI. Attempting to resume training with "--resume" while self-play is enabled gives this error every single time. This is a huge issue, because it means I can't train my AI across more than one session. A chess AI can take upwards of 50 million matches to get decent, and I can't do that all in one sitting, so resuming is needed.
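For reference, a minimal sketch of the command line being described. The `--run-id` and `--resume` flags are real `mlagents-learn` options, and the run ID `9` comes from the config posted in this thread. The helper name `build_learn_command` and the file name `chess_config.yaml` are hypothetical and only used for illustration.

```python
def build_learn_command(config_path, run_id, resume=False):
    """Build the mlagents-learn argument list for a new or resumed run."""
    cmd = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    if resume:
        # --resume continues from the checkpoints saved under the same run ID.
        cmd.append("--resume")
    return cmd


# chess_config.yaml is a placeholder name, not a file from the thread.
print(" ".join(build_learn_command("chess_config.yaml", "9", resume=True)))
# prints: mlagents-learn chess_config.yaml --run-id=9 --resume
```

The list form can be passed directly to `subprocess.run` if the training is launched from a script; resuming only makes sense with the run ID used in the earlier session.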

    Here's my config YAML (I really have no idea what most of these settings do; the docs seem to be lacking a lot on this topic):
    Code (csharp):
    default_settings:
      trainer_type: ppo
      hyperparameters:
        batch_size: 1024
        buffer_size: 10240
        learning_rate: 0.3
        learning_rate_schedule: linear
      network_settings:
        normalize: false
        hidden_units: 128
        num_layers: 2
        vis_encode_type: simple

        #memory
        use_recurrent: true
        sequence_length: 64
        memory_size: 256

      reward_signals:
        curiosity:
          gamma: 0.99
          strength: 0.02
        extrinsic:
          gamma: 0.99
          strength: 1.0
      init_path: null
      keep_checkpoints: 5
      checkpoint_interval: 500000
      max_steps: 50000000
      time_horizon: 64
      summary_freq: 50000
      threaded: true
      self_play:
        window: 10
        save_steps: 10000
        swap_steps: 10000
        play_against_latest_model_ratio: 0.5
      behavioral_cloning: null
      framework: tensorflow
    behaviors: {}
    env_settings:
      env_path: null
      env_args: null
      base_port: 5005
      num_envs: 1
      seed: -1
    engine_settings:
      width: 84
      height: 84
      quality_level: 5
      time_scale: 1
      target_frame_rate: -1
      capture_frame_rate: 60
      no_graphics: false
    environment_parameters: null
    checkpoint_settings:
      run_id: 9
      initialize_from: null
      load_model: false
      resume: false
      force: false
      train_model: true
      inference: false
    debug: false
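Since the `self_play` block is the part that matters for this thread, here is a rough, simplified sketch of what its four settings control, based on my reading of the ML-Agents configuration docs: a snapshot of the policy is saved every `save_steps`, only the most recent `window` snapshots are kept, and every `swap_steps` the opponent is replaced, using the latest policy with probability `play_against_latest_model_ratio` and a random past snapshot otherwise. This is not the library's implementation. The names `pick_opponent` and `simulate` are hypothetical, and the sketch uses a single step counter, whereas ML-Agents counts `save_steps` in trainer steps and `swap_steps` in ghost steps.

```python
import math
import random
from collections import deque


def pick_opponent(snapshots, latest, latest_ratio, rng):
    """Pick the latest policy with probability latest_ratio, else a past snapshot."""
    if not snapshots or rng.random() < latest_ratio:
        return latest
    return rng.choice(list(snapshots))


def simulate(total_steps, save_steps, swap_steps, window, latest_ratio, seed=0):
    """Toy timeline of snapshot saving and opponent swapping (illustration only)."""
    rng = random.Random(seed)
    snapshots = deque(maxlen=window)  # the oldest snapshot drops out once the window is full
    picks = []
    stride = math.gcd(save_steps, swap_steps)
    for step in range(stride, total_steps + 1, stride):
        if step % save_steps == 0:
            snapshots.append(f"snapshot@{step}")
        if step % swap_steps == 0:
            picks.append(pick_opponent(snapshots, "latest", latest_ratio, rng))
    return picks, list(snapshots)


# Values taken from the self_play block above; total_steps is an arbitrary small number.
picks, kept = simulate(150000, save_steps=10000, swap_steps=10000,
                       window=10, latest_ratio=0.5)
print(len(picks), "opponent swaps;", len(kept), "snapshots kept")
```

With `window: 10` and a snapshot every 10000 steps, opponents are drawn from roughly the last 100000 steps of training; a larger window gives more varied, but on average weaker, opponents.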
    I have a theory that the cause is a bug in the version of ML-Agents currently released in the Package Manager. I'm going to update to the GitHub version and try again. This might be a big headache, because I had to downgrade the Python package just to get training to start (from something like 0.26 to 0.20).

    Update: The theory was correct. Updating to Release 18 for ml-agents (Unity package), 0.5.0 for ml-agents-envs (Unity package), 0.27.0 for ml-agents (Python), 0.27.0 for ml-agents-envs (Python), and 0.27.0 for gym-unity (Python) fixed all of my --resume issues while using self-play.
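Because the fix came down to matching versions, a small check like the following can confirm that the Python side is where you expect before starting a long run. This is a sketch under assumptions: it uses the standard-library `importlib.metadata` module (Python 3.8 or newer), it assumes the pip distribution names are `mlagents`, `mlagents-envs`, and `gym-unity`, and the helper name `find_mismatches` is hypothetical. The 0.27.0 pins are the ones reported in the update above.

```python
from importlib import metadata

# Python-side versions reported as working in this thread.
EXPECTED = {
    "mlagents": "0.27.0",
    "mlagents-envs": "0.27.0",
    "gym-unity": "0.27.0",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The Unity package version is not visible from Python, so it still has to be checked in the Package Manager window.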
     
    Last edited: Jun 11, 2021