Bug Setting Torch Device results in runtime error

Discussion in 'ML-Agents' started by ChillX, Feb 2, 2022.

  1. ChillX

     Joined: Jun 16, 2016
     Posts: 145
    Command Line arguments:
    --torch-device="cuda:1"

    YAML config file settings:
    torch_settings:
      device: cuda:1
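    For context, torch_settings sits at the top level of the run configuration file, alongside behaviors. A minimal sketch of the full file (the behavior name and trainer type below are placeholders, not from my actual config):

    ```yaml
    behaviors:
      MyBehavior:          # placeholder behavior name
        trainer_type: ppo  # placeholder trainer type
    torch_settings:
      device: cuda:1
    ```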


    Python stack trace:

    Traceback (most recent call last):
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\env_manager.py", line 124, in get_steps
    new_step_infos = self._step()
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 408, in _step
    self._queue_steps()
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 302, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 544, in _take_step
    step_tuple[0], last_step.worker_id
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\policy\torch_policy.py", line 204, in get_action
    run_out = self.evaluate(decision_requests, global_agent_ids)
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\policy\torch_policy.py", line 171, in evaluate
    tensor_obs, masks=masks, memories=memories
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\policy\torch_policy.py", line 133, in sample_actions
    obs, masks, memories, seq_len
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\torch\networks.py", line 639, in get_action_and_stats
    inputs, memories=memories, sequence_length=sequence_length
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\torch\networks.py", line 245, in forward
    encoding = self._body_endoder(encoded_self)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\torch\layers.py", line 169, in forward
    return self.seq_layers(input_tensor)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
    input = module(input)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
    RuntimeError: Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 1 does not equal 0 (while checking arguments for addmm)
     
  2. ChillX
    The workaround I'm using is to edit ml-agents\mlagents\torch_utils\torch.py
    and hardcode the torch device using the following edits:

    Comment out:
    _device = torch.device("cpu")

    Replace with:
    _device = torch.device("cuda:1")
    torch.cuda.set_device(1)

    In the function set_torch_config, hardcode the cuda device below the if statement that checks torch_settings.device:
    device_str = "cuda:1"
    _device = torch.device(device_str)

    In def default_device(), hardcode the cuda device:
    # return _device
    return torch.device("cuda:1")

    With these three edits the cuda device is hardcoded to cuda:1 everywhere except in threaded mode.
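    The edits above amount to overriding ml-agents' device-selection logic so every call site sees the same device string instead of some tensors defaulting to cuda:0. As a minimal, torch-free sketch of that precedence (the function and parameter names here are illustrative stand-ins, not the real mlagents.torch_utils API):

    ```python
    def resolve_device(settings_device=None, force_device=None):
        """Pick the torch device string.

        settings_device mirrors torch_settings.device from the YAML config
        (or the --torch-device flag); force_device mirrors the hardcoded
        workaround, which wins over everything else.
        """
        if force_device is not None:
            # The workaround: ignore the config and always use this device.
            return force_device
        if settings_device is not None:
            return settings_device
        # Fall back to CPU when nothing is specified.
        return "cpu"


    # Without the override, the configured device is only partially honored
    # (some tensors end up on cuda:0, triggering the addmm device mismatch).
    print(resolve_device("cuda:1"))
    print(resolve_device("cuda:1", force_device="cuda:1"))
    ```

    The key point is that the hardcoded value takes priority over both the CLI flag and the YAML setting, which is why it sidesteps the mismatch.
    
    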