Bug Setting Torch Device results in runtime error

Discussion in 'ML-Agents' started by ChillX, Feb 2, 2022.

  1. ChillX

     Joined: Jun 16, 2016
     Posts: 145
    Command Line arguments:
    --torch-device="cuda:1"

    YAML config file settings:
    torch_settings:
      device: cuda:1
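    For context, torch_settings sits at the top level of the run configuration file, alongside behaviors. A minimal sketch of the full file (the behavior name and trainer type below are placeholders, not from my actual config):

    ```yaml
    behaviors:
      MyBehavior:          # placeholder behavior name
        trainer_type: ppo  # placeholder trainer type
    torch_settings:
      device: cuda:1
    ```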


    Python stack trace:

    Traceback (most recent call last):
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 176, in start_learning
    n_steps = self.advance(env_manager)
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\trainer_controller.py", line 234, in advance
    new_step_infos = env_manager.get_steps()
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\env_manager.py", line 124, in get_steps
    new_step_infos = self._step()
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 408, in _step
    self._queue_steps()
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 302, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 544, in _take_step
    step_tuple[0], last_step.worker_id
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\policy\torch_policy.py", line 204, in get_action
    run_out = self.evaluate(decision_requests, global_agent_ids)
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\policy\torch_policy.py", line 171, in evaluate
    tensor_obs, masks=masks, memories=memories
    File "d:\cxmlunity\ml-agents\ml-agents-envs\mlagents_envs\timers.py", line 305, in wrapped
    return func(*args, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\policy\torch_policy.py", line 133, in sample_actions
    obs, masks, memories, seq_len
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\torch\networks.py", line 639, in get_action_and_stats
    inputs, memories=memories, sequence_length=sequence_length
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\torch\networks.py", line 245, in forward
    encoding = self._body_endoder(encoded_self)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "d:\cxmlunity\ml-agents\ml-agents\mlagents\trainers\torch\layers.py", line 169, in forward
    return self.seq_layers(input_tensor)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
    input = module(input)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\modules\linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
    File "C:\Users\username\.conda\envs\UnityML\lib\site-packages\torch\nn\functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
    RuntimeError: Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 1 does not equal 0 (while checking arguments for addmm)
     
  2. ChillX
    The workaround I'm using is to edit ml-agents\mlagents\torch_utils\torch.py
    and hardcode the torch device using the following edits:

    Comment out:
    _device = torch.device("cpu")

    Replace with:
    _device = torch.device("cuda:1")
    torch.cuda.set_device(1)

    In the function set_torch_config, hardcode the cuda device below the if statement that checks torch_settings.device:
    device_str = "cuda:1"
    _device = torch.device(device_str)

    In def default_device(), hardcode the cuda device:
    # return _device
    return torch.device("cuda:1")

    With these three edits the cuda device is hardcoded to cuda:1 everywhere except in threaded mode.
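    The edits above amount to overriding ml-agents' device-selection logic so every call site sees the same device string instead of some tensors defaulting to cuda:0. As a minimal, torch-free sketch of that precedence (the function and parameter names here are illustrative stand-ins, not the real mlagents.torch_utils API):

    ```python
    def resolve_device(settings_device=None, force_device=None):
        """Pick the torch device string.

        settings_device mirrors torch_settings.device from the YAML config
        (or the --torch-device flag); force_device mirrors the hardcoded
        workaround, which wins over everything else.
        """
        if force_device is not None:
            # The workaround: ignore the config and always use this device.
            return force_device
        if settings_device is not None:
            return settings_device
        # Fall back to CPU when nothing is specified.
        return "cpu"


    # Without the override, the configured device is only partially honored
    # (some tensors end up on cuda:0, triggering the addmm device mismatch).
    print(resolve_device("cuda:1"))
    print(resolve_device("cuda:1", force_device="cuda:1"))
    ```

    The key point is that the hardcoded value takes priority over both the CLI flag and the YAML setting, which is why it sidesteps the mismatch.
    
    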