
Resolved ML-Agent server build (headless) with visual observations and multiple GPUs

Discussion in 'ML-Agents' started by qiwu57kevin, Jan 20, 2021.

  1. qiwu57kevin

    Joined: Apr 18, 2019
    Posts: 5
    I am running my custom mlagents environment on a server with a Unity server build. However, I have some questions about this server build that I hope can be answered:

    • I am using visual observations in my environment. However, the server has no monitor, so it is not able to render the images. If I build in headless mode, there will be no visual output. Is there any way I could run it on a headless server with visual observations? I have been stuck on this for several days, and any help or suggestions would be greatly appreciated!
    • I have multiple GPUs (4 RTX 2080s) on my server, and I am wondering whether mlagents can be configured to use all or some of them. Since I have heavy visual inputs, and I changed your CNN to a more complex MobileNet, I may need multiple GPUs. If not, is there another way to achieve this?
    Thank you!
     
  2. celion_unity

    Joined: Jun 12, 2019
    Posts: 289
    (Copying and pasting my response from GitHub - let's continue the discussion here and close the GitHub issue.)

    Hi,
    You'll need something like xvfb to render visual observations on a remote machine without a display. There's an example of setting this up in the Colab notebook here (under the "Setup" section).
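    For reference, a minimal virtual-display setup on the server (assuming the xvfb and x11-utils packages are installed; display number :1 is just an example) looks roughly like this:
    Code (Bash):
    # start a virtual framebuffer on display :1 and point Unity at it
    Xvfb :1 -screen 0 1024x768x24 &
    export DISPLAY=:1
    # optional: confirm the virtual display is actually up before launching training
    xdpyinfo -display :1 > /dev/null && echo "virtual display OK"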

    We don't currently support multiple GPUs for training. We have a request for this logged already as MLA-662 in our internal tracker, but there's no timeline for implementing it.

    Torch will use a single GPU as long as it's set up correctly. You'll need to install a torch version that is compatible with your CUDA version, though - otherwise it won't detect your GPU devices and will run on the CPU instead.
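    A quick way to sanity-check the torch/CUDA setup (assuming torch is installed in the same Python environment you run mlagents-learn from; the config path and run id below are placeholders) is something like:
    Code (Bash):
    # should print the torch version, the CUDA version it was built against,
    # True, and the number of GPUs it can see (4 in your case)
    python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
    # to pin training to a single card, e.g. the first RTX 2080:
    CUDA_VISIBLE_DEVICES=0 mlagents-learn <your_config>.yaml --run-id=<run_id>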
     
  3. qiwu57kevin

    Joined: Apr 18, 2019
    Posts: 5
    Hi Celion! Sure I will continue the discussion here!

    I checked the setup in the Colab notebook and set up the virtual display according to this gist: https://gist.github.com/jterrace/2911875. After that, I opened a CLI on the server and ran the following command:
    Code (CSharp):
    mlagents-learn config_gesthor_ppo.yaml --run-id=test --force
    However, it still gives me a UnityTimeOutException. It stopped at the screen shown below for a long time and then popped up this error. It looks like it found the path to the executable but was looking for a screen.

    I am not sure whether I set up the virtual display correctly, but I could not start the training. Any advice would help a lot!
    [Screenshot of the console output: upload_2021-1-20_15-58-6.png]
     
  4. qiwu57kevin

    Joined: Apr 18, 2019
    Posts: 5
    And for multiple GPUs, I think it is fine to use only one if multiple GPUs are not supported yet. I hope this support comes soon!
     
  5. qiwu57kevin

    Joined: Apr 18, 2019
    Posts: 5
    Ah, I found that I should add xvfb-run before my command in order to use the virtual display! The timeout error went away and I am waiting for the training result!
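    Putting it together with the command from my earlier post, the full invocation was something like this (the exact xvfb-run arguments may vary):
    Code (Bash):
    xvfb-run mlagents-learn config_gesthor_ppo.yaml --run-id=test --force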
     
  6. celion_unity

    Joined: Jun 12, 2019
    Posts: 289
    Great, glad you got training working with xvfb-run.

    I looked into this a little more today, and there are at least two problems going on:
    1) The Unity executable crashes when Camera.Render() is called by the Camera Sensor, without logging any useful error message. I think failing fast here is actually preferable to not crashing, but we can at least log an error if SystemInfo.graphicsDeviceType == GraphicsDeviceType.Null (see the sketch below).
    2) Even though the executable crashes right away, mlagents-learn hangs until it times out and then displays the "The Unity environment took too long to respond" message that you saw. So it's a bad experience and a misleading error message.
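    The Unity-side check I have in mind is roughly the following. This is just a sketch, not the actual Camera Sensor code - the class and method names here are made up:
    Code (CSharp):
    using UnityEngine;
    using UnityEngine.Rendering;

    // Hypothetical guard that a camera-based sensor could run before rendering.
    public static class HeadlessRenderCheck
    {
        public static bool CanRenderVisualObservations()
        {
            // In a -nographics/server build without a display, no graphics device is created.
            if (SystemInfo.graphicsDeviceType == GraphicsDeviceType.Null)
            {
                Debug.LogError("No graphics device is available, so visual observations cannot be rendered. " +
                               "Run the executable under a virtual display (e.g. xvfb-run) or remove -nographics.");
                return false;
            }
            return true;
        }
    }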

    I've got these logged as MLA-1713 and MLA-1712 respectively in our bug tracker. The first should be an easy fix, but the second will take some more digging.
     