Slow Training on a Server

Discussion in 'ML-Agents' started by kpalko, Jun 12, 2020.

  1. kpalko

    Joined: May 14, 2020
    Posts: 5

    Hi all,

    I'm setting up a server environment so that I can move training off my personal laptop. With the RollerBall example, training runs successfully in both environments, but it is about 5x faster on my Mac than on the server. In both cases, I'm training from the command line against a server executable.

    Current versions:
    ml-agents: 0.16.1
    ml-agents-envs: 0.16.1
    Communicator API: 1.0.0
    TensorFlow: 2.2.0
    Unity ml-agents: 1.1.0

    Hardware:
    Mac: 8 cores, 2.3 GHz Intel Core i9, 16 GB RAM
    Server: 10 cores, Intel Xeon @ 2.50 GHz, 40 GB RAM

    Can anyone suggest which logs I should look at to investigate, or offer any ideas about why this might happen?

    My guess is that I have set up my Linux executable incorrectly, but I'm not sure of the best way to verify that.
     
    Last edited: Jun 13, 2020
  2. kpalko

    OK, I've managed to get the training speed on the server close to what I had on my local machine.

    After some investigation, we determined that the server's CPUs were oversubscribed by 2x, meaning there were far more threads being started than the CPUs could effectively utilize.
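
    One way to check for this kind of mismatch is sketched below. It is not necessarily how the problem was diagnosed here; it only compares the CPU count the host reports with the cores the current process is allowed to use, and shows whether OMP_NUM_THREADS is set. The helper names are hypothetical, and the affinity check is Linux-only and does not see cgroup CPU quotas.

```python
import os


def available_cores():
    """Cores this process may run on.

    On Linux this respects CPU affinity / cpusets, which schedulers often
    use to limit a job. It does NOT account for cgroup CPU quotas.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity (e.g. macOS).
    return os.cpu_count() or 1


def oversubscription_ratio(thread_count, cores):
    """Threads per usable core; values above 1.0 suggest oversubscription."""
    return thread_count / max(cores, 1)


if __name__ == "__main__":
    cores = available_cores()
    print("logical CPUs reported by the host:", os.cpu_count())
    print("cores usable by this process:", cores)
    print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS", "<unset>"))
```

    If the host reports more CPUs than the process may use and OMP_NUM_THREADS is unset, a library that sizes its thread pool from the host count can start more threads than there are usable cores.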

    The fix: set the OMP_NUM_THREADS environment variable to match the number of cores available on the server.

    In my case, I'm writing a bash script that reads the number of cores available per task and sets the variable to that value.
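
    The same idea can be sketched as a small Python launcher rather than bash. This is an illustration under assumptions, not the script used here: it assumes the per-task core count comes from the SLURM_CPUS_PER_TASK variable (set by Slurm when --cpus-per-task is requested) and falls back to the process's CPU affinity otherwise. The function names are hypothetical.

```python
import os
import subprocess


def cores_per_task(environ=os.environ):
    """Number of cores the training job should use.

    ASSUMPTION: the scheduler is Slurm and exports SLURM_CPUS_PER_TASK.
    Other schedulers use different variables; adjust as needed.
    """
    value = environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # Fallback: cores this process is allowed to run on (Linux),
    # or the host's CPU count where affinity is not available.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def training_env(environ=os.environ):
    """Copy of the environment with OMP_NUM_THREADS set to the core count."""
    env = dict(environ)
    env["OMP_NUM_THREADS"] = str(cores_per_task(environ))
    return env


def launch(command):
    """Run the training command with the adjusted environment."""
    return subprocess.run(command, env=training_env(), check=True)


# Example call (config path, build path, and run id are placeholders):
# launch(["mlagents-learn", "<trainer-config>", "--env=<build>", "--run-id=<id>"])
```

    Setting the variable in the child's environment, rather than inside the training process, ensures it is in place before any threaded library initializes its thread pool.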

    I hope this might help someone who encounters a similar problem.
     