Slow Training on a Server

Discussion in 'ML-Agents' started by kpalko, Jun 12, 2020.

  1. kpalko


    May 14, 2020
    Hi all,

    I am currently working on getting my server environment fully functional so that I can move training off my personal laptop. Using the RollerBall example, I can train successfully in both environments, but training runs about 5x faster locally on my Mac than on the server. In both cases, I'm training from the command line against a server executable.

    Current versions:
    ml-agents: 0.16.1
    ml-agents-env: 0.16.1
    Communicator API: 1.0.0
    Tensorflow 2.2.0
    Unity ml-agents: 1.1.0

    Mac: 8 cores, 2.3 GHz 8-Core Intel Core i9, 16GB RAM
    Server: 10 cores, Intel Xeon @ 2.50GHz, 40GB RAM

    Can anyone suggest which logs I should look at to investigate, or offer any ideas about why this might happen?

    My guess is that I may have set up my Linux executable incorrectly, but I'm not sure of the best way to verify that.
    Last edited: Jun 13, 2020
  2. kpalko


    May 14, 2020
    OK, I've managed to get the training speed on the server close to what I had on my local machine.

    After some investigation, we determined that the server's CPUs were oversubscribed by 2x, meaning there were far more threads being started than the CPUs could effectively utilize.
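One rough way to check for this kind of oversubscription on Linux is to compare a process's thread count with the number of cores it is allowed to use. This is a sketch only, not necessarily how the problem was diagnosed here; the helper names `thread_count` and `oversubscription_ratio` are hypothetical, and a high ratio is only a hint, since many threads may be idle.

```shell
#!/usr/bin/env bash
# Sketch: compare a process's thread count with its usable cores (Linux only).

# Print the number of threads of the process with the given PID.
thread_count() {
    awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Print the ratio of threads to usable cores, to one decimal place.
oversubscription_ratio() {
    local threads cores
    threads=$(thread_count "$1")
    cores=$(nproc)   # cores available to this process (honours CPU affinity)
    awk -v t="$threads" -v c="$cores" 'BEGIN {printf "%.1f\n", t / c}'
}

# Example: inspect the current shell; replace $$ with the trainer's PID.
echo "threads=$(thread_count $$) cores=$(nproc) ratio=$(oversubscription_ratio $$)"
```

A ratio well above 1 for the trainer process (or summed over the trainer and its environment processes) would be consistent with the oversubscription described above.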

    The fix: Set the environment variable that controls the thread count to match the number of cores available on the server.

    In my case, I'm writing a bash script that reads the number of cores available per task and sets the variable to that value.
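A minimal sketch of such a script follows. The post names neither the scheduler nor the variable, so both are assumptions here: `SLURM_CPUS_PER_TASK` (set by Slurm when a per-task CPU count is requested) stands in for "cores available per task", and `OMP_NUM_THREADS` stands in for the thread-count variable. Whether that is the right variable depends on the libraries in use. The helper name `allotted_cores` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: cap thread pools at the number of cores allocated to this task.
# ASSUMPTIONS: SLURM_CPUS_PER_TASK and OMP_NUM_THREADS are illustrative
# choices; the original post does not name the scheduler or the variable.

# Print the cores allotted to this task, falling back to what the OS reports.
allotted_cores() {
    if [[ "${SLURM_CPUS_PER_TASK:-}" =~ ^[1-9][0-9]*$ ]]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        nproc
    fi
}

export OMP_NUM_THREADS="$(allotted_cores)"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

# Launch training afterwards so child processes inherit the variable, e.g.:
# mlagents-learn <trainer-config> --env=<path-to-build> --run-id=<run-id>
```

Exporting the variable before launching the trainer matters, because a variable set after the process starts is not seen by it.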

    I hope this might help someone who encounters a similar problem.