SAC long train time on 12 cores AMD 5900X

Discussion in 'ML-Agents' started by Roboserg, Dec 22, 2020.

  1. Roboserg

    Roboserg

    Joined:
    Jun 3, 2018
    Posts:
    83
    I played with Unity ML-Agents 9 months ago, and PPO and SAC showed no visible difference in training time. Now I have a much faster CPU: PPO can train and visualize agents at 20x speed without problems, yet with SAC Unity hiccups constantly, which I assume is the model training. Is this supposed to happen, and if so, why did I not have any problems on a much weaker 4-core CPU 9 months ago? I must be doing something wrong.
    I'm using the latest mlagents11 and the default PushBlock example with the default hyperparameters.
    It seems like only one core out of 12 is used for training.


     
    Last edited: Dec 22, 2020
  3. TreyK-47

    TreyK-47

    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,741
    I'll flag this with the team for some insight.
     
  4. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hi Roboserg, this is a known issue with PyTorch (and by extension the PyTorch version of ML-Agents): it doesn't parallelize well across CPU cores. We cap the number of CPUs it is allowed to use at 4. We're still looking at how to improve this going forward. In the meantime, running with --tensorflow should resolve the performance issues.
     
  5. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    I wanted to add that you may be able to improve performance by increasing batch_size and steps_per_update in the trainer configuration. Larger batches should parallelize better.

    If you're willing to edit some Python code, you can open cpu_utils.py and edit this line:

    Code (Python):
    return max(min(num_cpus // 2, 4), 1) if num_cpus is not None else None
    Changing the 4 to something larger (8 or 10) might work better, especially in conjunction with increasing batch size.
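
    To see what that expression yields for a given machine, here is a minimal sketch that mirrors it in a standalone function. The function name capped_thread_count and the cap parameter are hypothetical and are not part of ML-Agents; cap simply stands in for the hard-coded 4. Whether num_cpus counts physical or logical cores depends on how ML-Agents detects it, so both 12 and 24 are shown for a 5900X.

```python
from typing import Optional


def capped_thread_count(num_cpus: Optional[int], cap: int = 4) -> Optional[int]:
    """Mirror the quoted expression, with the hard-coded 4 exposed as `cap`.

    Hypothetical helper for illustration only; not an ML-Agents API.
    """
    if num_cpus is None:
        return None
    # Use half of the detected CPUs, at most `cap`, and never fewer than 1.
    return max(min(num_cpus // 2, cap), 1)


if __name__ == "__main__":
    # Compare the default cap against the larger values suggested above.
    for cpus in (4, 12, 24):
        results = {cap: capped_thread_count(cpus, cap) for cap in (4, 8, 10)}
        print(f"num_cpus={cpus}: {results}")
```

    Note the halving step: if num_cpus is detected as 12, a cap of 8 or 10 still gives only 6 threads, so the full benefit of a larger cap appears only when more CPUs are detected.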
     
  6. kokimitsunami

    kokimitsunami

    Joined:
    Sep 2, 2021
    Posts:
    11
    Hi,
    I'd like to try changing the number of CPU cores used for training, as mentioned above. I found where the code is located but cannot find any instructions on building or running it. Am I correct in understanding that I need to rebuild it as a Python package and then run the mlagents-learn command? It would be great if anyone could share how to build and run it.
    Thanks.
     
    Last edited: Dec 1, 2022