Question Using multiple PCs to train a single AI - How to connect all PCs to a single mlagents-learn server?

Discussion in 'ML-Agents' started by CloudyVR, Apr 30, 2023.

  1. CloudyVR

    CloudyVR

    Joined:
    Mar 26, 2017
    Posts:
    715
    I have three PCs in my apartment that I would like to use for distributed training of a single neural network.

    I am wondering whether it is possible to launch the mlagents-learn server with the option --num-envs 3 and then, on each PC, launch a game instance whose agent connects to the PC running the mlagents-learn server.

    I am not sure what options to supply to tell the agent to connect to a server at an address other than localhost. How can I specify a network IP address for each game instance so that they connect to the training server on the LAN?

    Thank you for any guidance!!
     
  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    I do distributed training using Docker Swarm, but there may be an easier way to do it on a purely local network. Let me know if you'd like more details on that.
     
  3. CloudyVR

    CloudyVR

    Joined:
    Mar 26, 2017
    Posts:
    715

    Thanks! I'd really like to know how to do distributed learning using all of my render towers.

    I'd like to use a central GPU-enabled PC for CUDA while using three other PCs for the physics simulations.

    Do you use the mlagents-learn command with the --env parameter to launch the training on multiple PCs?

    So far I have not found any way to run the mlagents-learn command on one PC and launch training environments on separate PCs.

    I have looked through the mlagents source code and tried modifying it, but I was never able to figure out how to launch multiple environments or specify a network address for other instances to connect to. Only the port number can be specified, because the address is hard-coded as localhost.
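
    To illustrate the limitation in general terms, here is a minimal sketch using plain Python sockets. It is not ML-Agents code, and the names serve_once and connect are hypothetical. A listener bound to 127.0.0.1 is reachable only from the same machine, while one bound to 0.0.0.0 accepts connections from the LAN, and the connecting side needs a host parameter instead of a fixed localhost.

```python
import socket
import threading
from typing import Tuple


def serve_once(bind_host: str, port: int = 0) -> Tuple[int, threading.Thread]:
    """Listen on bind_host, answer one client with b"ready", then stop.

    bind_host="127.0.0.1" accepts only local clients;
    bind_host="0.0.0.0" accepts clients from other machines too.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_host, port))  # port 0 lets the OS pick a free port
    srv.listen(1)
    chosen_port = srv.getsockname()[1]

    def handle() -> None:
        conn, _ = srv.accept()
        with conn:
            conn.sendall(b"ready")
        srv.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return chosen_port, thread


def connect(host: str, port: int, timeout: float = 5.0) -> bytes:
    """Connect to host:port (host is a parameter, not hard-coded) and read until EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

    On a LAN, the listening PC would call serve_once("0.0.0.0", port) with a fixed port, and the other PCs would call connect with that PC's IP address. The firewall on the listening PC must also allow the chosen port.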

    Any ideas on how to overcome this limit of one PC? My PC struggles when I launch 8 concurrent environments, and the lag spikes cause my agents to sometimes make mistakes, which I have noticed can harm training. I'd really like to spread training across the multiple PCs we have here to reduce lag spikes for physics-intensive training, and hopefully to speed up training a bit as well!
     
  4. CloudyVR

    CloudyVR

    Joined:
    Mar 26, 2017
    Posts:
    715
    Bump. I am using ML-Agents almost daily now and hoping to train more effectively using distributed training. I am spending so much time waiting for an agent to find a policy because I can only train on a single PC, while four other towers sit doing nothing and my main tower's CPU is nearly maxed out. Is there anything I can do? Having to do all training on one PC costs so much time.

    I feel like it should be very simple to make ML-Agents train on multiple PCs, but despite spending a week looking at the Python code I could not figure out a solution.

    Could someone help me figure out how to do distributed training? Please.
     
  5. Mister2023

    Mister2023

    Joined:
    May 10, 2023
    Posts:
    2
    Training a single AI across multiple PCs can significantly speed up the process and enhance learning. To connect all your PCs to a single mlagents-learn server, you can utilize network configurations like IP addresses and port forwarding. Make sure each PC is on the same network and reachable by the server. For more detailed guidance and insights on this topic, you might want to explore resources at https://topaigenerators.com. They offer a range of AI-related information that could be beneficial for your project. Happy AI training!
     
  6. CloudyVR

    CloudyVR

    Joined:
    Mar 26, 2017
    Posts:
    715
    Hi!

    Are you saying that mlagents already supports network training on multiple PCs?

    I was not able to find any information about, or even any mention of, network configuration other than a port number and the local loopback address.

    After many days of work I managed to rewrite the mlagents-learn server and Python package to accept an external IP address, and it works exceptionally well: https://forum.unity.com/threads/training-ml-agents-in-cloud.1471692/#post-9206103

    However, if it's possible to use mlagents over a network natively, that would be an even better solution!

    Is distributed network training possible in the latest mlagents package, or were you referring to something else?

    Thank you!
     
    Last edited: Aug 17, 2023