
FoodCollector example neural network

Discussion in 'ML-Agents' started by betike, Nov 16, 2020.

  1. betike

    betike

    Joined:
    May 28, 2019
    Posts:
    18
    Hello,

    Can someone explain how the FoodCollector example works with regard to the neural network? There are 5 agents and only 1 neural network, so while training, are observations sent to the NN from only one particular agent (i.e. just 1 agent sends observations to the NN), or is the NN multi-threaded, taking observations from all 5 agents concurrently?

    Thanks

    Monica
     
  2. ruoping_unity

    ruoping_unity

    Unity Technologies

    Joined:
    Jul 10, 2020
    Posts:
    134
    The observations collected from all agents are stored in the same buffer, and when doing a policy update, the trainer randomly pulls observations from that buffer.
    So there are multiple agents running concurrently, all sending their observations to the same buffer, and there is one neural network that is updated using observations from all agents, pulled from that buffer.
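
    Here is a minimal Python sketch of that idea (illustrative only, not actual ML-Agents code; the names are made up):

        import random

        buffer = []  # one buffer shared by every agent

        def collect(observation, action, reward):
            # every agent appends its experience to the same shared buffer
            buffer.append((observation, action, reward))

        def sample_batch(batch_size=64):
            # the policy update pulls a random batch, regardless of
            # which agent produced each sample
            return random.sample(buffer, min(batch_size, len(buffer)))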
     
    Kreuzkuemmelll and betike like this.
  3. betike

    betike

    Joined:
    May 28, 2019
    Posts:
    18
    Thanks so much for the explanation! So when pulling from the buffer, does the NN pull agent1's observations and send back actions to agent1, then pull agent2's observations and send back actions to agent2, and so on? I basically want to know whether these agents are trained independently or not. If they are independent, then it's as if each agent had its own neural network.

    Thanks a lot!

    Monica
     
  4. ruoping_unity

    ruoping_unity

    Unity Technologies

    Joined:
    Jul 10, 2020
    Posts:
    134
    The answer is no; they're basically using the same network.

    For agents with the same behavior, their observations are collected into the same buffer during training, and the model is trained on whatever data it randomly pulls from it.

    When generating actions, the environment sends the model a batch of observations indexed by agent id, and the model generates a batch of actions, one for each agent id. So each agent gets its own action based on its own observation. The agents then perform their actions, get new observations, store the new observation-action pairs in the buffer, request the next actions from the model with the new observations, and so on.
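
    A rough Python sketch of that action-generation step (hypothetical names, not the ML-Agents API; model stands for any callable mapping a batch of observations to a batch of actions):

        import numpy as np

        def generate_actions(model, observations_by_id):
            # batch observations by agent id, run a single forward pass
            # through the one model, then route actions back by the same ids
            agent_ids = sorted(observations_by_id)
            obs_batch = np.stack([observations_by_id[i] for i in agent_ids])
            action_batch = model(obs_batch)  # one pass serves all agents
            return {i: a for i, a in zip(agent_ids, action_batch)}

        # toy model: the same weights applied to every agent's observation
        toy_model = lambda obs: obs @ np.ones((3, 2))
        actions = generate_actions(toy_model, {0: np.zeros(3), 1: np.ones(3)})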
     
    betike likes this.
  5. betike

    betike

    Joined:
    May 28, 2019
    Posts:
    18
    Hello,

    Thank you for your detailed explanation.

    Let me make sure I understood correctly:

    1. The multiple agents in the FoodCollector example are trained independently, even though they use a single neural network, with the aid of the buffer, because the buffer keeps track of agent ids, receives observations tagged with agent ids, and sends back actions tagged with agent ids, correct?
    2. The agents don't share any common weights of the NN?
    3. Do agents share a policy while training?
    4. Is there any difference if I assign each agent its own NN, compared to how the training is done now with 1 NN?

    I am working on my Master's thesis in RL, using Unity to observe emergent behaviors in multi-agent systems, so I need to know exactly how the training happens.

    Thanks a lot! I really appreciate any help in understanding this!
     
    Last edited: Nov 20, 2020
  6. ruoping_unity

    ruoping_unity

    Unity Technologies

    Joined:
    Jul 10, 2020
    Posts:
    134
    You can think of it like this: say you have one model (a neural network) for one agent. You run inference with the agent to collect observations and store them in a replay buffer of a certain size. When the buffer is full, you randomly pull batches of observations to update the model. That's the typical workflow.

    Now you want to accelerate the whole process by collecting observations faster. You still have only one model, but you create five identical agents which all use that same model, and you run inference with all of them simultaneously. That way you fill up the buffer five times faster, since you're collecting from five agents, but all observations and actions are generated from the same model with the same behavior. When the buffer is full, you randomly pull batches of observations to update that one model; this step is exactly the same as in the single-agent setting. This is what FoodCollector is doing.

    So:
    1. You can't say they are trained individually. There's always only one model, and the agents are basically copies with the same policy, so if you update the model, all agents have the same new model. The buffer knows nothing about agent ids; it's just a collection of observation-action-reward samples that we need for training.
    2. The network and its weights are exactly the same in all agents.
    3. Agents share the same policy all the time, during both training and inference. The only thing that differs among the agents is their observations, and feeding different observations into the same model gives them different actions, which is how you collect a variety of observations faster (see the sketch below).
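
    A self-contained toy version of that loop (illustrative names and a stand-in environment, not the ML-Agents API):

        import random
        import numpy as np

        rng = np.random.default_rng(0)
        NUM_AGENTS, OBS_DIM, ACT_DIM, BUFFER_SIZE = 5, 4, 2, 100

        weights = rng.normal(size=(OBS_DIM, ACT_DIM))  # the one shared model

        buffer = []
        observations = {i: rng.normal(size=OBS_DIM) for i in range(NUM_AGENTS)}

        while len(buffer) < BUFFER_SIZE:
            # one forward pass produces an action for every agent copy
            obs_batch = np.stack([observations[i] for i in range(NUM_AGENTS)])
            actions = obs_batch @ weights
            for i in range(NUM_AGENTS):
                reward = float(rng.normal())  # stand-in for the environment
                buffer.append((observations[i], actions[i], reward))
                observations[i] = rng.normal(size=OBS_DIM)  # stand-in next obs

        batch = random.sample(buffer, 32)  # random pull; agent ids are irrelevant
        # a single gradient step on weights here would change the behavior
        # of all five agents at once, since they all share the same model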


    Hope that clears up your questions.
     
    JezMK and betike like this.