gym_unity: how to handle the multiagent case

Discussion in 'ML-Agents' started by Procuste, Feb 19, 2020.

  1. Procuste

    Procuste

    Joined:
    Feb 10, 2020
    Posts:
    12
    Hello,

    I'm currently getting started with gym_unity, and I'm trying to work with environments that contain multiple agents (all sharing the same Brain, though).
    The gym_unity code says:
    "When end of episode is reached, you are responsible for calling `reset()` to reset this environment's state."
    However, when only one agent has terminated its episode (done = True for that particular agent), calling env.reset() would also reset the other agents.
    So I tried not calling env.reset() and just kept calling env.step(), but in that case the agent that terminated its episode gets assigned action 0 on the next step.

    Here is a little demo to show you my problem.
    I slightly modified the Basic environment (the problem can also be seen with the original Basic environment, but I modified it to make it more obvious).
    So here is the environment:
    The agent starts at mPosition = 10. Action 0 makes it go right (mPosition is decreased by 1) while action 1 makes it go left (mPosition is increased by 1). The observation vector is simply composed of mPosition. When the agent reaches one of the two goals (mPosition = m_SmallGoalPosition or mPosition = m_LargeGoalPosition), the episode terminates for that agent.
    There are two agents in the environment. They have different m_SmallGoalPosition and m_LargeGoalPosition values. I attached a Decision Requester to both agents, with a decision period of 1.
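    To make the expected behaviour explicit, here is a toy Python model of the dynamics I just described (this is only a sketch of what I expect to happen, not the actual C# agent):

    Code (Python):
    # Toy model of the modified Basic environment described above
    # (a sketch, not the actual C# agent; the names mirror my description).
    class ToyAgent:
        def __init__(self, small_goal, large_goal):
            self.small_goal = small_goal
            self.large_goal = large_goal
            self.m_position = 10  # starting position

        def step(self, action):
            # action 0 decreases mPosition by 1, action 1 increases it by 1
            self.m_position += 1 if action == 1 else -1
            done = self.m_position in (self.small_goal, self.large_goal)
            if done:
                self.m_position = 10  # the agent respawns at 10 in the scene
            return self.m_position, done

    # With action 1 sent at every step, each observation should simply
    # increase by 1 until a goal is reached, then restart from 10.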
    In Python, here is my code:

    Code (Python):
    env = UnityEnvPerso("../../../unitydev/builtenvs/Basic_modified_v5", worker_id=0, multiagent=True)

    o = env.reset()
    timestep = 0
    while timestep < 20:
        o, r, dones, _ = env.step([1, 1])
        print(o, r, dones)

        timestep += 1

    env.close()
    And here is the output:
    Code (text):
    [array([10.], dtype=float32), array([10.], dtype=float32)]
    [array([11.], dtype=float32), array([11.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([13.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([14.], dtype=float32)] [0.99, -0.01] [True, False]
    [array([9.], dtype=float32), array([15.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([16.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([10.], dtype=float32)] [-0.01, 0.99] [False, True]
    [array([12.], dtype=float32), array([9.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([10.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([11.], dtype=float32)] [0.99, -0.01] [True, False]
    [array([9.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([13.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([14.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([15.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([16.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([10.], dtype=float32)] [0.99, 0.99] [True, True]
    [array([9.], dtype=float32), array([9.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([10.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([11.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    As you can see in my Python code, I always tell the agents to go left (the action vector is [1, 1], i.e. action 1 for both agents). Nonetheless, you can see in the output that an agent executed action 0 (its observation drops to 9). In fact, it executed action 0 on the first timestep following the timestep on which it terminated its episode.

    I have read in the docs that when agents don't receive the action they requested, they automatically execute action 0. I think that this is what's happening here.
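    One workaround I can think of on the Python side is to remember which agents were done on the previous step and simply not use the transition that follows, since that agent was stepped with the default action 0 rather than the action I sent. A rough sketch of what I mean (it reuses env from my code above and is only an illustration, not something gym_unity provides):

    Code (Python):
    # Sketch: skip the transition right after an agent's episode ends, because
    # that agent executed the default action 0 instead of the action we sent.
    num_agents = 2
    prev_dones = [False] * num_agents

    obs = env.reset()
    for timestep in range(20):
        actions = [1] * num_agents
        next_obs, rewards, dones, _ = env.step(actions)
        for i in range(num_agents):
            if prev_dones[i]:
                continue  # spurious step for this agent: do not store it
            # otherwise, store (obs[i], actions[i], rewards[i], next_obs[i], dones[i])
        prev_dones = list(dones)
        obs = next_obs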

    I accept that not calling env.reset() is a bad habit, but how can one interact with such an environment (2 or more agents) through gym_unity without being forced to call env.reset()?

    Thank you very much.
     


    Last edited: Feb 19, 2020
  2. jeffrey_unity538

    jeffrey_unity538

    Unity Technologies

    Joined:
    Feb 15, 2018
    Posts:
    59
    hi Procuste - have you tried having just one agent in the scene, and doing something like --num-envs=X with mlagents-learn instead?
     
  3. Procuste

    Procuste

    Joined:
    Feb 10, 2020
    Posts:
    12
    Hi, thank you for your response, but my concern is interacting with the environment directly from Python, in order to later train my own algorithms.
    For those interested, I have actually been reimplementing a Gym wrapper around the UnityEnvironment of mlagents_envs. This wrapper can provide information about the different agents (as long as they share the same type of Brain) without the problem I showed in this post. I describe the method it uses in the file.
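    To give an idea of the general approach, here is a very simplified sketch of per-agent bookkeeping directly on top of mlagents_envs (this is not the wrapper itself; behavior_specs / get_steps / set_actions come from a newer mlagents_envs release, so the exact names and signatures may differ depending on the version):

    Code (Python):
    import numpy as np
    from mlagents_envs.environment import UnityEnvironment

    # Very simplified sketch, not the actual wrapper from the link below.
    env = UnityEnvironment(file_name="Basic_modified_v5")  # hypothetical build path
    env.reset()
    behavior_name = list(env.behavior_specs.keys())[0]

    for _ in range(20):
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        # Agents listed in terminal_steps just finished their episode: their
        # terminal reward/observation can be recorded here, per agent id,
        # without a global env.reset(). Unity respawns them on its own.
        for agent_id in terminal_steps.agent_id:
            pass  # per-agent bookkeeping goes here

        # Agents listed in decision_steps are waiting for an action.
        n_requesting = len(decision_steps.agent_id)
        actions = np.ones((n_requesting, 1), dtype=np.int32)  # action 1 for everyone
        env.set_actions(behavior_name, actions)
        env.step()

    env.close()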
    NOTE: The wrapper I built doesn't support:
    • flattening branched actions
    • action_mask
    • visual observations
    Moreover, I'm planning on:
    • allowing a decision period greater than 1
    • allowing multiple types of Brains in one environment (not supported by the current gym_unity wrapper)
    Again, if you're interested: https://github.com/Procuste34/Unity-MLAgents/blob/master/gym_wrapper/gym_wrapper.py
    Maybe I should propose it to the devs?
     
  4. jeffrey_unity538

    jeffrey_unity538

    Unity Technologies

    Joined:
    Feb 15, 2018
    Posts:
    59
    hi Procuste - let me link your note to a couple of devs on the team
     
    Procuste likes this.