
gym_unity : how to handle the multiagent case

Discussion in 'ML-Agents' started by Procuste, Feb 19, 2020.

  1. Procuste

    Procuste

    Joined:
    Feb 10, 2020
    Posts:
    12
    Hello,

    I'm currently getting started with gym_unity, and I'm trying to work with environments that contain multiple agents (all sharing the same Brain, though).
    The gym_unity code says:
    "When end of episode is reached, you are responsible for calling `reset()` to reset this environment's state."
    However, when only one agent has terminated its episode (done = True for that particular agent), calling env.reset() also resets the other agents.
    So I tried not calling env.reset() and just kept calling env.step(), but in that case the agent that terminated its episode gets assigned action 0 on the next step.

    Here is a little demo to show you my problem.
    I slightly modified the Basic environment (the problem can also be seen with the unmodified Basic environment, but I changed it to make the issue obvious).
    So here is the environment:
    The agent starts at mPosition = 10. Action 0 makes it go right (mPosition is decreased by 1) while action 1 makes it go left (mPosition is increased by 1). The observation vector is simply composed of mPosition. When the agent reaches one of the two goals (mPosition = m_SmallGoalPosition or mPosition = m_LargeGoalPosition), the episode terminates for that agent.
    There are two agents in the environment. They have different m_SmallGoalPosition and m_LargeGoalPosition values. I attached a Decision Requester to each agent, with a decision period of 1.
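
    To make those dynamics concrete, here is a plain-Python paraphrase of what each of my modified agents does (illustrative only: the class and its names are made up, and the reward values are simply chosen to match the output further down):

    Code (CSharp):
    # Plain-Python paraphrase of one modified Basic agent (not the real C# code).
    # Action 0 decreases mPosition, action 1 increases it; reaching either goal
    # ends the episode and the agent resets its own position to the start.
    class BasicAgentModel:
        def __init__(self, small_goal, large_goal, start=10):
            self.small_goal = small_goal
            self.large_goal = large_goal
            self.start = start
            self.pos = start

        def step(self, action):
            self.pos += 1 if action == 1 else -1
            done = self.pos in (self.small_goal, self.large_goal)
            # -0.01 per step, +1.0 on reaching a goal, hence the 0.99 in the output
            reward = -0.01 + (1.0 if done else 0.0)
            if done:
                self.pos = self.start  # the agent resets its own position on episode end
            return [float(self.pos)], reward, done
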
    In Python, here is my code:

    Code (CSharp):
    env = UnityEnvPerso("../../../unitydev/builtenvs/Basic_modified_v5", worker_id=0, multiagent=True)

    o = env.reset()
    timestep = 0
    while timestep < 20:
        o, r, dones, _ = env.step([1, 1])
        print(o, r, dones)

        timestep += 1

    env.close()
    And here is the output:
    Code (CSharp):
    [array([10.], dtype=float32), array([10.], dtype=float32)]
    [array([11.], dtype=float32), array([11.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([13.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([14.], dtype=float32)] [0.99, -0.01] [True, False]
    [array([9.], dtype=float32), array([15.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([16.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([10.], dtype=float32)] [-0.01, 0.99] [False, True]
    [array([12.], dtype=float32), array([9.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([10.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([11.], dtype=float32)] [0.99, -0.01] [True, False]
    [array([9.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([13.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([14.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([15.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([16.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([10.], dtype=float32)] [0.99, 0.99] [True, True]
    [array([9.], dtype=float32), array([9.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([10.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([11.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    As you can see in my Python code, I always tell the agents to go left (the action vector is [1, 1], so action 1 for both agents). Nonetheless, you can see in the output that an agent sometimes executes action 0 (its observation drops to 9). In fact, each agent executes action 0 on the first timestep following the timestep on which it terminated its episode.

    I have read in the docs that when agents don't receive the action they requested, they automatically execute action 0. I think that this is what's happening here.
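
    A quick check on the numbers above supports that (this is just post-processing the printed observations of the first agent, no Unity involved):

    Code (CSharp):
    # Observations of the first agent, copied from the first rows of the output above.
    obs_agent0 = [10, 11, 12, 13, 10, 9, 10, 11]
    for t in range(1, len(obs_agent0)):
        delta = obs_agent0[t] - obs_agent0[t - 1]
        print(t, delta)
    # With action 1 the delta should always be +1 (apart from the jump back to 10
    # on the done step). The -1 at t == 5, right after the done step, is consistent
    # with the agent having executed action 0 on that step.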

    I accept that not calling env.reset() is bad practice, but how can one interact with such an environment (2 or more agents) through gym_unity without being forced to call env.reset()?
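
    The only workaround I can think of for now is to keep stepping without resetting and to simply drop the transition that immediately follows an agent's done step, since that is the step where the requested action gets replaced by action 0. Something along these lines (a rough sketch reusing the env from the snippet above, not tested):

    Code (CSharp):
    # Rough sketch of the workaround: keep stepping, but do not store the
    # transition right after an agent's episode ended, because on that step
    # the agent actually executed action 0 instead of the requested action.
    prev_dones = [False, False]
    obs = env.reset()
    for timestep in range(20):
        actions = [1, 1]
        next_obs, rewards, dones, _ = env.step(actions)
        for i in range(len(dones)):
            if prev_dones[i]:
                continue  # drop this transition: agent i actually executed action 0 here
            # otherwise store (obs[i], actions[i], rewards[i], next_obs[i], dones[i])
        prev_dones = list(dones)
        obs = next_obs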

    Thank you very much.
     

    Last edited: Feb 19, 2020
  2. jeffrey_unity538

    jeffrey_unity538

    Unity Technologies

    Joined:
    Feb 15, 2018
    Posts:
    59
    hi Procuste - have you tried just having one agent in the scene, and doing something like --num-envs=X in mlagents-learn instead?
     
  3. Procuste

    Procuste

    Joined:
    Feb 10, 2020
    Posts:
    12
    Hi, thank you for your response, but my concern is interacting with the environment directly from Python, in order to later train my own algorithms.
    For those interested, I have actually implemented a new Gym wrapper around the UnityEnvironment of mlagents_envs. This wrapper can provide information about the different agents (as long as they share the same type of Brain) without the problem that I showed in this post. I described the method it uses in the file.
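
    To give a rough idea of the direction (this is a simplified sketch of the bookkeeping idea, not the actual code from the repo): the wrapper keeps per-agent state on the Python side, so each agent's stream of (observation, reward, done) can be consumed as its own sequence of episodes, and a single agent finishing never forces a global env.reset().

    Code (CSharp):
    # Simplified sketch of the per-agent bookkeeping idea (not the actual wrapper
    # code). It sits on top of a multiagent step() that returns per-agent lists,
    # e.g. [obs_0, obs_1], [r_0, r_1], [done_0, done_1].
    class PerAgentEpisodeTracker:
        def __init__(self, n_agents):
            self.returns = [0.0] * n_agents                  # running return per agent
            self.finished_returns = [[] for _ in range(n_agents)]

        def add_step(self, rewards, dones):
            for i, (r, d) in enumerate(zip(rewards, dones)):
                self.returns[i] += r
                if d:
                    # agent i ended its own episode; the others keep going
                    self.finished_returns[i].append(self.returns[i])
                    self.returns[i] = 0.0

    A trainer can then treat each agent's transitions as its own sequence of episodes, which is all that is needed when all agents share the same policy.
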
    NOTE: The wrapper that I built doesn't support:
    • flattening of branched actions
    • action_mask
    • visual observations
    Moreover, I'm planning on:
    • allowing a decision period of more than 1
    • allowing multiple types of Brains in one environment (not supported by the current gym_unity wrapper)
    Again, if you're interested: https://github.com/Procuste34/Unity-MLAgents/blob/master/gym_wrapper/gym_wrapper.py
    Maybe I should propose it to the devs?
     
  4. jeffrey_unity538

    jeffrey_unity538

    Unity Technologies

    Joined:
    Feb 15, 2018
    Posts:
    59
    hi Procuste - let me link your note to a couple of devs on the team
     
    Procuste likes this.