gym_unity: how to handle the multiagent case

Discussion in 'ML-Agents' started by Procuste, Feb 19, 2020.

  1. Procuste

    Procuste

    Joined:
    Feb 10, 2020
    Posts:
    12
    Hello,

    I'm currently getting started with gym_unity, and I'm trying to work with environments that contain multiple agents (all sharing the same Brain, though).
    The gym_unity code says:
    "When end of episode is reached, you are responsible for calling `reset()` to reset this environment's state."
    However, when only one agent has terminated its episode (done = True for that particular agent), calling env.reset() would also reset the other agents.
    So I tried not calling env.reset() and just kept calling env.step(), but in that case the agent that terminated its episode gets assigned action 0 on the next step.

    Here is a little demo to show you my problem.
    I slightly modified the Basic environment (the problem can also be seen with the original Basic environment, but I modified it to make it more obvious).
    So here is the environment:
    The agent starts at mPosition = 10. Action 0 makes it go right (mPosition is decreased by 1) while action 1 makes it go left (mPosition is increased by 1). The observation vector is simply composed of mPosition. When the agent reaches one of the two goals (mPosition = m_SmallGoalPosition or mPosition = m_LargeGoalPosition), the episode terminates for that agent.
    There are two agents in the environment. They have different m_SmallGoalPosition and m_LargeGoalPosition values. I attached a Decision Requester to both agents, with a decision period of 1.
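    To make the expected behaviour explicit, here is a toy Python model of the dynamics I just described (this is only a sketch of what I expect to happen, not the actual C# agent):

    Code (Python):
    # Toy model of the modified Basic environment described above
    # (a sketch, not the actual C# agent; the names mirror my description).
    class ToyAgent:
        def __init__(self, small_goal, large_goal):
            self.small_goal = small_goal
            self.large_goal = large_goal
            self.m_position = 10  # starting position

        def step(self, action):
            # action 0 decreases mPosition by 1, action 1 increases it by 1
            self.m_position += 1 if action == 1 else -1
            done = self.m_position in (self.small_goal, self.large_goal)
            if done:
                self.m_position = 10  # the agent respawns at 10 in the scene
            return self.m_position, done

    # With action 1 sent at every step, each observation should simply
    # increase by 1 until a goal is reached, then restart from 10.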
    In Python, here is my code:

    Code (Python):
    env = UnityEnvPerso("../../../unitydev/builtenvs/Basic_modified_v5", worker_id=0, multiagent=True)

    o = env.reset()
    timestep = 0
    while timestep < 20:
        o, r, dones, _ = env.step([1, 1])
        print(o, r, dones)

        timestep += 1

    env.close()
    And here is the output:
    Code (text):
    [array([10.], dtype=float32), array([10.], dtype=float32)]
    [array([11.], dtype=float32), array([11.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([13.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([14.], dtype=float32)] [0.99, -0.01] [True, False]
    [array([9.], dtype=float32), array([15.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([16.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([10.], dtype=float32)] [-0.01, 0.99] [False, True]
    [array([12.], dtype=float32), array([9.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([10.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([11.], dtype=float32)] [0.99, -0.01] [True, False]
    [array([9.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([13.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([14.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([15.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([13.], dtype=float32), array([16.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([10.], dtype=float32)] [0.99, 0.99] [True, True]
    [array([9.], dtype=float32), array([9.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([10.], dtype=float32), array([10.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([11.], dtype=float32), array([11.], dtype=float32)] [-0.01, -0.01] [False, False]
    [array([12.], dtype=float32), array([12.], dtype=float32)] [-0.01, -0.01] [False, False]
    As you can see in my Python code, I always tell the agents to go left (the action vector is [1, 1], i.e. action 1 for both agents). Nonetheless, you can see in the output that an agent executed action 0 (its observation drops to 9). In fact, it executed action 0 on the first timestep following the timestep on which it terminated its episode.

    I have read in the docs that when agents don't receive the action they requested, they automatically execute action 0. I think that this is what's happening here.
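    One workaround I can think of on the Python side is to remember which agents were done on the previous step and simply not use the transition that follows, since that agent was stepped with the default action 0 rather than the action I sent. A rough sketch of what I mean (it reuses env from my code above and is only an illustration, not something gym_unity provides):

    Code (Python):
    # Sketch: skip the transition right after an agent's episode ends, because
    # that agent executed the default action 0 instead of the action we sent.
    num_agents = 2
    prev_dones = [False] * num_agents

    obs = env.reset()
    for timestep in range(20):
        actions = [1] * num_agents
        next_obs, rewards, dones, _ = env.step(actions)
        for i in range(num_agents):
            if prev_dones[i]:
                continue  # spurious step for this agent: do not store it
            # otherwise, store (obs[i], actions[i], rewards[i], next_obs[i], dones[i])
        prev_dones = list(dones)
        obs = next_obs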

    I accept that not calling env.reset() is a bad habit, but how can one interact with such an environment (2 or more agents) through gym_unity without being forced to call env.reset()?

    Thank you very much.
     


    Last edited: Feb 19, 2020
  2. jeffrey_unity538

    jeffrey_unity538

    Unity Technologies

    Joined:
    Feb 15, 2018
    Posts:
    59
    hi Procuste - have you tried having just one agent in the scene, and doing something like --num-envs=X with mlagents-learn instead?
     
  3. Procuste

    Procuste

    Joined:
    Feb 10, 2020
    Posts:
    12
    Hi, thank you for your response, but my concern is interacting with the environment directly from Python, in order to later train my own algorithms.
    For those interested, I have actually been reimplementing a Gym wrapper around the UnityEnvironment of mlagents_envs. This wrapper can provide information about the different agents (as long as they share the same type of Brain) without the problem I showed in this post. I describe the method it uses in the file.
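    To give an idea of the general approach, here is a very simplified sketch of per-agent bookkeeping directly on top of mlagents_envs (this is not the wrapper itself; behavior_specs / get_steps / set_actions come from a newer mlagents_envs release, so the exact names and signatures may differ depending on the version):

    Code (Python):
    import numpy as np
    from mlagents_envs.environment import UnityEnvironment

    # Very simplified sketch, not the actual wrapper from the link below.
    env = UnityEnvironment(file_name="Basic_modified_v5")  # hypothetical build path
    env.reset()
    behavior_name = list(env.behavior_specs.keys())[0]

    for _ in range(20):
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        # Agents listed in terminal_steps just finished their episode: their
        # terminal reward/observation can be recorded here, per agent id,
        # without a global env.reset(). Unity respawns them on its own.
        for agent_id in terminal_steps.agent_id:
            pass  # per-agent bookkeeping goes here

        # Agents listed in decision_steps are waiting for an action.
        n_requesting = len(decision_steps.agent_id)
        actions = np.ones((n_requesting, 1), dtype=np.int32)  # action 1 for everyone
        env.set_actions(behavior_name, actions)
        env.step()

    env.close()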
    NOTE: The wrapper I built doesn't support:
    • flattening branched actions
    • action_mask
    • visual observations
    Moreover, I'm planning on:
    • allowing a decision period greater than 1
    • allowing multiple types of Brains in one environment (not supported by the current gym_unity wrapper)
    Again, if you're interested: https://github.com/Procuste34/Unity-MLAgents/blob/master/gym_wrapper/gym_wrapper.py
    Maybe I should propose it to the devs?
     
  4. jeffrey_unity538

    jeffrey_unity538

    Unity Technologies

    Joined:
    Feb 15, 2018
    Posts:
    59
    hi Procuste - let me link your note to a couple of devs on the team
     
    Procuste likes this.