# Agents' learning is stuck - unsure how to approach the issue

Discussion in 'ML-Agents' started by MihaelDiklic, Aug 3, 2021.

1. ### MihaelDiklic

Joined:
Apr 7, 2021
Posts:
11
INTRO:
I'm making a game for my college final project: a 2D Dead by Daylight-esque game in which the survivors are agents controlled by a neural network and the killer is the player. Over 8 hours of training, the killer manages to win around 1000+ games while the survivors only win around 100+, and the killer isn't all that good itself.

THE PROBLEM:
The survivors' neural network stops improving at some point (even though the learning-rate schedule is constant), while still falling short of the expected level of results.

THE SETUP AND THE DATA:
The survivors' current task is to locate 3 objectives, pick them up, and drop them off at a drop-off zone. The survivors' neural network uses the poca trainer type, since the 3 agents work together as a team. The survivors' rewards are:
• (Agent reward) Each time they touch a wall = -0.01
• (Agent and group reward) Object pick-up = +0.5
• (Agent and group reward) Object drop-off = +0.5
• (Agent and group reward) Death = -0.5
• (Group reward) Win = +5
• (Group reward) Lose = -5
• (Group reward) No winner/loser = -2.5
The observations being given to each survivor (manually) are:
• Agent x and y velocity (normalized)
• For each other survivor (only if visible): their x and y velocity, a dot value indicating whether they are in front of/behind the agent, and a dot value indicating whether they are to its left/right
• Two dot values per objective, indicating whether the objective is in front of/behind and to the left/right of the agent
• Two dot values for the drop-off zone, indicating whether the zone is in front of/behind and to the left/right of the agent
• Two dot values for the killer (only if visible), indicating whether the killer is in front of/behind and to the left/right of the agent
• Whether the agent is holding an item (they can't pick up more than one item before dropping it off)
A survivor or the killer is only visible to a survivor if they enter that survivor's cone of detection.
The survivors' other observations come from 2D RayPerceptionSensors, which can only detect walls.
There are 3 continuous actions the survivors can make:
• How much force to apply to itself on the X axis (for movement)
• How much force to apply to itself on the Y axis (for movement)
• How much to rotate itself by on the Z axis
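For reference, the cone-of-detection visibility test described above can be sketched as a simple angle check. The function and parameter names here are hypothetical, since the thread doesn't include the actual visibility code:

```python
import math

def in_detection_cone(agent_pos, agent_forward, target_pos, fov_degrees, max_range):
    """Return True if the target falls inside the agent's cone of detection.

    agent_forward is assumed to be a unit-length 2D direction vector;
    fov_degrees is the full cone angle; max_range is a hypothetical view distance.
    """
    dx = target_pos[0] - agent_pos[0]
    dy = target_pos[1] - agent_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_range:
        # A target on top of the agent counts as visible; beyond range does not.
        return dist == 0
    # Angle between the forward vector and the direction to the target.
    dot = (agent_forward[0] * dx + agent_forward[1] * dy) / dist
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    return angle <= fov_degrees / 2.0
```

The same test in Unity would typically use `Vector2.Angle` against the agent's transform-forward, but the underlying math is identical.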
I am also using a neural network to play the killer during training, since using an actual player to train the network would be a lot slower. The killer's performance also leaves a lot to be desired, but it is more successful than the survivors because it has a simpler goal; since it is only used for training, it is of much less concern.

The trainer configuration I am using to train the network is attached at the bottom as a .txt file. I have tried playing around with the config, but it didn't produce any more favorable results than these.
The environment I'm training my NNs in completely randomizes itself each episode.
The graphs generated by tensorboard are attached at the bottom as pictures as well.

THE QUESTIONS:

How do I approach this issue? Should I split each agent into multiple agents, creating a hierarchy in which high-level agents with their own brains feed their outputs as observations to low-level agents? Somehow that feels like too complex a solution for a problem that doesn't seem all that complicated.
I've taken a look at the provided examples, focusing mostly on the Dungeon Escape example, and it uses so few observations to achieve a lot. I understand my problem is complex, but is it really that complex? What are your ideas and suggestions? Please help!

#### Attached Files:

• (attachment, 1.9 KB)
• (attachment, 76 KB)
• (attachment, 82.7 KB)
• policyGraphs.png (117.1 KB)

2. ### mbaske

Joined:
Dec 31, 2017
Posts:
472
Have you tried a heuristic that emulates player behavior instead? That should give you better insight into the survivor agents' training progress, and you would be able to control the killer agent's skill level, i.e. the difficulty the survivor agents are training against.
Is agent movement constrained to world x/y axes? Or can it move with respect to its own forward direction / z-axis rotation? If it's the latter, then I would localize all observations, so they are in the agent's local frame of reference.

Try simplifying all your dot-value observations to normalized angles: SignedAngle(forward, object_position - agent_position) / 180. That way you'll have a single float per observed direction, which is also linear rather than sinusoidal.
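A minimal sketch of that suggestion, using atan2 to reproduce in plain 2D math what SignedAngle(forward, object_position - agent_position) / 180 would produce (function and parameter names are illustrative):

```python
import math

def signed_angle_obs(agent_pos, agent_forward, target_pos):
    """Single observation in [-1, 1]: the signed angle from the agent's
    forward direction to the target, divided by 180. Mirrors
    Vector3.SignedAngle(forward, target - agent, Vector3.forward) / 180,
    with counter-clockwise angles positive."""
    dx = target_pos[0] - agent_pos[0]
    dy = target_pos[1] - agent_pos[1]
    forward_angle = math.atan2(agent_forward[1], agent_forward[0])
    target_angle = math.atan2(dy, dx)
    delta = target_angle - forward_angle
    # Wrap into (-pi, pi] so the observation stays in (-1, 1].
    while delta <= -math.pi:
        delta += 2 * math.pi
    while delta > math.pi:
        delta -= 2 * math.pi
    return math.degrees(delta) / 180.0
```

One float per observed object replaces the two dot values, and the value is linear in the angle rather than sinusoidal.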
Alternatively, you can test if visual / grid observations yield better results than direction values.

3. ### MihaelDiklic

I haven't tried using a heuristic as the player to train the survivors, but I will give it a go. However, I feel like this won't really solve the issue: the main problem with the survivors' performance isn't that the killer kills them too quickly, it's that they struggle with picking up and dropping off objectives. I will attach a video showcasing the trained agents in a minute.

The agents' movement is constrained to the world x/y axes, so no need to localize observations there, but I have measured the agents' min and max velocity in order to normalize it.

I initially used Vector3.SignedAngle, but I thought that might be an issue, since its return value is between 0 and 1 while all the other values I'm feeding into the network are normalized between -1 and 1. Do you think I should still use SignedAngle in spite of that?

How would you go about implementing grid based observations? Do I need to split the environment into cells, place them in a grid, and then feed the agents not only information on their angle towards something, but also the cell in which the agent is and the cell in which that something is? Or am I overthinking this? Thank you for the reply!

4. ### MihaelDiklic

Here's a video example of my agents after roughly 8 hours of training.
The killer is red, the survivors are blue, the objectives are green and the drop off zone is that white circle.

5. ### mbaske

Are you sure you were using Vector3.SignedAngle, not Vector3.Angle? Vector3.SignedAngle should also give you negative values.
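To illustrate the difference: an unsigned angle (as returned by Vector3.Angle) maps left and right onto the same value, while a signed angle keeps the side information. A small standalone 2D sketch, assuming rotation about the +z axis as in Vector3.SignedAngle(from, to, Vector3.forward):

```python
import math

def unsigned_angle(a, b):
    """Like Vector3.Angle: always in [0, 180], sign information lost."""
    dot = a[0] * b[0] + a[1] * b[1]
    na, nb = math.hypot(*a), math.hypot(*b)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))

def signed_angle_2d(a, b):
    """Like Vector3.SignedAngle about +z: in (-180, 180],
    positive for counter-clockwise rotation from a to b."""
    cross = a[0] * b[1] - a[1] * b[0]
    dot = a[0] * b[0] + a[1] * b[1]
    return math.degrees(math.atan2(cross, dot))

forward = (1.0, 0.0)
left = (0.0, 1.0)    # 90 degrees counter-clockwise of forward
right = (0.0, -1.0)  # 90 degrees clockwise of forward
# unsigned_angle gives 90 for both; signed_angle_2d gives +90 vs -90.
```

So if a trained policy can't tell left from right, an unsigned angle observation is a likely culprit.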
The visual observations would replace the direction values. I think the easiest approach for this would be placing an orthographic camera above the environment. Perhaps put additional renderers on all observable objects which are only visible to that camera, and give them distinct material colors, e.g. killer: red, agent: green, wall: blue. This way you wouldn't have to deal with partitioning the environment into a grid. The ml-agents grid sensor doesn't work with 2D physics anyway, AFAIK.

6. ### MihaelDiklic

No, I was using SignedAngle, but I just realized that I had been feeding it the wrong values when I tested it.
I will replace all of my Vector3.Dots with SignedAngles.

Alright, that sounds like a promising idea, thank you! I will give it a shot and will report back my findings.
I have one question though. Since the survivors are not supposed to see the killer until the killer is in their FOV and vice-versa, how would I go about handling that while using your approach with the orthographic camera? Would I just make 2 cameras (or a camera per agent) and somehow ignore rendering anything that isn't visible to the agent?
Btw, I've posted a video example of the agents after training so you can take a better look at their behaviour.

7. ### mbaske

No problem - you're right: with an orthographic camera sensor per agent, each sensor would need to see different objects, depending on whether they are in the agent's FOV. That would require renderers placed on different layers, with each camera's culling mask set to distinct layers. Well, maybe that's too complicated after all... perhaps try regular camera sensors attached to the agents instead? Or, if you want to check out my grid sensor, take a look at GitHub - mbaske/grid-sensor: Grid Sensor Components for Unity ML-Agents. Since your game is 2D, it wouldn't be able to detect gameobjects out of the box, though; you would need to use the basic sensor component and write values to its buffer.

8. ### MihaelDiklic

I will check out everything you said first thing after work, if that's okay with you, and I'll tell you everything then. Again, thanks a bunch for helping out.

9. ### MihaelDiklic

@mbaske I've tried out using Vector3.SignedAngle instead of Vector3.Dot. I trained the network for about 6 hours, but the results were basically the same, sadly.
I'm going to try attaching a camera sensor to each agent and see what results that will yield, since that option is easier to implement. If that doesn't work, I will try out your grid sensor and see what results that will get me.

What do you think of the idea I mentioned in my initial post, about separating the problem into multiple layers of agents? The top-level agent would, in my case, gather observations on the positions of objectives, obstacles and threats, and output a position in the world, which it would then feed to the bottom-level agent as an observation.
The bottom-level agent could be trained beforehand, and the only thing it would know is how to evade obstacles while getting from point A to point B.
I feel like that idea could positively influence the NN, if the problem I am facing is simply that my network is trying to learn too many things at the same time. However, I have no clue whether that is the case. What do you think?

Also, what do you think of my trainer config? Are there any glaring mistakes in it which could be negatively impacting the training?

10. ### mbaske

I think it depends on the type of agent movement. If it's somewhat complex, then separating behavior components can help. I had some success doing this for agents who had to coordinate leg movements or quadcopter rotor thrusts. In those cases, I would first train them to move in a given direction, and then train another policy to supply the direction vector. I'm not sure how much could be gained with this approach, though, if the movement is as simple as applying an overall force.

Regarding the config file, I usually base mine on whichever ml-agents example project most closely resembles what I'm trying to do. But I don't have a strong intuition about exactly how changing beta, epsilon, lambda etc. will influence training. 64 does seem rather small for your time horizon, though. My guess is it should be larger, because picking up and dropping off items sounds like long-term behavior.
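For orientation, a hypothetical POCA trainer config showing where time_horizon sits in the file; all names and values below are illustrative, not the thread's actual attachment:

```yaml
behaviors:
  Survivor:                        # hypothetical behavior name
    trainer_type: poca
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      learning_rate_schedule: constant
    network_settings:
      hidden_units: 256
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    time_horizon: 1024             # raised from 64: pick-up/drop-off is long-term behavior
    max_steps: 2.0e7
```

time_horizon controls how many steps of experience are collected per agent before estimating the remaining return from the value function, so longer-horizon tasks generally benefit from larger values.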

11. ### MihaelDiklic

I've also noticed that my time horizon is very small. I've increased it to its maximum recommended value and I'll give it a whirl.
I also wanted to ask what you think about using LSTM in my trainer config.

It seems like something that would be useful to me, but I wanted to make sure beforehand, since they recommend switching from continuous to discrete actions, so I would have to rewire that part of my code somewhat.
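In case it helps, memory (LSTM) is enabled through the network_settings.memory section of the trainer config; the values below are only an illustration:

```yaml
network_settings:
  memory:
    memory_size: 128       # size of the recurrent hidden state
    sequence_length: 64    # length of experience sequences fed to the LSTM
```

This fragment would go under the behavior's existing network_settings block.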

12. ### mbaske

I haven't really used LSTM myself yet. I'm not sure it would help your agents though, because as I understand your setup, the agents don't need to remember any past observations - or do they? If it's just a matter of keeping track of the number of items an agent has already moved, you could probably encode that info as an observation, so it's part of the agent's current state.

13. ### MihaelDiklic

What you say about my setup is true, but from my (limited) understanding, this sounds like it would help the agents remember which actions, in which order, lead to which results.
What I mean is, for example: if my velocity is x and y and the killer's velocity is z and q, then I know from previous experience how to change my velocity more easily, thanks to an added layer of memory. Or even something like: if I'm holding an item, I should always head towards the drop-off zone, and if I'm not, I should head towards an objective.
However, I understand this is something the neural network should sort of figure out on its own, even without LSTM; I just thought it might be of extra help. I could be completely wrong though, and if you know more about this, I would greatly appreciate your input.

14. ### MihaelDiklic

@mbaske Hey there, just wanted to tell you that I've tried out camera sensors on the agents, and I haven't really had any luck. The success rates of the killer and the survivors are about the same as before, which is very disheartening.
I don't understand how the Dungeon Escape example uses so few observations and is so successful. My environment should be around the same difficulty, perhaps even easier since it is 2D.
I really have no clue where to go from here. Would you perhaps take a look at my project if I made the git repository public?

15. ### carlosm

Joined:
Sep 17, 2015
Posts:
7
Any luck? I'm stuck with sort of the same problem.