INTRO:

I'm making a game for my college final project: a 2D Dead by Daylight-esque game in which the survivors are agents of a neural network and the killer is the player. Over 8 hours of training, the killer wins around 1000+ games while the survivors win only around 100, and the killer isn't all that good itself.

THE PROBLEM:

The survivors' neural network stops learning at some point (even though the learning rate schedule is constant), while still being unable to produce the expected level of results.

THE SETUP AND THE DATA:

The survivors' current task is to locate 3 objectives, pick them up, and drop them off at a drop-off zone. The survivors' neural network uses the poca trainer type, since the 3 agents work together as a team.

The survivors' rewards are:
- (Agent reward) Each time they touch a wall = -0.01
- (Agent and group reward) Object pick-up = +0.5
- (Agent and group reward) Object drop-off = +0.5
- (Agent and group reward) Death = -0.5
- (Group reward) Win = +5
- (Group reward) Lose = -5
- (Group reward) No winner/loser = -2.5

The observations given to each survivor (manually) are:
- The agent's x and y velocity (normalized)
- For each other survivor (only if visible): their x and y velocity, a dot value representing whether they are in front of/behind the agent, and a dot value representing whether they are to the left/right of the agent
- Two dot values per objective, representing whether the objective is in front of/behind and to the left/right of the agent
- Two dot values for the drop-off zone, representing whether the zone is in front of/behind and to the left/right of the agent
- Two dot values for the killer (only if visible), representing whether the killer is in front of/behind and to the left/right of the agent
- Whether the agent is holding an item (they can't pick up more than one item before dropping it off)

A survivor or the killer is only visible to a survivor if they enter that survivor's cone of detection.
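For clarity, here is roughly how those front/behind and left/right dot values are computed. This is just a Python sketch of the math (the actual implementation is C# in Unity, and the function name here is only for illustration):

```python
import math

def relative_dots(agent_pos, agent_angle, target_pos):
    """Return (front_back, left_right) dot values for a target,
    as seen from an agent facing agent_angle (radians)."""
    # Unit vectors for the agent's facing and right-hand directions.
    forward = (math.cos(agent_angle), math.sin(agent_angle))
    right = (math.sin(agent_angle), -math.cos(agent_angle))
    # Normalized direction from the agent to the target.
    dx = target_pos[0] - agent_pos[0]
    dy = target_pos[1] - agent_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return 0.0, 0.0
    dx, dy = dx / dist, dy / dist
    # Dot products: +1 = directly ahead (or to the right),
    # -1 = directly behind (or to the left), 0 = perpendicular.
    front_back = forward[0] * dx + forward[1] * dy
    left_right = right[0] * dx + right[1] * dy
    return front_back, left_right

# Target directly ahead of an agent at the origin facing +X:
print(relative_dots((0, 0), 0.0, (5, 0)))  # → (1.0, 0.0)
```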
The survivors' other observations come from 2D RayPerceptionSensors, which can only detect walls.

There are 3 continuous actions the survivors can take:
- How much force to apply to itself on the X axis (for movement)
- How much force to apply to itself on the Y axis (for movement)
- How much to rotate itself by on the Z axis

I am also using a neural network to play the killer during training, since training against an actual player would be a lot slower. The killer's performance also leaves a lot to be desired, but it is more successful than the survivors because it has a simpler goal, and since it is only used for training, it is of much less concern.

The trainer configuration I am using is attached at the bottom as a .txt file. I have tried playing around with the config, but it didn't produce any more favorable results than these. The environment I'm training my NNs in completely randomizes itself each episode. The TensorBoard graphs are attached at the bottom as pictures as well.

THE QUESTIONS:

How do I approach this issue? Should I split my agents into multiple agents, creating a hierarchy of high-level agents with their own brains that feed their outputs as observations to low-level agents? That feels like too complex a solution for a problem that doesn't seem all that complicated. I've looked at the provided examples, focusing mostly on the Dungeon Escape example, and it uses so few observations to achieve a lot. I understand my problem is complex, but is it really that complex?

What are your ideas and suggestions? Please help!
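In case the attached .txt is hard to read: my config follows the standard ml-agents poca layout. The values below are illustrative placeholders showing the shape of the file, not my exact settings (apart from the constant learning rate schedule mentioned above):

```yaml
behaviors:
  Survivor:
    trainer_type: poca
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 3.0e-4
      learning_rate_schedule: constant
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    time_horizon: 128
    max_steps: 5.0e7
    summary_freq: 10000
```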