
Question Agents keep doing one thing that they are punished for

Discussion in 'ML-Agents' started by mrshinx, May 30, 2023.

  1. mrshinx

    mrshinx

    Joined:
    Dec 6, 2018
    Posts:
    3
    I'm trying to train my rabbit agents to collect food in a closed environment. There are four walls around the area that the agents should not touch. If they stay in contact with the wall, they are punished every step.
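
For reference, the penalty is applied roughly like this (a simplified sketch; the class name, tag, and reward value are illustrative, not my exact code):

Code (CSharp):
using Unity.MLAgents;
using UnityEngine;

public class RabbitAgent : Agent
{
    // Punish the agent on every physics step it stays in contact with a wall.
    private void OnCollisionStay(Collision collision)
    {
        if (collision.gameObject.CompareTag("Wall"))
        {
            AddReward(-0.01f); // illustrative value
        }
    }
}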

After some time, instead of collecting food (which is available everywhere), they all start to hug the wall despite being heavily punished for doing so. I just can't understand this behavior; it seems like they are trying to pursue the most negative reward. As seen in the pictures, the rabbits are pressed against the top-left wall corner.

[Images: upload_2023-5-30_9-21-8.png, upload_2023-5-30_9-22-12.png]

And yes, they have a Ray Perception Sensor to detect the walls:

[Image: upload_2023-5-30_9-23-21.png]

I'm at step 62M (about 1.7 days of training) and they are accumulating a lot of negative reward because of this:

[Image: upload_2023-5-30_9-24-7.png]

    Any help is appreciated!
     


  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Can you provide an overview of other positive/negative rewards your environment is assigning?

    This often happens in cases where the policy cannot find any positive reward signal to optimize and tries to mitigate negative reward accumulation by ending the episode as fast as possible (suicide).
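
A quick way to see why ending the episode is "optimal" here: with a constant per-step penalty $c > 0$ and discount $\gamma < 1$, never terminating returns

$$G_\infty = -\sum_{t=0}^{\infty} \gamma^t c = -\frac{c}{1-\gamma},$$

while terminating after $k$ steps returns

$$G_k = -\sum_{t=0}^{k-1} \gamma^t c = -\frac{c\,(1-\gamma^k)}{1-\gamma} > G_\infty,$$

so if the policy finds no positive signal, the shortest episode always wins.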

I've often seen border/wall seeking when the action space collapses and the agent chooses the same action continuously. You can tell whether this is happening by looking at entropy in TensorBoard (the Policy/Entropy curve): it will crash to 0, and the policy will collapse and not recover.
     
  3. mrshinx

    mrshinx

    Joined:
    Dec 6, 2018
    Posts:
    3
The agents are given observations of their hunger and thirst meters; the maximum value for each is 110. They decrease over time, and if either drops below 50, the agent is given a negative reward on every step it stays hungry or thirsty.

If the agent stays in collision with a food object (as seen in the pictures above) or with a lake (a water source), it is given a positive reward. However, when the hunger or thirst meter reaches its maximum value and the agent keeps touching the food/water source, it is given a negative reward to prevent overeating.
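
In code, the scheme looks roughly like this (a simplified sketch; constants, tags, and decay rates are illustrative, and the lake works the same way with a different tag):

Code (CSharp):
using Unity.MLAgents;
using UnityEngine;

public class RabbitAgent : Agent
{
    const float MaxMeter = 110f;
    const float LowThreshold = 50f;

    float hunger = MaxMeter;
    float thirst = MaxMeter;

    void FixedUpdate()
    {
        // Meters decay over time.
        hunger -= 0.05f;
        thirst -= 0.05f;

        // Per-step punishment while hungry or thirsty.
        if (hunger < LowThreshold || thirst < LowThreshold)
            AddReward(-0.005f);
    }

    void OnCollisionStay(Collision collision)
    {
        if (collision.gameObject.CompareTag("Food"))
        {
            if (hunger < MaxMeter)
            {
                hunger = Mathf.Min(hunger + 1f, MaxMeter);
                AddReward(0.005f);  // eating while not full is rewarded
            }
            else
            {
                AddReward(-0.005f); // overeating at max hunger is punished
            }
        }
    }
}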

The wall-hugging behavior starts to appear around step 40M. I'm now at step 78M and the situation has slightly improved (some of the agents have started doing something other than hugging the wall), but overall the agents are not acting as intended. They either hug the walls, hug the lakes, or chase food endlessly despite being full (and being punished for it). I don't quite get why they can't make the connection between eating and drinking and instead stick to only one at a time. It's been more than 48 hours of training now, and I think that's too long for a task like this.

[Image: upload_2023-5-30_19-28-53.png]

I use continuous actions to control the agent's direction (X and Y as floats). Below is my config:

Code (YAML):
hyperparameters:
  batch_size: 1024
  buffer_size: 10240
  learning_rate: 0.0003
  beta: 0.01
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 6
  learning_rate_schedule: linear
network_settings:
  normalize: false
  hidden_units: 512
  num_layers: 2
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
time_horizon: 128
max_steps: 5.0e8
trainer_type: ppo
     
    Last edited: May 30, 2023
  4. GamerLordMat

    GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
Try setting normalize to true and train again; leaving it at false has often caused problems for me.
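
That is, in your trainer config:

Code (YAML):
network_settings:
  normalize: true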
     
  5. mrshinx

    mrshinx

    Joined:
    Dec 6, 2018
    Posts:
    3
Thanks for the suggestion, but I already normalize the observation space myself in the code.

So after adjusting the reward values I have gotten better results:
[Image: upload_2023-6-1_21-44-34.png]

It now takes the agents ~7 hours before they start picking up the task. The problem was that overeating yielded more negative reward than touching the wall. This made the agents conclude that touching food is bad (at least worse than touching a wall), so they tried to avoid all kinds of food and water. After making overeating and touching the wall yield the same negative reward, performance is better, as seen above.
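
In other words, something like this (values are illustrative; the point is that the two penalties are now equal):

Code (CSharp):
// Before: overeating was punished harder than wall contact,
// so food looked more "dangerous" than walls did.
// const float OvereatPenalty = -0.02f;
// const float WallPenalty    = -0.01f;

// After: equal penalties, so food is no longer worse than walls.
const float OvereatPenalty = -0.01f;
const float WallPenalty    = -0.01f;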

Edit: Another change I made was to reduce the number of food objects around the agents. It seems that when the agents happen to overeat, they can't easily get away from the food, since so many objects surround them at close range that the agent's raycasts are blocked. For this reason the agents tried to move to a clear spot to get away from the "dangerous" food.
     
    Last edited: Jun 1, 2023
  6. GamerLordMat

    GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
Great that you figured it out, at least somewhat!
     
  7. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Interesting, glad you solved it.

You may see faster training by reducing the negative rewards even more; the intuition I use is that the good behavior signals need to drown out the bad ones in early training, when the policy is still largely random. I find this also balances exploration vs. exploitation a little better in later training.
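
As a rough illustration (numbers made up, only the ratio matters):

Code (CSharp):
// Keep penalties roughly an order of magnitude smaller than the
// food/water reward, so a still-mostly-random early policy that
// stumbles into food sees a clearly positive signal to climb
// instead of learning avoidance first.
const float FoodReward  = +0.01f;  // per step touching food while hungry
const float WallPenalty = -0.001f; // per step touching a wall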