
Question Agents keep doing one thing that they are punished for

Discussion in 'ML-Agents' started by mrshinx, May 30, 2023.

  1. mrshinx

    mrshinx

    Joined:
    Dec 6, 2018
    Posts:
    3
    I'm trying to train my rabbit agents to collect food in a closed environment. There are four walls around the area that the agents should not touch. If they stay in contact with the wall, they are punished every step.
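
For reference, the penalty is applied roughly like this (a simplified sketch; the class name, tag, and reward value are illustrative, not my exact code):

Code (CSharp):
using Unity.MLAgents;
using UnityEngine;

public class RabbitAgent : Agent
{
    // Punish the agent on every physics step it stays in contact with a wall.
    private void OnCollisionStay(Collision collision)
    {
        if (collision.gameObject.CompareTag("Wall"))
        {
            AddReward(-0.01f); // illustrative value
        }
    }
}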

After some time, instead of collecting food (which is available everywhere), they all start to hug the wall despite being heavily punished for doing so. I just can't understand this behavior; it seems like they are trying to pursue the most negative reward. As seen in the pictures, the rabbits are pressed against the top-left wall corner.

[Images: upload_2023-5-30_9-21-8.png, upload_2023-5-30_9-22-12.png]

And yes, they have a Ray Perception Sensor to detect the walls:

[Image: upload_2023-5-30_9-23-21.png]

I'm at step 62M (about 1.7 days of training) and they are accumulating a lot of negative reward because of this:

[Image: upload_2023-5-30_9-24-7.png]

    Any help is appreciated!
     


  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Can you provide an overview of other positive/negative rewards your environment is assigning?

    This often happens in cases where the policy cannot find any positive reward signal to optimize and tries to mitigate negative reward accumulation by ending the episode as fast as possible (suicide).
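
A quick way to see why ending the episode is "optimal" here: with a constant per-step penalty $c > 0$ and discount $\gamma < 1$, never terminating returns

$$G_\infty = -\sum_{t=0}^{\infty} \gamma^t c = -\frac{c}{1-\gamma},$$

while terminating after $k$ steps returns

$$G_k = -\sum_{t=0}^{k-1} \gamma^t c = -\frac{c\,(1-\gamma^k)}{1-\gamma} > G_\infty,$$

so if the policy finds no positive signal, the shortest episode always wins.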

I've often seen border/wall seeking when the action space collapses and the agent chooses the same action continuously. You can tell whether this is happening by looking at entropy in TensorBoard (the Policy/Entropy curve): it will crash to 0, and the policy will collapse and not recover.
     
  3. mrshinx

    mrshinx

    Joined:
    Dec 6, 2018
    Posts:
    3
The agents are given observations of their hunger and thirst meters; the maximum value for each is 110. They decrease over time, and if either drops below 50, the agent is given a negative reward on every step it stays hungry or thirsty.

If the agent stays in collision with a food object (as seen in the pictures above) or with a lake (a water source), it is given a positive reward. However, when the hunger or thirst meter reaches its maximum value and the agent keeps touching the food/water source, it is given a negative reward to prevent overeating.
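
In code, the scheme looks roughly like this (a simplified sketch; constants, tags, and decay rates are illustrative, and the lake works the same way with a different tag):

Code (CSharp):
using Unity.MLAgents;
using UnityEngine;

public class RabbitAgent : Agent
{
    const float MaxMeter = 110f;
    const float LowThreshold = 50f;

    float hunger = MaxMeter;
    float thirst = MaxMeter;

    void FixedUpdate()
    {
        // Meters decay over time.
        hunger -= 0.05f;
        thirst -= 0.05f;

        // Per-step punishment while hungry or thirsty.
        if (hunger < LowThreshold || thirst < LowThreshold)
            AddReward(-0.005f);
    }

    void OnCollisionStay(Collision collision)
    {
        if (collision.gameObject.CompareTag("Food"))
        {
            if (hunger < MaxMeter)
            {
                hunger = Mathf.Min(hunger + 1f, MaxMeter);
                AddReward(0.005f);  // eating while not full is rewarded
            }
            else
            {
                AddReward(-0.005f); // overeating at max hunger is punished
            }
        }
    }
}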

The wall-hugging behavior starts to appear around step 40M. I'm now at step 78M and the situation has slightly improved (some of the agents have started doing something other than hugging the wall), but overall the agents are not acting as intended. They either hug the walls, hug the lakes, or chase food endlessly despite being full (and being punished for it). I don't quite get why they can't make the connection between eating and drinking and instead stick to only one at a time. It's been more than 48 hours of training now, and I think that's too long for a task like this.

[Image: upload_2023-5-30_19-28-53.png]

I use continuous actions to control the agent's direction (X and Y as floats). Below is my config:

Code (YAML):
hyperparameters:
  batch_size: 1024
  buffer_size: 10240
  learning_rate: 0.0003
  beta: 0.01
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 6
  learning_rate_schedule: linear
network_settings:
  normalize: false
  hidden_units: 512
  num_layers: 2
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
time_horizon: 128
max_steps: 5.0e8
trainer_type: ppo
     
    Last edited: May 30, 2023
  4. GamerLordMat

    GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
Try setting normalize to true and train again; leaving it at false has often caused problems for me.
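
That is, in your trainer config:

Code (YAML):
network_settings:
  normalize: true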
     
  5. mrshinx

    mrshinx

    Joined:
    Dec 6, 2018
    Posts:
    3
Thanks for the suggestion, but I already normalize the observation space myself in the code.

So after adjusting the reward values I have gotten better results:
[Image: upload_2023-6-1_21-44-34.png]

It now takes the agents ~7 hours before they start picking up the task. The problem was that overeating yielded more negative reward than touching the wall. This made the agents conclude that touching food is bad (at least worse than touching a wall), so they tried to avoid all kinds of food and water. After making overeating and touching the wall yield the same negative reward, performance is better, as seen above.
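
In other words, something like this (values are illustrative; the point is that the two penalties are now equal):

Code (CSharp):
// Before: overeating was punished harder than wall contact,
// so food looked more "dangerous" than walls did.
// const float OvereatPenalty = -0.02f;
// const float WallPenalty    = -0.01f;

// After: equal penalties, so food is no longer worse than walls.
const float OvereatPenalty = -0.01f;
const float WallPenalty    = -0.01f;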

Edit: Another change I made was to reduce the number of food objects around the agents. It seems that when the agents happen to overeat, they can't easily get away from the food, since so many objects surround them at close range that the agent's raycasts are blocked. For this reason the agents tried to move to a clear spot to get away from the "dangerous" food.
     
    Last edited: Jun 1, 2023
  6. GamerLordMat

    GamerLordMat

    Joined:
    Oct 10, 2019
    Posts:
    185
Great that you figured it out, at least somewhat!
     
  7. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Interesting, glad you solved it.

You may see faster training by reducing the negative rewards even more; the intuition I use is that the good behavior signals need to drown out the bad ones in early training, when the policy is still largely random. I find this also balances exploration vs. exploitation a little better in later training.
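
As a rough illustration (numbers made up, only the ratio matters):

Code (CSharp):
// Keep penalties roughly an order of magnitude smaller than the
// food/water reward, so a still-mostly-random early policy that
// stumbles into food sees a clearly positive signal to climb
// instead of learning avoidance first.
const float FoodReward  = +0.01f;  // per step touching food while hungry
const float WallPenalty = -0.001f; // per step touching a wall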