Agent is stuck on simple behavior

Discussion in 'ML-Agents' started by simon_winter, Feb 16, 2021.

  1. simon_winter


    Oct 8, 2018

My agent refuses to develop more sophisticated behavior and is stuck at running around randomly and not driving into walls.

Its task is to clean a room like a Roomba cleaning robot. Every uniquely passed piece of floor is rewarded; passing over the same spot multiple times is punished, just like hitting walls. If the robot has cleaned a certain percentage of the room, or if it keeps hitting the wall or keeps driving over already-cleaned floor for a longer duration, the episode is ended and restarted. I've been trying this for weeks now and cannot find any approach where the robot actually learns some sort of path planning. It avoids walls, but it seems to steer randomly through the room. Avoiding old paths, or even seeking out unknown spots, never evolves, and the mean reward stagnates quickly.

The following screenshots illustrate the problem during training. On the right you can see a minimap, which is given as a visual observation in a 3-channel picture. Red is wall, blue is passable ground, black is unknown, and green is the path where the robot has been. The minimap is centered on the robot and travels along with it. Orientation, position, and some distance sensors are also provided to the robot as observations.

It drives around and then gets caught in a corner, where it starts clumping up, driving the same area over and over. I also tried different resolutions for the minimap and a bigger field of vision, a ton of hyperparameter tweaking, as well as using SAC and PPO. The config file for this training is attached.

The blue pixels on the left are used for evaluation and are not known to the agent. Basically it's a 2D grid: when the robot moves, the cleaning area below it gets rounded into that grid. Once cleaned, a cell turns blue; duplicated cleaning of the same area turns it red. The consequence is that the agent gets rewarded while driving forward every 2-3 FixedUpdates, as it has then moved enough to "clean" the next pixels. It gets no reward or penalty while not cleaning (other than a permanent tiny one, so it keeps moving). When it passes over blue pixels, the same process runs the other way around: every few FixedUpdates newly colored pixels punish it, because it has been there already. So there is a small stepping involved rather than fluent rewards, but I think that shouldn't interfere with its learning.

I'm happy for any ideas/suggestions on what to change so that the agent starts optimizing its pathing to avoid duplicates and maximize ground coverage!

- Should I remove the EndEpisode when it stays in the same area for some minutes, so it is forced to learn that it has to clean the entire room?
- Are there some hyperparameters I don't get? I already tried a time_horizon of 2k with a high gamma.
- Must I provide a rotating minimap with the robot's position drawn in? It should realize it is in the center of that map at all times, right? And rotation is given in degrees, so it could also link that with the map.
- Do visual observations just take forever to learn from? Do I need to keep training for a long duration, despite the mean reward not constantly increasing?


This particular training is bootstrapped from itself: I started a SAC run and let it go for 825k steps, where it already stagnated, and then initialized a new SAC with the same parameters from that first try. Both graphs are here, where *GRAY* is the new try, already stagnating again.

    upload_2021-2-16_18-57-49.png upload_2021-2-16_19-1-31.png

Code (YAML):
behaviors:
  explorer:
    trainer_type: sac
    hyperparameters:
      batch_size: 512
      buffer_size: 20480
      buffer_init_steps: 10000
      init_entcoef: 1
      tau: 0.005
      steps_per_update: 20
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.997
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 1000000
    time_horizon: 256
    summary_freq: 25000
    threaded: true
  2. ruoping_unity


    Unity Technologies

    Jul 10, 2020
    A few things to clarify:
What are the agent's observations? Does it have access to the map? Also, does it contain information about whether it has or hasn't been to a specific place before?
  3. simon_winter


    Oct 8, 2018
The agent's observations are:
- absolute orientation in degrees (0-360°), normalized between 0 and 1
- relative movement since the last FixedUpdate, in the local coordinate system
- 2 long-range sensor arrays at the front, which give a rough distance to objects in a 45° cone divided into left and right hemispheres (distance normalized between 0 and 1)
- 3 raycasts at front, front-left, and front-right, which give precise distances
- a bumper, which alerts the robot that it bumped into something

- the minimap on the right side of the above screenshots, given via a custom sensor observation writer.
It's divided into 3 channels (RGB) where only 0 or 1 is possible:
red is walls
green is the path where the robot has been
blue is ground where the robot can go (no obstacle confirmed)
black is unknown.
The map is centered on the robot's position and scrolls while the robot drives, so the robot is always in the middle but not drawn in explicitly. The map does not rotate; the robot has to use the orientation observation to infer which direction it is heading on the map.


The map is filled on the fly by the robot's sensors. But the agent has nothing to do with the filling; it happens automatically from the information gathered by the raycast sensors.
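One detail in the observation list worth double-checking: mapping 0-360° linearly onto 0-1 puts the same physical heading at both ends of the range, so 359° and 1° look maximally different to the network. A common workaround (sketched here in Python just to show the idea) is to encode the heading as a sin/cos pair, which is continuous across the wrap-around:

```python
import math

def heading_obs(deg):
    """Encode a heading in degrees as a continuous (sin, cos) pair."""
    rad = math.radians(deg)
    return (math.sin(rad), math.cos(rad))

# 359 deg and 1 deg are nearly identical headings; the sin/cos pair keeps
# them close together, while a linear 0-1 encoding puts them at opposite
# ends of the observation range.
```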
  4. simon_winter


    Oct 8, 2018
The reward is given based on reality (the evaluation grid shown above), so there is a layer of imperfection between the minimap and reality. The robot has to learn that it needs to keep its distance from the green line.
  5. ruoping_unity


    Unity Technologies

    Jul 10, 2020
    Thanks, I think I have a better understanding now.

The problem here is that this task requires a huge memory for the agent to "remember" the map, since it's only given a part of it and the map is constantly changing while it moves. It also needs to explore AND remember the whole map to be able to do path planning here, while the exploration part alone is already hard enough, because the agent gets penalties for going to places it has been.
Typically the agent only gets the observations of the current step to decide on the next action, so the observation at each step should contain all the information the agent needs to decide its action for that step, which is not the case here. There's a memory settings section in the config file which gives the agent some sort of memory, so it can make decisions based on a few more steps of past experience. Yet, given the amount of things the agent needs to remember, I doubt that would solve this case.
The easiest thing to do here is still to feed in the whole map and the position of the agent, along with the state (visited or not) of each cell, as observations. The raycasts would not be necessary then, since the map already contains the same information.
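For reference, the memory settings mentioned here live under `network_settings` in the trainer config; a sketch extending the config posted above (the values are illustrative, not a recommendation):

```yaml
behaviors:
  explorer:
    network_settings:
      memory:
        sequence_length: 64   # how many consecutive steps each training sequence spans
        memory_size: 128      # size of the recurrent (LSTM) hidden state
```

This attaches an LSTM to the policy so the agent can carry some state between steps; as noted above, it still may not be enough to memorize a whole room.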
  6. simon_winter


    Oct 8, 2018
    Thanks for your help!

The raycasts are used because the map has a crude resolution, so they give 3 more observations for precise wall-evasion maneuvering. At some point I hoped it would skid along the walls to clean there etc., for which the map might not be good enough. Is redundant information a big problem, or just a small performance cost?
I will try giving the entire map with the agent's position, but I guess that'll be a lot of info and thus a long training time.

Do you think having the rotation not in the map but as a separate (more precise) observation decreases learning effectiveness too?

And the map pixels are calculated from real-world coordinate data I get from the sensors. But to fit it into the grid, I simply floor it to whole numbers. Because of this, the pixels are sometimes off by 1 and small mistakes appear; jagged walls may overlap with the driven paths, things like that. Is that a huge problem, or is it OK-ish for the robot to learn from a map with small deviations?
  7. ruoping_unity


    Unity Technologies

    Jul 10, 2020
Extra raycast observations are not a problem; worst case, the agent just ignores them if it doesn't find them useful.
And I think another, rotated map is not necessary, as long as the agent has its current rotation.

The last concern you have can be viewed as some kind of noise between the real map and the given/processed map. How much it affects the training depends on how noisy it is for the agent to make the right decision. If it's just a small part of it and the agent still gets the correct reward most of the time, it should be fine. But if it results in a lot of false alarms, it might affect performance.
  8. simon_winter


    Oct 8, 2018
Thank you for your input, really helpful!
I will adapt my training and post the results once training has progressed.
  9. simon_winter


    Oct 8, 2018
I'm repeatedly running out of memory with a 200x200x3 visual observation and 3 agents.
It says something about a shape of (512, 200, 200, 3), but I don't know where the 512 comes from.
Do stacked observations multiply my needed memory?
Does buffer/batch size somehow multiply with my visual observation size?
  10. ruoping_unity


    Unity Technologies

    Jul 10, 2020
Looks like the 512 comes from your batch size.
During model training, it pulls a batch of data from previously collected data points (including all the observations, actions, rewards, etc.) to update the model. This is the most computationally expensive part, and things like observation size, observation stacking, and batch size all affect how much memory it needs.
Generally speaking, a bigger batch size helps with more stable training, since it doesn't get biased a lot by single data points, while a smaller batch size iterates faster with fewer computational resources.

You'll either need to lower the observation sizes or the batch size, or get more memory.
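As a rough back-of-the-envelope check of why that shape blows up memory (assuming float32 tensors, 4 bytes per value):

```python
# Raw size of one training batch of visual observations with the
# shape from the error message: (batch, height, width, channels).
batch_size = 512                 # from the config above -> the 512 in the shape
height, width, channels = 200, 200, 3

values = batch_size * height * width * channels
mega_bytes = values * 4 / 1e6    # float32 = 4 bytes per value
print(f"{mega_bytes:.0f} MB per batch")
```

And that is only the raw observations for a single batch; stacked observations, activations, and gradients multiply it further, which is why halving the map to 100x100 cuts this particular term by 4x.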
  11. simon_winter


    Oct 8, 2018
Using SAC on my local machine and PPO on a server for a single robot both yielded this clumping-up behavior. I'm now trying curiosity and a higher penalty per step. I also decreased the map size, so I have a 100x100 grid for observations and hopefully 3 robots.

    upload_2021-2-18_9-19-24.png upload_2021-2-18_9-31-33.png upload_2021-2-18_9-33-40.png

  12. simon_winter


    Oct 8, 2018
I tried multiple hyperparameters, but the agent now moves very slowly and cautiously and also rarely moves out of the starting area. It cleans the same space over and over and refuses to get better. Is it maybe something in my reward function?
    Code (CSharp):
public override void OnActionReceived(float[] vectorAction) {
    base.OnActionReceived(vectorAction);
    stepReward = -0.01f;

    float lWheel = GetThrottleFromIndex((int)vectorAction[0]);
    float rWheel = GetThrottleFromIndex((int)vectorAction[1]);

    // cast so an integer pixel count doesn't truncate the percentage to 0
    var totalCleanedPerc = (float)rm.Pathing.TotalCleanedPixels / rm.Pathing.TotalAvailablePixels;

    if (rWheel == 2 || lWheel == 2) {
        stepReward -= 0.5f;
    }
    else {
        totalPixel += (ulong)rm.Pathing.CleanedPixels_Step;
        totalDuplicatedPixel += (ulong)rm.Pathing.DuplicatedPixels_Step;

        statsRecorder.Add("PixelCleaned", totalPixel, StatAggregationMethod.MostRecent);
        statsRecorder.Add("Duplicated PixelCleaned", totalDuplicatedPixel, StatAggregationMethod.MostRecent);
        statsRecorder.Add("Percentage Cleaned", totalCleanedPerc, StatAggregationMethod.MostRecent);

        if (rm.Pathing.CleanedPixels_Step == 0) {
            FailedCleaningStacks++;
        }
        else {
            stepReward += (0.1f * rm.Pathing.CleanedPixels_Step);
        }

        // penalize the step if any already-cleaned pixels were re-cleaned
        if (rm.Pathing.DuplicatedPixels_Step > 0) {
            stepReward = -0.2f;
        }
    }

    if (rm.Bumper.Triggered) {
        stepReward = -0.5f;
    }

    var throttleL = Mathf.Clamp(lWheel, -1, 1);
    var throttleR = Mathf.Clamp(rWheel, -1, 1);

    rm.CarController.Throttle(throttleL, throttleR);

    // Fell off platform
    if (transform.localPosition.y < -10) {
        SetReward(0);
        EndEpisode();
    }

    SetReward(stepReward);

    if (rm.Pathing.TotalAvailablePixels != 0 &&
        ((float)rm.Pathing.TotalCleanedPixels / rm.Pathing.TotalAvailablePixels) > Mathf.Lerp(0.6f, 0.9f, difficulty)) {
        EndEpisode();
    }
    if (debug) {
        Debug.Log("cleaned perc: " + Math.Round(totalCleanedPerc, 2) + " AddReward: " + GetCumulativeReward());
    }
    if (stepReward >= 0.1f) {
        GetComponentInChildren<Renderer>().material.SetColor("_Color", new Color(0, stepReward, 0));
    }
    else if (stepReward <= -0.1f) {
        GetComponentInChildren<Renderer>().material.SetColor("_Color", new Color(-stepReward, 0, 0));
    }
}
  13. ruoping_unity


    Unity Technologies

    Jul 10, 2020
A few things I can tell from the information you've provided:
1. I don't think the problem here can be solved by adding curiosity. Curiosity is useful when the reward is very sparse and there's not enough extrinsic reward to drive your agent's learning, so you use the intrinsic reward to encourage exploration. Here I'd say the reward is not sparse, and introducing an extra curiosity reward can introduce survival bias: since curiosity is usually a positive reward, if the agent gets punished a lot for doing anything else, it will just do nothing and stay alive, collecting the curiosity bonus until time-out.
2. If I'm understanding correctly, your reward is based on overall progress, not the progress in a particular step, which can be misleading to the agent. What the reward means in training is how good the action was, given the observation. So if the agent steps on an unvisited tile and cleans something, it should get a full positive reward for that step, instead of a very small increase measured by overall progress.
3. I'm not very sure on this point, since I don't have the full context of the code, but I have the feeling the agent gets a lot of penalties and the reward is almost always negative. If that's the case, the agent might conclude that doing anything is bad and it is better to just do nothing. That might be why you see it move rarely, or very slowly.
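Point 2 can be sketched as a per-step reward function (Python pseudocode of the idea, not ML-Agents API; the constants are illustrative):

```python
def step_reward(cleaned_px, duplicated_px, bumped, existence_penalty=-0.01):
    """Score only what happened in this step, not overall progress."""
    r = existence_penalty      # small constant cost per step, so idling is never free
    if cleaned_px > 0:
        r += 1.0               # full positive reward for cleaning new ground right now
    if duplicated_px > 0:
        r -= 0.2               # mild penalty for re-cleaning visited ground
    if bumped:
        r -= 0.5               # penalty for hitting a wall
    return r
```

The key property is that a step which cleans new ground is clearly net positive, so "do nothing" is never the best policy, while the reward never depends on how much of the room is already done.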
  14. simon_winter


    Oct 8, 2018
Again, I appreciate the insight you provide. It made curiosity and rewards a lot clearer!

The current implementation is a special case now, but in general my code works like this:
- every step: AddReward(-0.01)
- if pixels were cleaned in this step: reward = 1
- if the cleaned pixels are duplicates: reward -= (0.1/0.5/1)
- if standing still: += 0

Indeed, the cumulative reward is negative in the thousands and slowly climbs until it plateaus at a small positive value like 50.
A good run (cleaning the room with optimal pathing, no overlapping paths) should grant a reward in the thousands or more.

So how can I improve the environment concept-wise, so that the agent develops smart pathing? I have to punish it for duplicates and bumping. My feeling is that if I make the punishment for duplicated pixels too small, it will just zoom around the room and never feel pushed to avoid old paths.