Problem with training self-driving car

Discussion in 'ML-Agents' started by oleg_v, Jul 11, 2020.

  1. oleg_v


    Nov 10, 2017
    Hi All!

    I have the following environment:
    • Self-driving car (gas, brake, steering)
    • Path of waypoints
    • Static obstacles on path

    The initial goal of training was to drive through a looped path at max speed. 2-3M steps led to more or less realistic behaviour.

    The current goal is to avoid obstacles.

    Observations (Vector)
    • gas/brake
    • steering angle
    • several ahead-point parameters to predict turns (curvature)
    • several raycasts against static obstacles
    Total: vector of 23 floats

    Actions (Continuous)
    • gas/brake
    • steering
    Total: 2

    Reward system
    To calculate the reward I used the following params:
    • directionError: [-1; 1], mapped from the unsigned angle [180; 0] between the car's "forward" and the path direction
    • gasBrakeBonus = speedIsLow ? gas - brake : 0f (a bit more complex in practice, to make it smooth)
    • outOfPathPenalty = distanceFromPath > 2 ? (distanceFromPath - 2) / maxDistanceFromPath : 0f (also a bit more complex, to make it smooth)
    Reward/penalty are split as follows:
    • Reward on each step = speed * directionError + gasBrakeBonus - outOfPathPenalty (divided by some factor so that each step's reward stays small and critical outs or the finish remain more meaningful)
    • Penalty on low speed during N seconds = -1, end of episode
    • Penalty on critically wrong direction (>90 deg) = -1, end of episode
    • Penalty on critical out of path = -1, end of episode (when distanceFromPath > maxDistanceFromPath)
    • Penalty on collision with obstacle = -1, end of episode
    • Checkpoint reward (in between of obstacles) = 0.1f
    • Reward on finish of path = 1f, end of episode
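    For concreteness, the per-step reward described above can be sketched in Python (a minimal sketch under the assumptions above; the function name, argument names, and the scale factor are mine, not the actual project code):

    ```python
    def step_reward(speed, direction_error, gas, brake,
                    distance_from_path, max_distance_from_path,
                    speed_is_low, scale=0.001):
        """Hypothetical per-step reward, following the description above."""
        # directionError is assumed pre-mapped from [180; 0] degrees onto [-1; 1]
        gas_brake_bonus = (gas - brake) if speed_is_low else 0.0
        # smooth penalty only once the car is more than 2 units off the path
        if distance_from_path > 2.0:
            out_of_path_penalty = (distance_from_path - 2.0) / max_distance_from_path
        else:
            out_of_path_penalty = 0.0
        return (speed * direction_error + gas_brake_bonus - out_of_path_penalty) * scale
    ```

    For example, a car perfectly aligned with the path at speed 10 and on the path would get 10 * 1.0 * scale per step, while a stationary car pressing the gas would collect only the small low-speed bonus.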

    Rewards are aligned with the following assumptions:
    1. Passing the track fast, without missed turns and without collisions with obstacles - this is the goal, so the reward must be maximal (2.3 for 310 steps)
    2. Passing the track slowly to the finish (2.3 for 630 steps)
    3. Driving fast and going out (or colliding) (-0.01 for 340 steps)
    4. Driving slowly and going out (or colliding) (-0.69 for 301 steps, -0.04 for 431 steps)
    5. Standing still on the start line (-0.99 for 252 steps)
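    Plain arithmetic on the numbers quoted above shows the per-step ordering these episodes produce (just an illustration of the shaping, using the figures from the list):

    ```python
    # Average reward per step for each scenario listed above (total, episode length).
    scenarios = {
        "fast finish":    (2.3, 310),
        "slow finish":    (2.3, 630),
        "fast crash":     (-0.01, 340),
        "slow crash":     (-0.69, 301),
        "standing still": (-0.99, 252),
    }
    per_step = {name: total / steps for name, (total, steps) in scenarios.items()}
    ```

    One thing this makes visible: a fast crash costs almost nothing per step (-0.01 / 340), so under discounting the agent may not strongly prefer a slow finish over a quick crash.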

    So it seems the reward is shaped well towards main goal #1, but training does not converge.

    Could someone please point out what I am doing wrong, or provide any suggestions?

    I tried different configs and different reward systems, from totally sparse to much more supervised, but 5-8M steps do not lead to the goal; the car collides with the 1st obstacle.
    One note about training a simpler case - just driving forward at minimal speed. It takes about 100-300k steps to learn, and it can be faster with GAIL (commented out below), but GAIL does not bring success in avoiding obstacles.

    Code (YAML):
    behaviors:
      Vehicle:
        trainer_type: sac
        hyperparameters:
          learning_rate: 0.0003
          learning_rate_schedule: constant
          batch_size: 128 # 128-1024 for SAC with continuous actions
          buffer_size: 100000 # 50k-1m for SAC; should be thousands of times larger than avg episode length
          buffer_init_steps: 0 # 1k-10k to prefill the experience buffer with random actions
          tau: 0.005
          steps_per_update: 10.0 # higher value -> more sample-efficient; roughly = number of agents; high value -> high CPU
          save_replay_buffer: false # dump experience buffer on exiting training and load it on resume
          init_entcoef: 0.5 # 0.5-1.0 for continuous; higher value -> more exploration at the beginning -> faster training
          reward_signal_steps_per_update: 10.0 # in general = steps_per_update
        network_settings:
          normalize: false
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
    #      memory:
    #        sequence_length: 32
    #        memory_size: 128
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
    #      gail:
    #        strength: 0.1
    #        gamma: 0.99
    #        encoding_size: 128
    #        demo_path: d:/UnityProjects/RaceExperiment/Assets/MLDemos/
    #        learning_rate: 0.0003
    #        use_actions: false
    #        use_vail: false
        keep_checkpoints: 5
        max_steps: 12000000
        time_horizon: 128 # low value -> less variance, more bias; can be better for high-frequency rewards
        summary_freq: 3000
        threaded: true
    #    behavioral_cloning:
    #      demo_path: d:/UnityProjects/RaceExperiment/Assets/MLDemos/DemoPath104.demo
    #      steps: 0
    #      strength: 1.0
    #      samples_per_update: 0

    Attached Files:

  2. mbaske


    Dec 31, 2017
    The observations look ok to me. Waypoints are in the agent's local space, right?
    You have quite a few rewards and penalties though. Each one introduces a potential risk of the agent exploiting some design flaw you might have overlooked. Many rewards also make the cumulative reward graph harder to interpret. I think you can simplify/combine most of the rewards you've listed by just setting a single one proportional to the dot product of the car's velocity and the path direction. Maybe try constraining the car's range of motion by placing barriers at the sides of the road and treat them like obstacles, so they can be detected by the raycasts.
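    The single shaped reward suggested here could look roughly like this (a Python sketch for illustration; in Unity C# this would be a Vector3.Dot on the rigidbody's velocity, and all names here are assumptions):

    ```python
    def velocity_reward(velocity, path_direction, scale=0.001):
        """Single shaped reward: projection of the car's velocity onto the
        (unit-length) path direction. Positive when moving along the path,
        negative when moving against it, near zero when stalled or sideways."""
        dot = sum(v * d for v, d in zip(velocity, path_direction))
        return dot * scale
    ```

    This one term covers speed, heading, and the standing-still case at once, which is why it can replace several of the hand-tuned components above.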
  3. oleg_v


    Nov 10, 2017
    Thank you for the answer!
    Waypoints are passed to observations simply as consecutive angles (floats) between the car's direction and the direction at points ahead.
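    Such angle observations might be computed like this (a hypothetical Python sketch in the 2D ground plane; the actual project code and sign convention may differ):

    ```python
    import math

    def ahead_angle_observations(car_pos, car_forward, waypoints):
        """Signed angle (normalized to [-1, 1]) between the car's forward vector
        and the direction to each ahead waypoint, in the 2D ground plane (x, z)."""
        obs = []
        for wx, wz in waypoints:
            dx, dz = wx - car_pos[0], wz - car_pos[1]
            # atan2 of the 2D cross and dot products gives the signed angle
            angle = math.atan2(car_forward[0] * dz - car_forward[1] * dx,
                               car_forward[0] * dx + car_forward[1] * dz)
            obs.append(angle / math.pi)  # normalize to [-1, 1]
        return obs
    ```

    A waypoint straight ahead yields 0, while one 90 degrees to the side yields +/-0.5, so the values are already in a network-friendly range.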

    What do you mean by "quite a few rewards and penalties"? Is it about "Reward on each step = speed * directionError + gasBrakeBonus - outOfPathPenalty" or about the overall rewards?
    All other rewards except "checkpoints" lead to the end of the episode (a marker for learning to stop the path to negative experience).
    I use "open" roads, without barriers, and that's why I need to force the agent to move along the path with rewards and penalties based on distance from the path. There is an observation + a small smooth penalty when the distance increases + a total large (-1) penalty for completely-out-of-path cases. I'm therefore using raycasting for obstacles and the "path" reward for the track. They seem like completely orthogonal goals, with separate observations for track and obstacles.

    Is there any way to learn independent experiences separately and "merge" them together later?
    I'm afraid of network degradation if I start by learning "test" tracks (gas/brake/steering with angle prediction) and then continue with an easy track plus obstacles.
  4. andrewcoh_unity


    Unity Technologies

    Sep 5, 2019
    I think @mbaske is giving good advice here. Your reward function is complicated, though it may correctly encourage the behavior you expect with some tuning. My advice would be to simplify the reward function to ensure that you can train properly in a setting that's more human-understandable, and then begin to shape it as you desire changes in behavior. You may find that for some behaviors, no shaping is required!
  5. oleg_v


    Nov 10, 2017
    I simplified the step reward function to
    distance * scaleFactor
    and it took about 400k steps to teach the car to at least move forward a bit.
    I also tried the following:
    • Changed observations to a "vector" form: fewer raycasts + 5 relative points + 5 relative directions ahead on the path
    • Changed observations back to "distance" from the path + 5 relative directions (currently 28 float observations)
    • Changed SetReward to AddReward, with step
      reward = (distance + gasBrakeLowSpeed) * scaleFactor
      where
      gasBrakeLowSpeed = speed < minSpeed ? (isBraking ? -1 : isGas ? 1 : 0) : 0
      (a linear version, of course).
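    A "linear" version of that low-speed term might look like the following (an assumed sketch, since the exact smoothing isn't shown; the bonus fades out as speed approaches minSpeed):

    ```python
    def gas_brake_low_speed(speed, min_speed, gas, brake):
        """Encourage throttle (and discourage braking) only below min_speed,
        fading the term out linearly as speed approaches min_speed."""
        if speed >= min_speed:
            return 0.0
        weight = 1.0 - speed / min_speed  # 1 at standstill, 0 at min_speed
        return (gas - brake) * weight
    ```

    At a standstill full throttle earns the full bonus, while braking at low speed is penalized; above min_speed the term vanishes so it cannot be exploited at cruising speed.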
    In general there is no convergence: the car stands on the start line, drives out of the path (multiple times in the same directions), or crashes into the 1st obstacle (0.5-1.5M steps). Very likely a local extremum, but how do I find it?

    And again, I added all of the "critical" penalties:
    • low speed for 10 seconds = -1, EndEpisode
    • out of path: -0.1f, EndEpisode
    • wrong direction: -0.2f, EndEpisode
    • crash: -0.2f, EndEpisode
    I don't know how I can simplify the reward any further.

    On the other side:
    • GAIL is required (the demos contain about 50-70 episodes, successful or not); the car can't even start without it (at least with the provided SAC config)
    • gail/strength and learning_rate must be lower in order for the GAIL loss to converge
    • The decision period must be tuned; I had "1" and it took too much time to shift even a bit forward
    • A smaller buffer_size makes the car more alive instead of repeating wrong actions
    PS. One run found a bug in the code :) On each episode reset the car was placed at the "0" point of the track, but because of real physics it ended up with a +/- delta. In the "-" cases the reward was larger, and it converged to the best strategy: standing on the start line :)))))