
Overcoming inconsistent learning with indirect rewards during first phase (self driving car)

Discussion in 'ML-Agents' started by sanguinar, May 5, 2020.

  1. sanguinar


    Mar 13, 2020
    Hi all

    While training a self-driving car on a racetrack (ultimately successfully), I encountered highly irregular learning curves when running the same training environment multiple times: sometimes the car would race the course successfully after 1M cycles, sometimes it would still be stuck on the start line after 3M, trying to optimize a negative reward value by going 'slowly backwards' and 'minimizing the damage' (a sub-optimum).
    (Attached images: 2signals_1track_bad.JPG, 2signals_1track_good.JPG)

    The inherent problem was that my main reward is an indirect one: because I wanted a fast & furious agent, I coupled AddReward to the velocity of the rigidbody (positive and negative magnitude) relative to car.transform.forward. That allows sliding, but it means the car has to move forward before it can discover that the reward goes up. And here's the problem....

    During the high-entropy first phase of learning, throttle and steering get yanked all around, the wheel colliders spin/slip in place, and it's actually a matter of luck whether the overall movement of the rigidbody is consistently 'forward'. Once it starts to slide backwards, I noticed the action bias going clearly into the negatives. But if the agent 'gets it' and learns that 'pedal to the metal' is the right strategy, the rest is a piece of cake: look at the rays and stay on track. That was usually accomplished after about 500k decisions.
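    For reference, a velocity-coupled reward like the one described above could be sketched roughly as follows. This is an assumption about the shape of the code, not the original implementation; `rewardScale` and the field names are made up for illustration:

    ```csharp
    using UnityEngine;
    using Unity.MLAgents;

    public class CarAgent : Agent
    {
        [SerializeField] Rigidbody body;
        [SerializeField] float rewardScale = 0.001f; // hypothetical scaling factor

        void FixedUpdate()
        {
            // Project the rigidbody's velocity onto the car's forward axis:
            // positive when moving forward, negative when sliding backwards.
            float forwardSpeed = Vector3.Dot(body.velocity, transform.forward);
            AddReward(forwardSpeed * rewardScale);
        }
    }
    ```

    Because the reward is tied to the rigidbody's actual motion rather than to the throttle action, the agent only sees positive feedback once physics has already carried it forward, which is exactly the indirection discussed above.
    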

    So what are the options for similar systems to get through that first phase properly, without changing the model, adding curiosity, etc.?
    - prohibiting/blocking unwanted movement completely (sort of unsexy; rewards should handle that)
    - coupling a reward to the throttle action directly (works 100%, but is really unsexy if the car ends up on a stone, spinning its wheels for reward without moving)
    - terminating the episode if the car's velocity is too far backwards (helps, but not all the way)
    - using a more complex effect chain than agent > torque > wheels > movement (difficult, because the reward might become even more indirect)
    - reducing unwanted movement by dividing the effect by some factor (doesn't change the inherent problem)
    - reducing slip to get a more direct link between action and signal (that in fact helps to a certain degree, but depending on the setup it might become a problem later)
    - using a curriculum to guide any of these parameters/ideas across the first phase.
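    As an illustration of the episode-termination option from the list above, a minimal sketch (the threshold, field names, and placement are assumptions; `EndEpisode()` is the ML-Agents 1.0 API name, older releases used `Done()`):

    ```csharp
    // Hypothetical early-termination check, e.g. called from FixedUpdate():
    // if the car slides backwards faster than some threshold, cut the
    // episode short instead of letting it settle into the sub-optimum.
    float forwardSpeed = Vector3.Dot(body.velocity, transform.forward);
    if (forwardSpeed < -maxReverseSpeed)
    {
        AddReward(-1f);  // optional terminal penalty
        EndEpisode();
    }
    ```
    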

    Other options we explored to optimize the learning curve in general:
    - seeding a range of values (throttle/steer) to multiple cars in parallel:
    sort of successful, because in my case high torque gets the rigidbody moving, the probability of ending in the sub-optimum is reduced, and all 'speed' variants are explored in parallel.
    - assigning variants of values (throttle/steer) for every episode:
    pretty successful, because direct effects become very visible.
    - having different track shapes for multiple cars in parallel:
    no effect in the initial phase, but it massively helps with learning how to steer once the agent knows how to drive forward.


    Any comments and ideas are appreciated, and KUDOS to the ml-agents team!
  2. mbaske


    Dec 31, 2017
    That sounds like you're applying continuous rather than discrete actions, and I can see how that could be a problem with high entropy. Maybe try a more indirect approach with a normalized 2D vector controlling the forces. Have your agent update that vector in small discrete steps. This should prevent drastic direction changes and give you more realistic behaviour overall. Let's say you have two action branches, one for steering and one for acceleration/braking. Then you could do something like this:
    Code (CSharp):
    public override void OnActionReceived(float[] actions)
    {
        float dt = Time.fixedDeltaTime;
        ctrlVector.x += steeringIncrement * dt * Mathf.RoundToInt(actions[0] - 1);
        ctrlVector.x = Mathf.Clamp(ctrlVector.x, -1f, 1f);
        ctrlVector.y += accelerationIncrement * dt * Mathf.RoundToInt(actions[1] - 1);
        ctrlVector.y = Mathf.Clamp(ctrlVector.y, -1f, 1f);
    }
    where the action values are 0 = left/back, 1 = neutral/neutral, 2 = right/forward.
    With this setup, you could also try adding a small bias to ctrlVector.y, so there's a higher chance of initial forward movement.
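    One way to read that bias suggestion as code, purely as a sketch (the starting value and the `OnEpisodeBegin` placement are assumptions, not part of the original post):

    ```csharp
    public override void OnEpisodeBegin()
    {
        // Start each episode with a slight forward throttle bias instead of
        // zero, so random early actions are more likely to produce net
        // forward motion of the rigidbody.
        ctrlVector = new Vector2(0f, 0.25f);
    }
    ```
    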
  3. sanguinar


    Mar 13, 2020
    You are absolutely right. But I'm running a value-in > effect-out chain with the idea of allowing direct changes in short amounts of time (e.g. when driving at high speeds). And yes, continuous space for the two control inputs, speed and steer.

    What you are describing would fit under my point "using a more complex effect chain". That can include some kind of damping of the agent's continuous decisions, or switching to discrete actions as you're proposing. That would then compare to increasing, decreasing, and not changing velocity (in small steps), if I'm using the right analogy.

    I was under the assumption that ml-agents likes a clear input-output chain for those kinds of problems: no extra hidden layers and a minimal number of observation vectors. In my case, I use two ray perception sensors that hit track/off-track objects, plus the speed magnitude and the resulting steer value (different cars might have different max steers and different max torques). Once the kid is rolling, that works perfectly....

    Btw: I'm using discrete decisions successfully in another project, but I was pretty sure that for this case, continuous would be the right approach.
  4. ervteng_unity


    Unity Technologies

    Dec 6, 2018
    One approach that works well in this type of scenario is to use curriculum learning (see the curriculum learning section of the ML-Agents docs). You'd give the agent a reward for throttle in the beginning stages of learning using the curriculum, which is then reduced during training until it's 0. By then, hopefully, the agent has learned to step on the pedal to go forward, but won't be maximizing that reward anymore.
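    A sketch of how such an annealed throttle reward could be wired up via an environment parameter. The parameter name and scale are assumptions, and the annealing schedule itself would live in the trainer's curriculum config, not in the agent code; `Academy.Instance.EnvironmentParameters.GetWithDefault` is the ML-Agents 1.0 API for reading curriculum-controlled values:

    ```csharp
    // Read a curriculum-controlled scale; the trainer config would anneal
    // "throttle_reward_scale" from some positive value down to 0 over training.
    float scale = Academy.Instance.EnvironmentParameters
        .GetWithDefault("throttle_reward_scale", 0f);

    // Reward pressing the pedal, weighted by the (shrinking) curriculum scale.
    AddReward(scale * Mathf.Max(0f, throttleAction) * Time.fixedDeltaTime);
    ```

    Once the schedule has driven the scale to 0, this term vanishes and only the original velocity-based reward remains.
    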
  5. sanguinar


    Mar 13, 2020
    Correct, that's one of the options I listed. Limiting the hard way or with a curriculum still seems the best option. The curriculum has the advantage that the limitation can be reduced gradually or removed entirely.
    I was trying to document some of the options and open the discussion. It might be helpful for someone who is new to agents and wondering why the agent behaves the exact opposite of what the rewards are trying to make it do.