GAIL vs Behavioral Cloning, what's the difference?

Discussion in 'ML-Agents' started by mbaske, Aug 3, 2020.

  1. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    I couldn't really find a detailed explanation in the docs. Some of the imitation config files (https://github.com/Unity-Technologies/ml-agents/tree/master/config/imitation) like Crawler include both, others like Pushblock just the GAIL reward signal.
    How exactly do GAIL and behavioral cloning differ? When do I use which?
    For my current project, I'd like my agent to start training with recorded demo data exclusively, and then gradually shift to training with extrinsic rewards.
    Thanks!
     
  2. YunaoShen

    YunaoShen

    Joined:
    Sep 9, 2019
    Posts:
    10
    There was once a doc about GAIL and BC, but I can't find it in the recent release. Here's the link: https://github.com/Unity-Technologies/ml-agents/blob/0.15.0/docs/Training-Imitation-Learning.md
    BC will train your agent to mimic the demos, so you need a lot of demos to make it work.
    GAIL is more flexible in some ways. It trains another neural network to evaluate how closely the agent's behavior matches the demos. GAIL can work well even when there is only a limited number of demos. Besides, GAIL works well together with an extrinsic reward, while BC doesn't seem to do as well.
    Since you want to shift to extrinsic rewards as training goes on, I think GAIL may be the better approach here.
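    For reference, a GAIL-only reward signal in the trainer config (like the Pushblock imitation example mentioned above) looks roughly like this. This is a minimal sketch: exact key names and layout depend on your ML-Agents release, so compare against the imitation example configs, and the demo path is just a placeholder.

        reward_signals:
          gail:
            strength: 1.0
            gamma: 0.99
            demo_path: Demos/ExpertDemo.demo   # placeholder path to your recorded demonstration file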
     
    Mehrdad995, Hsgngr and mbaske like this.
  3. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
  4. celion_unity

    celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    The section in the docs that @YunaoShen referred to is now here: https://github.com/Unity-Technologi...docs/ML-Agents-Overview.md#imitation-learning

    If your end goal is to improve reinforcement learning training and maximize rewards, then I agree, GAIL+extrinsic reward is probably what you want, with possibly some BC training.

    GAIL without an extrinsic reward should produce behavior that "acts like" the demonstration data, but this won't necessarily maximize the environment reward.
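    As a rough sketch of the "GAIL + extrinsic reward, with possibly some BC" setup (key names approximate the ML-Agents config schema; the strengths, steps value, and demo path are placeholders to tune for your own project):

        reward_signals:
          extrinsic:
            strength: 1.0
            gamma: 0.99
          gail:
            strength: 0.1          # keep this small if the environment reward is what you ultimately care about
            gamma: 0.99
            demo_path: Demos/ExpertDemo.demo
        behavioral_cloning:
          demo_path: Demos/ExpertDemo.demo
          strength: 0.5
          steps: 150000            # BC influence is annealed away over roughly this many steps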
     
    mbaske likes this.
  5. YunaoShen

    YunaoShen

    Joined:
    Sep 9, 2019
    Posts:
    10
    No problem. My last project also used GAIL, so I still remember some of it.
     
  6. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Great, thanks! Must have missed that somehow...
    Looks like once I've chosen my reward signals, I need to stick to them. If I start training with GAIL enabled, then I can't pause later, comment it out in the config file and resume. Trying that gives me an error saying the configurations don't match.
    Right now, I only need GAIL for the initial training phase. I'm forcing my agent to mimic the demo without relying on extrinsic rewards. After a while, I'm reversing this: once the agent has learned the basic demo skills, it continues learning new ones using extrinsic rewards only. My current workaround here is to just swap the strength values at this point, changing GAIL 1 / extrinsic 0 to GAIL 0 / extrinsic 1.
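    In config terms, the swap amounts to something like this (only the strength values change between phases; the rest of the reward signal settings stay as they are, and the demo path is just a placeholder):

        # Phase 1: imitate the demo, ignore environment rewards
        reward_signals:
          extrinsic:
            strength: 0.0
            gamma: 0.99
          gail:
            strength: 1.0
            gamma: 0.99
            demo_path: Demos/ExpertDemo.demo

        # Phase 2: environment rewards only (GAIL still runs, but contributes nothing)
        reward_signals:
          extrinsic:
            strength: 1.0
            gamma: 0.99
          gail:
            strength: 0.0
            gamma: 0.99
            demo_path: Demos/ExpertDemo.demo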
    While this seems to work fine, it doesn't stop GAIL from still doing its thing: I'm seeing "GAIL Expert Estimate" and "GAIL Policy Estimate" progressing in Tensorboard. "GAIL Reward" has dropped to zero though, as expected. I wonder if it would be practical to suspend any Python-side logic whose associated reward signal strength is set to zero? My naive assumption is that this might free up resources and perhaps accelerate training.
     
  7. celion_unity

    celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    So you're training for a while, then stopping and running with --initialize-from and the modified weights?

    I think you're right that we might be able to save a few cycles evaluating the reward signal when the weight is 0 (somewhere around here).

    I also think the "right" way to solve this is to offer a GAIL pretraining option, similar to BC. I'll log a feature request for this, but can't promise that we'll get to it anytime soon (especially since all this code is getting ported to pytorch as we speak)...
     
    mbaske likes this.
  8. celion_unity

    celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    Internal tracked ID for GAIL pretraining is MLA-1249
     
  9. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    I just resume training with the same model. As far as I can tell, --initialize-from would still require the same configuration anyway.
    OK thanks, I'll take a look.
     
  10. mbaske

    mbaske

    Joined:
    Dec 31, 2017
    Posts:
    473
    Correction: Apparently I can disable GAIL after the initial training phase, if I remove its reward signal from my trainer_config.yaml as well as the configuration.yaml file in the results folder. This way, I'm not getting a configuration mismatch error and all Tensorboard graphs for GAIL just stop at this point. The model then resumes training with extrinsic rewards only as I was hoping it would.
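    Concretely, that means deleting the gail block from the reward_signals section in both files, so that only the extrinsic signal is left (sketch):

        reward_signals:
          extrinsic:
            strength: 1.0
            gamma: 0.99
        # the gail: block is removed from both trainer_config.yaml
        # and the configuration.yaml in the results folder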
     
    Neohun likes this.
  11. jokerHHH

    jokerHHH

    Joined:
    May 11, 2021
    Posts:
    9
    @celion_unity
    Hi, I used BC only to pretrain my agent, and after 500,000 steps I got fairly good results. Then I switched to GAIL only, turning off BC and the extrinsic reward: I just commented out the BC section and set the strength of the extrinsic reward to 0. After a while, the mean reward went down; I tried using PPO only and got the same result. I thought maybe the BC training steps weren't enough, so I started over and pretrained the model for 1,000,000 steps, but the mean reward went down again. Why did this happen? Perhaps my demonstrations aren't good enough? I have already collected about 470 episodes from my game. My game is like the example Match3, but more complex.
     
  12. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hey jokerHHH, can you post a picture of your reward curve? It is possible the agent needs to unlearn a bit before going back up, especially if BC strength was high in the beginning.