Resolved Curriculum learning not being triggered?

Discussion in 'ML-Agents' started by kt66nfkim, Mar 30, 2021.

  1. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    Hi

    Is it possible to mix curriculum learning with imitation learning? Maybe I'm configuring things wrong, but when I append my curriculum learning config to my imitation learning config, the curriculum part never seems to get triggered.

    This is the config I'm using. Any hints or help would be appreciated.


    behaviors:
      PongBehavior:
        trainer_type: ppo
        hyperparameters:
          batch_size: 256
          buffer_size: 10240
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 128
          num_layers: 3
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
          gail:
            gamma: 0.99
            strength: 0.5
            encoding_size: 128
            learning_rate: 0.0003
            demo_path: /home/Documents/ML-agents/ml-agents/CarPong/Assets/Demonstrations/PongBehavior.demo
        behavioral_cloning:
          strength: 0.9
          demo_path: /home/Documents/ML-agents/ml-agents/CarPong/Assets/Demonstrations/PongBehavior.demo
        keep_checkpoints: 5
        max_steps: 1000000
        time_horizon: 64
        summary_freq: 10000
        threaded: true

    environment_parameters:
      drone_targets_widths:
        curriculum:
          - name: Lesson0
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.1
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 0.0
          - name: Lesson1
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.3
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 5.0
          - name: Lesson2
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.65
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 15.0
          - name: Lesson3
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.8
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 25.0
          - name: Lesson4
            completion_criteria:
              measure: reward
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 200
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 50.0





    Thanks!
     
    Last edited: Mar 30, 2021
  2. christophergoy

    christophergoy

    Unity Technologies

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @kt66nfkim,
    This sounds reasonable to me. I'm reaching out to some folks on the research team to find out.

    When you say the curriculum part never gets triggered, do you mean that you never leave Lesson0 even after your steps go over 100k? I see that your max_steps is set to 1,000,000 and your threshold for Lesson0 is 0.1, so you'd need to reach 100k steps to move to the next lesson. Is that correct?
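
    For reference, with measure: progress a lesson's threshold is the fraction of max_steps that must have elapsed (in addition to min_lesson_length episodes being observed), so with the numbers above Lesson0 would only complete at around 0.1 × 1,000,000 = 100,000 steps. A minimal sketch of that criterion, using the same fields and values as the config above:

    completion_criteria:
      measure: progress          # fraction of max_steps completed so far
      behavior: PongBehavior
      signal_smoothing: true
      min_lesson_length: 10000   # episodes that must be observed in this lesson
      threshold: 0.1             # 0.1 * max_steps (1,000,000) = 100,000 steps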
     
  3. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    Hi @christophergoy

    Yes, that's exactly what I'm seeing. Even after 100k steps the environment never moves on to the next lesson. Here's a small sample output from one training run I tried using the config file I shared earlier. You can see that even after 200k steps the environment stays in the same lesson.

    I also double checked by using a similar config with plain PPO, and it seems to work fine, so I'm a little lost as to why it doesn't work with imitation learning.


    2021-03-30 12:27:17 INFO [stats.py:139] PongBehavior. Step: 10000. Time Elapsed: 72.417 s. Mean Reward: 137.500. Std of Reward: 131.696. Training.
    2021-03-30 12:28:07 INFO [stats.py:139] PongBehavior. Step: 20000. Time Elapsed: 122.900 s. Mean Reward: 12.500. Std of Reward: 92.702. Training.
    2021-03-30 12:28:59 INFO [stats.py:139] PongBehavior. Step: 30000. Time Elapsed: 174.495 s. Mean Reward: 227.080. Std of Reward: 195.504. Training.
    2021-03-30 12:29:50 INFO [stats.py:139] PongBehavior. Step: 40000. Time Elapsed: 226.001 s. Mean Reward: 455.867. Std of Reward: 320.745. Training.
    2021-03-30 12:30:42 INFO [stats.py:139] PongBehavior. Step: 50000. Time Elapsed: 278.101 s. Mean Reward: 779.522. Std of Reward: 340.896. Training.
    2021-03-30 12:31:34 INFO [stats.py:139] PongBehavior. Step: 60000. Time Elapsed: 330.013 s. Mean Reward: 762.778. Std of Reward: 388.825. Training.
    2021-03-30 12:32:26 INFO [stats.py:139] PongBehavior. Step: 70000. Time Elapsed: 381.935 s. Mean Reward: 945.360. Std of Reward: 452.997. Training.
    2021-03-30 12:33:18 INFO [stats.py:139] PongBehavior. Step: 80000. Time Elapsed: 433.979 s. Mean Reward: 693.067. Std of Reward: 440.927. Training.
    2021-03-30 12:34:10 INFO [stats.py:139] PongBehavior. Step: 90000. Time Elapsed: 485.811 s. Mean Reward: 1177.640. Std of Reward: 428.542. Training.
    2021-03-30 12:35:02 INFO [stats.py:139] PongBehavior. Step: 100000. Time Elapsed: 537.423 s. Mean Reward: 1004.973. Std of Reward: 457.758. Training.
    2021-03-30 12:35:54 INFO [stats.py:139] PongBehavior. Step: 110000. Time Elapsed: 589.437 s. Mean Reward: 635.436. Std of Reward: 326.039. Training.
    2021-03-30 12:36:46 INFO [stats.py:139] PongBehavior. Step: 120000. Time Elapsed: 641.146 s. Mean Reward: 903.590. Std of Reward: 509.251. Training.
    2021-03-30 12:37:38 INFO [stats.py:139] PongBehavior. Step: 130000. Time Elapsed: 693.732 s. Mean Reward: 590.364. Std of Reward: 313.053. Training.
    2021-03-30 12:38:30 INFO [stats.py:139] PongBehavior. Step: 140000. Time Elapsed: 746.029 s. Mean Reward: 822.222. Std of Reward: 428.895. Training.
    2021-03-30 12:39:22 INFO [stats.py:139] PongBehavior. Step: 150000. Time Elapsed: 797.526 s. Mean Reward: 862.636. Std of Reward: 306.899. Training.
    2021-03-30 12:40:14 INFO [stats.py:139] PongBehavior. Step: 160000. Time Elapsed: 849.556 s. Mean Reward: 714.709. Std of Reward: 436.609. Training.
    2021-03-30 12:41:08 INFO [stats.py:139] PongBehavior. Step: 170000. Time Elapsed: 903.538 s. Mean Reward: 951.009. Std of Reward: 417.411. Training.
    2021-03-30 12:42:02 INFO [stats.py:139] PongBehavior. Step: 180000. Time Elapsed: 957.181 s. Mean Reward: 780.618. Std of Reward: 340.392. Training.
    2021-03-30 12:42:55 INFO [stats.py:139] PongBehavior. Step: 190000. Time Elapsed: 1010.496 s. Mean Reward: 711.111. Std of Reward: 414.848. Training.
    2021-03-30 12:43:48 INFO [stats.py:139] PongBehavior. Step: 200000. Time Elapsed: 1063.520 s. Mean Reward: 432.400. Std of Reward: 268.492. Training.
    2021-03-30 12:44:40 INFO [stats.py:139] PongBehavior. Step: 210000. Time Elapsed: 1115.521 s. Mean Reward: 535.240. Std of Reward: 322.129. Training.
    2021-03-30 12:45:31 INFO [stats.py:139] PongBehavior. Step: 220000. Time Elapsed: 1167.089 s. Mean Reward: 561.925. Std of Reward: 486.984. Training.
    2021-03-30 12:46:23 INFO [stats.py:139] PongBehavior. Step: 230000. Time Elapsed: 1218.883 s. Mean Reward: 441.390. Std of Reward: 400.383. Training.
    2021-03-30 12:47:15 INFO [stats.py:139] PongBehavior. Step: 240000. Time Elapsed: 1270.426 s. Mean Reward: 780.022. Std of Reward: 365.279. Training.



     
  4. christophergoy

    christophergoy

    Unity Technologies

    Joined:
    Sep 16, 2015
    Posts:
    735
    Someone on our research team just tried it out with the PushBlock environment and it worked for them, so maybe there is a formatting issue in your YAML? I'm not quite sure.

    Do you see the curriculum printed out when you start training as part of the training config?
     
  5. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    Am I supposed to see the curriculum printed out? Even when I run the wall jump example environment I don't see the curriculum being printed out, even though curriculum learning works for that env. I'll try to see if there's something wrong with the format of my config.

    Do you think the person on the research team would mind sharing the config for running curriculum learning in an imitation setup?



    2021-03-30 13:13:18 INFO [learn.py:275] run_seed set to 1263
    2021-03-30 13:13:18 INFO [environment.py:205] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    2021-03-30 13:13:56 INFO [environment.py:112] Connected to Unity environment with package version 1.6.0-preview and communication version 1.2.0
    2021-03-30 13:13:56 INFO [environment.py:271] Connected new brain:
    SmallWallJump?team=0
    2021-03-30 13:13:56 WARNING [stats.py:190] events.out.tfevents.1617076665.youngwook-pc.2198895.0 was left over from a previous run. Deleting.
    2021-03-30 13:13:56 INFO [stats.py:147] Hyperparameters for behavior name SmallWallJump:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: False
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
      memory: None
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: None
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 128
    summary_freq: 20000
    threaded: True
    self_play: None
    behavioral_cloning: None
    framework: pytorch
    2021-03-30 13:13:58 INFO [environment.py:271] Connected new brain:
    BigWallJump?team=0
    2021-03-30 13:13:58 WARNING [stats.py:190] events.out.tfevents.1617076667.youngwook-pc.2198895.1 was left over from a previous run. Deleting.
    2021-03-30 13:13:58 INFO [stats.py:147] Hyperparameters for behavior name BigWallJump:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: False
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
      memory: None
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: None
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 20000000
    time_horizon: 128
    summary_freq: 20000
    threaded: True
    self_play: None
    behavioral_cloning: None
    framework: pytorch
    2021-03-30 13:15:05 INFO [stats.py:139] BigWallJump. Step: 20000. Time Elapsed: 107.393 s. Mean Reward: -0.881. Std of Reward: 0.630. Training.
    2021-03-30 13:15:43 INFO [stats.py:139] SmallWallJump. Step: 20000. Time Elapsed: 145.265 s. Mean Reward: -0.824. Std of Reward: 0.625. Training.
    2021-03-30 13:16:23 INFO [stats.py:139] BigWallJump. Step: 40000. Time Elapsed: 184.477 s. Mean Reward: 0.009. Std of Reward: 0.934. Training.
    2021-03-30 13:17:27 INFO [stats.py:139] SmallWallJump. Step: 40000. Time Elapsed: 249.388 s. Mean Reward: -0.000. Std of Reward: 0.943. Training.
    2021-03-30 13:17:32 INFO [stats.py:139] BigWallJump. Step: 60000. Time Elapsed: 254.267 s. Mean Reward: 0.214. Std of Reward: 0.899. Training.
    2021-03-30 13:18:18 INFO [environment_parameter_manager.py:155] Parameter 'small_wall_height' has been updated to Float: value=2.0. Now in lesson 'Lesson1'
    2021-03-30 13:18:46 INFO [stats.py:139] BigWallJump. Step: 80000. Time Elapsed: 327.867 s. Mean Reward: 0.407. Std of Reward: 0.816. Training.
    2021-03-30 13:19:12 INFO [stats.py:139] SmallWallJump. Step: 60000. Time Elapsed: 353.467 s. Mean Reward: 0.128. Std of Reward: 0.927. Training.
    2021-03-30 13:20:07 INFO [stats.py:139] BigWallJump. Step: 100000. Time Elapsed: 408.920 s. Mean Reward: 0.528. Std of Reward: 0.743. Training.
    2021-03-30 13:20:47 INFO [stats.py:139] SmallWallJump. Step: 80000. Time Elapsed: 448.724 s. Mean Reward: 0.430. Std of Reward: 0.792. Training.
    2021-03-30 13:21:23 INFO [stats.py:139] BigWallJump. Step: 120000. Time Elapsed: 484.553 s. Mean Reward: 0.644. Std of Reward: 0.648. Training.
    2021-03-30 13:22:23 INFO [stats.py:139] SmallWallJump. Step: 100000. Time Elapsed: 544.860 s. Mean Reward: 0.439. Std of Reward: 0.793. Training.
    2021-03-30 13:22:45 INFO [stats.py:139] BigWallJump. Step: 140000. Time Elapsed: 567.327 s. Mean Reward: 0.641. Std of Reward: 0.660. Training.
    2021-03-30 13:24:02 INFO [stats.py:139] BigWallJump. Step: 160000. Time Elapsed: 644.375 s. Mean Reward: 0.732. Std of Reward: 0.537. Training.
    2021-03-30 13:24:07 INFO [stats.py:139] SmallWallJump. Step: 120000. Time Elapsed: 648.766 s. Mean Reward: 0.557. Std of Reward: 0.710. Training.
    2021-03-30 13:25:13 INFO [stats.py:139] BigWallJump. Step: 180000. Time Elapsed: 715.065 s. Mean Reward: 0.740. Std of Reward: 0.529. Training.
    2021-03-30 13:26:10 INFO [stats.py:139] SmallWallJump. Step: 140000. Time Elapsed: 772.192 s. Mean Reward: 0.596. Std of Reward: 0.687. Training.
    2021-03-30 13:26:25 INFO [stats.py:139] BigWallJump. Step: 200000. Time Elapsed: 786.615 s. Mean Reward: 0.716. Std of Reward: 0.551. Training.
    2021-03-30 13:26:25 INFO [environment_parameter_manager.py:155] Parameter 'big_wall_height' has been updated to Uniform sampler: min=4.0, max=7.0. Now in lesson 'Lesson1'

     
  6. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    @christophergoy I'm actually able to get curriculum to run now. I guess it was a formatting issue. Thanks for all the help!
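
    For anyone who lands here with the same symptom, one formatting detail that is easy to get wrong (not confirmed to be the exact fix here, but it matches the behavior): environment_parameters has to be a top-level key, a sibling of behaviors, not indented under the behavior block. A minimal sketch using the names from the config above:

    behaviors:
      PongBehavior:
        trainer_type: ppo
        # ... hyperparameters, gail, behavioral_cloning, etc. as above ...

    # top-level key, at the same indentation as "behaviors"
    environment_parameters:
      drone_targets_widths:
        curriculum:
          - name: Lesson0
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.1
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 0.0
          # ... Lesson1 through Lesson4 as above ...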
     