Resolved Curriculum learning not being triggered?

Discussion in 'ML-Agents' started by kt66nfkim, Mar 30, 2021.

  1. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    Hi

    Is it possible to mix curriculum learning with imitation learning? Maybe I'm configuring things wrong, but when I append my curriculum learning config to my imitation learning config, the curriculum part never seems to get triggered.

    This is the config I'm using. Any hints or help would be appreciated.


    behaviors:
      PongBehavior:
        trainer_type: ppo
        hyperparameters:
          batch_size: 256
          buffer_size: 10240
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 128
          num_layers: 3
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
          gail:
            gamma: 0.99
            strength: 0.5
            encoding_size: 128
            learning_rate: 0.0003
            demo_path: /home/Documents/ML-agents/ml-agents/CarPong/Assets/Demonstrations/PongBehavior.demo
        behavioral_cloning:
          strength: 0.9
          demo_path: /home/Documents/ML-agents/ml-agents/CarPong/Assets/Demonstrations/PongBehavior.demo
        keep_checkpoints: 5
        max_steps: 1000000
        time_horizon: 64
        summary_freq: 10000
        threaded: true

    environment_parameters:
      drone_targets_widths:
        curriculum:
          - name: Lesson0
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.1
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 0.0
          - name: Lesson1
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.3
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 5.0
          - name: Lesson2
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.65
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 15.0
          - name: Lesson3
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.8
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 25.0
          - name: Lesson4
            completion_criteria:
              measure: reward
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 200
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 50.0





    Thanks!
     
    Last edited: Mar 30, 2021
  2. christophergoy

    christophergoy

    Unity Technologies

    Joined:
    Sep 16, 2015
    Posts:
    735
    Hi @kt66nfkim,
    This sounds reasonable to me. I'm reaching out to some folks on the research team to find out.

    When you say the curriculum part never gets triggered, do you mean that you never leave Lesson0 even after your steps go over 100k? I see that your max_steps is set to 1,000,000 and your threshold for Lesson0 is 0.1, so you'd need to reach 100k steps to move to the next lesson. Is that correct?
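
    For reference, with measure: progress a lesson's threshold is the fraction of max_steps that must have elapsed (in addition to min_lesson_length episodes being observed), so with the numbers above Lesson0 would only complete at around 0.1 × 1,000,000 = 100,000 steps. A minimal sketch of that criterion, using the same fields and values as the config above:

    completion_criteria:
      measure: progress          # fraction of max_steps completed so far
      behavior: PongBehavior
      signal_smoothing: true
      min_lesson_length: 10000   # episodes that must be observed in this lesson
      threshold: 0.1             # 0.1 * max_steps (1,000,000) = 100,000 steps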
     
  3. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    Hi @christophergoy

    Yes, that's exactly what I'm seeing. Even after 100k steps the environment never moves on to the next lesson. Here's a small sample output from one training run I tried using the config file I shared earlier. You can see that even after 200k steps the environment stays in the same lesson.

    I also double checked by using a similar config with plain PPO, and it seems to work fine, so I'm a little lost as to why it doesn't work with imitation learning.


    2021-03-30 12:27:17 INFO [stats.py:139] PongBehavior. Step: 10000. Time Elapsed: 72.417 s. Mean Reward: 137.500. Std of Reward: 131.696. Training.
    2021-03-30 12:28:07 INFO [stats.py:139] PongBehavior. Step: 20000. Time Elapsed: 122.900 s. Mean Reward: 12.500. Std of Reward: 92.702. Training.
    2021-03-30 12:28:59 INFO [stats.py:139] PongBehavior. Step: 30000. Time Elapsed: 174.495 s. Mean Reward: 227.080. Std of Reward: 195.504. Training.
    2021-03-30 12:29:50 INFO [stats.py:139] PongBehavior. Step: 40000. Time Elapsed: 226.001 s. Mean Reward: 455.867. Std of Reward: 320.745. Training.
    2021-03-30 12:30:42 INFO [stats.py:139] PongBehavior. Step: 50000. Time Elapsed: 278.101 s. Mean Reward: 779.522. Std of Reward: 340.896. Training.
    2021-03-30 12:31:34 INFO [stats.py:139] PongBehavior. Step: 60000. Time Elapsed: 330.013 s. Mean Reward: 762.778. Std of Reward: 388.825. Training.
    2021-03-30 12:32:26 INFO [stats.py:139] PongBehavior. Step: 70000. Time Elapsed: 381.935 s. Mean Reward: 945.360. Std of Reward: 452.997. Training.
    2021-03-30 12:33:18 INFO [stats.py:139] PongBehavior. Step: 80000. Time Elapsed: 433.979 s. Mean Reward: 693.067. Std of Reward: 440.927. Training.
    2021-03-30 12:34:10 INFO [stats.py:139] PongBehavior. Step: 90000. Time Elapsed: 485.811 s. Mean Reward: 1177.640. Std of Reward: 428.542. Training.
    2021-03-30 12:35:02 INFO [stats.py:139] PongBehavior. Step: 100000. Time Elapsed: 537.423 s. Mean Reward: 1004.973. Std of Reward: 457.758. Training.
    2021-03-30 12:35:54 INFO [stats.py:139] PongBehavior. Step: 110000. Time Elapsed: 589.437 s. Mean Reward: 635.436. Std of Reward: 326.039. Training.
    2021-03-30 12:36:46 INFO [stats.py:139] PongBehavior. Step: 120000. Time Elapsed: 641.146 s. Mean Reward: 903.590. Std of Reward: 509.251. Training.
    2021-03-30 12:37:38 INFO [stats.py:139] PongBehavior. Step: 130000. Time Elapsed: 693.732 s. Mean Reward: 590.364. Std of Reward: 313.053. Training.
    2021-03-30 12:38:30 INFO [stats.py:139] PongBehavior. Step: 140000. Time Elapsed: 746.029 s. Mean Reward: 822.222. Std of Reward: 428.895. Training.
    2021-03-30 12:39:22 INFO [stats.py:139] PongBehavior. Step: 150000. Time Elapsed: 797.526 s. Mean Reward: 862.636. Std of Reward: 306.899. Training.
    2021-03-30 12:40:14 INFO [stats.py:139] PongBehavior. Step: 160000. Time Elapsed: 849.556 s. Mean Reward: 714.709. Std of Reward: 436.609. Training.
    2021-03-30 12:41:08 INFO [stats.py:139] PongBehavior. Step: 170000. Time Elapsed: 903.538 s. Mean Reward: 951.009. Std of Reward: 417.411. Training.
    2021-03-30 12:42:02 INFO [stats.py:139] PongBehavior. Step: 180000. Time Elapsed: 957.181 s. Mean Reward: 780.618. Std of Reward: 340.392. Training.
    2021-03-30 12:42:55 INFO [stats.py:139] PongBehavior. Step: 190000. Time Elapsed: 1010.496 s. Mean Reward: 711.111. Std of Reward: 414.848. Training.
    2021-03-30 12:43:48 INFO [stats.py:139] PongBehavior. Step: 200000. Time Elapsed: 1063.520 s. Mean Reward: 432.400. Std of Reward: 268.492. Training.
    2021-03-30 12:44:40 INFO [stats.py:139] PongBehavior. Step: 210000. Time Elapsed: 1115.521 s. Mean Reward: 535.240. Std of Reward: 322.129. Training.
    2021-03-30 12:45:31 INFO [stats.py:139] PongBehavior. Step: 220000. Time Elapsed: 1167.089 s. Mean Reward: 561.925. Std of Reward: 486.984. Training.
    2021-03-30 12:46:23 INFO [stats.py:139] PongBehavior. Step: 230000. Time Elapsed: 1218.883 s. Mean Reward: 441.390. Std of Reward: 400.383. Training.
    2021-03-30 12:47:15 INFO [stats.py:139] PongBehavior. Step: 240000. Time Elapsed: 1270.426 s. Mean Reward: 780.022. Std of Reward: 365.279. Training.



     
  4. christophergoy

    christophergoy

    Unity Technologies

    Joined:
    Sep 16, 2015
    Posts:
    735
    Someone on our research team just tried it out with the PushBlock environment and it worked for them, so maybe there is a formatting issue in your YAML? I'm not quite sure.

    Do you see the curriculum printed out when you start training as part of the training config?
     
  5. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    Am I supposed to see the curriculum printed out? Even when I run the wall jump example environment I don't see the curriculum being printed out, even though curriculum learning works for that env. I'll try to see if there's something wrong with the format of my config.

    Do you think the person on the research team would mind sharing the config for running curriculum learning in an imitation setup?



    2021-03-30 13:13:18 INFO [learn.py:275] run_seed set to 1263
    2021-03-30 13:13:18 INFO [environment.py:205] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    2021-03-30 13:13:56 INFO [environment.py:112] Connected to Unity environment with package version 1.6.0-preview and communication version 1.2.0
    2021-03-30 13:13:56 INFO [environment.py:271] Connected new brain:
    SmallWallJump?team=0
    2021-03-30 13:13:56 WARNING [stats.py:190] events.out.tfevents.1617076665.youngwook-pc.2198895.0 was left over from a previous run. Deleting.
    2021-03-30 13:13:56 INFO [stats.py:147] Hyperparameters for behavior name SmallWallJump:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: False
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
      memory: None
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: None
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 128
    summary_freq: 20000
    threaded: True
    self_play: None
    behavioral_cloning: None
    framework: pytorch
    2021-03-30 13:13:58 INFO [environment.py:271] Connected new brain:
    BigWallJump?team=0
    2021-03-30 13:13:58 WARNING [stats.py:190] events.out.tfevents.1617076667.youngwook-pc.2198895.1 was left over from a previous run. Deleting.
    2021-03-30 13:13:58 INFO [stats.py:147] Hyperparameters for behavior name BigWallJump:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: False
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
      memory: None
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: None
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 20000000
    time_horizon: 128
    summary_freq: 20000
    threaded: True
    self_play: None
    behavioral_cloning: None
    framework: pytorch
    2021-03-30 13:15:05 INFO [stats.py:139] BigWallJump. Step: 20000. Time Elapsed: 107.393 s. Mean Reward: -0.881. Std of Reward: 0.630. Training.
    2021-03-30 13:15:43 INFO [stats.py:139] SmallWallJump. Step: 20000. Time Elapsed: 145.265 s. Mean Reward: -0.824. Std of Reward: 0.625. Training.
    2021-03-30 13:16:23 INFO [stats.py:139] BigWallJump. Step: 40000. Time Elapsed: 184.477 s. Mean Reward: 0.009. Std of Reward: 0.934. Training.
    2021-03-30 13:17:27 INFO [stats.py:139] SmallWallJump. Step: 40000. Time Elapsed: 249.388 s. Mean Reward: -0.000. Std of Reward: 0.943. Training.
    2021-03-30 13:17:32 INFO [stats.py:139] BigWallJump. Step: 60000. Time Elapsed: 254.267 s. Mean Reward: 0.214. Std of Reward: 0.899. Training.
    2021-03-30 13:18:18 INFO [environment_parameter_manager.py:155] Parameter 'small_wall_height' has been updated to Float: value=2.0. Now in lesson 'Lesson1'
    2021-03-30 13:18:46 INFO [stats.py:139] BigWallJump. Step: 80000. Time Elapsed: 327.867 s. Mean Reward: 0.407. Std of Reward: 0.816. Training.
    2021-03-30 13:19:12 INFO [stats.py:139] SmallWallJump. Step: 60000. Time Elapsed: 353.467 s. Mean Reward: 0.128. Std of Reward: 0.927. Training.
    2021-03-30 13:20:07 INFO [stats.py:139] BigWallJump. Step: 100000. Time Elapsed: 408.920 s. Mean Reward: 0.528. Std of Reward: 0.743. Training.
    2021-03-30 13:20:47 INFO [stats.py:139] SmallWallJump. Step: 80000. Time Elapsed: 448.724 s. Mean Reward: 0.430. Std of Reward: 0.792. Training.
    2021-03-30 13:21:23 INFO [stats.py:139] BigWallJump. Step: 120000. Time Elapsed: 484.553 s. Mean Reward: 0.644. Std of Reward: 0.648. Training.
    2021-03-30 13:22:23 INFO [stats.py:139] SmallWallJump. Step: 100000. Time Elapsed: 544.860 s. Mean Reward: 0.439. Std of Reward: 0.793. Training.
    2021-03-30 13:22:45 INFO [stats.py:139] BigWallJump. Step: 140000. Time Elapsed: 567.327 s. Mean Reward: 0.641. Std of Reward: 0.660. Training.
    2021-03-30 13:24:02 INFO [stats.py:139] BigWallJump. Step: 160000. Time Elapsed: 644.375 s. Mean Reward: 0.732. Std of Reward: 0.537. Training.
    2021-03-30 13:24:07 INFO [stats.py:139] SmallWallJump. Step: 120000. Time Elapsed: 648.766 s. Mean Reward: 0.557. Std of Reward: 0.710. Training.
    2021-03-30 13:25:13 INFO [stats.py:139] BigWallJump. Step: 180000. Time Elapsed: 715.065 s. Mean Reward: 0.740. Std of Reward: 0.529. Training.
    2021-03-30 13:26:10 INFO [stats.py:139] SmallWallJump. Step: 140000. Time Elapsed: 772.192 s. Mean Reward: 0.596. Std of Reward: 0.687. Training.
    2021-03-30 13:26:25 INFO [stats.py:139] BigWallJump. Step: 200000. Time Elapsed: 786.615 s. Mean Reward: 0.716. Std of Reward: 0.551. Training.
    2021-03-30 13:26:25 INFO [environment_parameter_manager.py:155] Parameter 'big_wall_height' has been updated to Uniform sampler: min=4.0, max=7.0. Now in lesson 'Lesson1'

     
  6. kt66nfkim

    kt66nfkim

    Joined:
    Dec 14, 2020
    Posts:
    5
    @christophergoy I'm actually able to get curriculum to run now. I guess it was a formatting issue. Thanks for all the help!
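
    For anyone who lands here with the same symptom, one formatting detail that is easy to get wrong (not confirmed to be the exact fix here, but it matches the behavior): environment_parameters has to be a top-level key, a sibling of behaviors, not indented under the behavior block. A minimal sketch using the names from the config above:

    behaviors:
      PongBehavior:
        trainer_type: ppo
        # ... hyperparameters, gail, behavioral_cloning, etc. as above ...

    # top-level key, at the same indentation as "behaviors"
    environment_parameters:
      drone_targets_widths:
        curriculum:
          - name: Lesson0
            completion_criteria:
              measure: progress
              behavior: PongBehavior
              signal_smoothing: true
              min_lesson_length: 10000
              threshold: 0.1
            value:
              sampler_type: uniform
              sampler_parameters:
                min_value: 0.0
                max_value: 0.0
          # ... Lesson1 through Lesson4 as above ...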
     