basic tips to increase training stability

Discussion in 'ML-Agents' started by mmmbop, Jun 23, 2022.

  1. mmmbop

    Joined:
    Jan 22, 2022
    Posts:
    20
    Which parameters of the neural network are responsible for the stability of the learned behavior?

    My training (not stable):
    Code (Text):
    [INFO] Agent. Step: 3000000. Time Elapsed: 6701.190 s. Mean Reward: 449.745. Std of Reward: 464.177. Training.
    [INFO] Agent. Step: 3030000. Time Elapsed: 6767.475 s. Mean Reward: 281.317. Std of Reward: 303.813. Training.
    [INFO] Agent. Step: 3060000. Time Elapsed: 6825.893 s. Mean Reward: 1024.422. Std of Reward: 1616.215. Training.
    [INFO] Agent. Step: 3090000. Time Elapsed: 6891.545 s. Mean Reward: 333.737. Std of Reward: 343.476. Training.
    [INFO] Agent. Step: 3120000. Time Elapsed: 6961.993 s. Mean Reward: 529.770. Std of Reward: 438.336. Training.
    [INFO] Agent. Step: 3150000. Time Elapsed: 7028.978 s. Mean Reward: 386.342. Std of Reward: 240.528. Training.
    [INFO] Agent. Step: 3180000. Time Elapsed: 7089.501 s. Mean Reward: 1242.240. Std of Reward: 1191.351. Training.
    [INFO] Agent. Step: 3210000. Time Elapsed: 7162.898 s. Mean Reward: 471.763. Std of Reward: 76.120. Training.
    [INFO] Agent. Step: 3240000. Time Elapsed: 7225.747 s. Mean Reward: 392.818. Std of Reward: 510.116. Training.

    and my network params:


    behaviors:
      Agent:
        trainer_type: ppo
        hyperparameters:
          batch_size: 2048
          buffer_size: 20480
          learning_rate: 0.0003
          beta: 0.005
          epsilon: 0.2
          lambd: 0.95
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 512
          num_layers: 3
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.995
            strength: 1.0
        keep_checkpoints: 30
        checkpoint_interval: 1000000
        max_steps: 50000000
        time_horizon: 1000
        summary_freq: 30000



    Thanks!
     
    Last edited: Jun 24, 2022
  2. kokimitsunami

    Joined:
    Sep 2, 2021
    Posts:
    25
    It totally depends on the task you want to perform. In my case, I reduced the size of the neural network and training stabilized: my network uses hidden_units=128 and num_layers=2, whereas num_layers=3 sometimes made training extremely slow and unstable.
    How you give rewards is also very important. You may need to adjust the way you give rewards to make learning easier.
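    To illustrate the reward point, here is a minimal, hedged sketch of what small, dense shaping rewards can look like for a Walker-style locomotion agent. The field names, the 0.01 scale, and the fall check are hypothetical, not taken from the original post:
    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;
    using UnityEngine;

    public class WalkerStyleAgent : Agent
    {
        // Hypothetical references for illustration only.
        public Transform target;
        public Rigidbody body;

        public override void OnActionReceived(ActionBuffers actions)
        {
            // ...apply the actions to the joints here...

            // Dense shaping reward: moving toward the target earns a small amount,
            // scaled so a single step never adds more than 0.01.
            Vector3 toTarget = (target.position - body.position).normalized;
            float facing = Mathf.Clamp01(Vector3.Dot(body.velocity.normalized, toTarget));
            AddReward(0.01f * facing);

            // End the episode when the agent falls instead of handing out a large penalty.
            if (body.position.y < 0.3f)
            {
                EndEpisode();
            }
        }
    }
    Keeping per-step rewards small and frequent tends to keep the episode reward (and its std) at a scale that is easier to interpret than large, sparse rewards.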
     
  3. Zibelas

    Joined:
    Aug 1, 2017
    Posts:
    6
    Your mean reward is really high as well. Are you rewarding the agent a lot? Are individual rewards large in magnitude (reward > 1)?
    Do you apply a small decay on each step? Do you punish wrong actions? Is your environment fixed or randomized? Why is it always called Agent and once SimpleWalker?
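    (For reference, a minimal sketch of what a small per-step decay combined with bounded rewards can look like; the helper methods here are hypothetical placeholders, not anything from your project.)
    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class BoundedRewardAgent : Agent
    {
        // Hypothetical helpers for illustration; replace with your own checks.
        bool ReachedGoal() { return false; }
        bool HitObstacle() { return false; }

        public override void OnActionReceived(ActionBuffers actions)
        {
            // Small per-step decay: over a full episode it sums to at most -1,
            // nudging the agent to act rather than idle (only if MaxStep is set).
            if (MaxStep > 0)
            {
                AddReward(-1f / MaxStep);
            }

            // Terminal rewards kept within [-1, 1] so the reported mean reward
            // stays at an interpretable scale.
            if (ReachedGoal())
            {
                AddReward(1f);
                EndEpisode();
            }
            else if (HitObstacle())
            {
                AddReward(-1f);
                EndEpisode();
            }
        }
    }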
     
  4. mmmbop

    Joined:
    Jan 22, 2022
    Posts:
    20
    Thanks!
    I'm training something like the Walker/Crawler (Locomotion) examples, so the further the agent goes, the better the reward.
    My character walks reasonably well now, but it sometimes fails.
    I will try playing with the size of the hidden layers and the other suggestions.

    No, my reward is designed so it can be at most 1 at each time step. As I said, my task has no limit, just infinite walking. The SimpleWalker name is just a copy/paste error.