basic tips to increase training stability

Discussion in 'ML-Agents' started by mmmbop, Jun 23, 2022.

  1. mmmbop


    Jan 22, 2022
    What parameters of the neural network are responsible for the stability of behavior?

    My training (not stable):
    [INFO] Agent. Step: 3000000. Time Elapsed: 6701.190 s. Mean Reward: 449.745. Std of Reward: 464.177. Training.
    [INFO] Agent. Step: 3030000. Time Elapsed: 6767.475 s. Mean Reward: 281.317. Std of Reward: 303.813. Training.
    [INFO] Agent. Step: 3060000. Time Elapsed: 6825.893 s. Mean Reward: 1024.422. Std of Reward: 1616.215. Training.
    [INFO] Agent. Step: 3090000. Time Elapsed: 6891.545 s. Mean Reward: 333.737. Std of Reward: 343.476. Training.
    [INFO] Agent. Step: 3120000. Time Elapsed: 6961.993 s. Mean Reward: 529.770. Std of Reward: 438.336. Training.
    [INFO] Agent. Step: 3150000. Time Elapsed: 7028.978 s. Mean Reward: 386.342. Std of Reward: 240.528. Training.
    [INFO] Agent. Step: 3180000. Time Elapsed: 7089.501 s. Mean Reward: 1242.240. Std of Reward: 1191.351. Training.
    [INFO] Agent. Step: 3210000. Time Elapsed: 7162.898 s. Mean Reward: 471.763. Std of Reward: 76.120. Training.
    [INFO] Agent. Step: 3240000. Time Elapsed: 7225.747 s. Mean Reward: 392.818. Std of Reward: 510.116. Training.

    and my network parameters:

    trainer_type: ppo
    batch_size: 2048
    buffer_size: 20480
    learning_rate: 0.0003
    beta: 0.005
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    learning_rate_schedule: linear
    normalize: true
    hidden_units: 512
    num_layers: 3
    vis_encode_type: simple
    gamma: 0.995
    strength: 1.0
    keep_checkpoints: 30
    checkpoint_interval: 1000000
    max_steps: 50000000
    time_horizon: 1000
    summary_freq: 30000
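One way to put a number on "not stable" is to compare the standard deviation of the reward with its mean at each summary step. The following is a minimal sketch, assuming the console lines keep the format shown in the log earlier in this post; `parse_log` and `relative_spread` are hypothetical helper names, not part of ML-Agents.

```python
import re

# Matches the step, mean reward, and reward std in one console summary line.
LOG_PATTERN = re.compile(
    r"Step: (\d+)\..*?"
    r"Mean Reward: (-?\d+(?:\.\d+)?)\. "
    r"Std of Reward: (-?\d+(?:\.\d+)?)\."
)

# Three lines copied from the log in this post.
SAMPLE = """\
[INFO] Agent. Step: 3000000. Time Elapsed: 6701.190 s. Mean Reward: 449.745. Std of Reward: 464.177. Training.
[INFO] Agent. Step: 3060000. Time Elapsed: 6825.893 s. Mean Reward: 1024.422. Std of Reward: 1616.215. Training.
[INFO] Agent. Step: 3210000. Time Elapsed: 7162.898 s. Mean Reward: 471.763. Std of Reward: 76.120. Training.
"""


def parse_log(text):
    """Return a list of (step, mean_reward, std_reward) tuples."""
    rows = []
    for line in text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            rows.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return rows


def relative_spread(rows):
    """Return (step, std / mean) pairs, skipping rows with a zero mean."""
    return [(step, std / mean) for step, mean, std in rows if mean != 0]


if __name__ == "__main__":
    for step, ratio in relative_spread(parse_log(SAMPLE)):
        print(f"step {step}: std/mean = {ratio:.2f}")
```

A ratio near or above 1 means the episodes within one summary window differ from each other about as much as the mean itself, so a single summary line says little about whether the policy is improving.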

    Last edited: Jun 24, 2022
  2. kokimitsunami


    Sep 2, 2021
    It depends entirely on the task you want to perform. In my case, I reduced the size of the neural network and the training stabilized. My network uses hidden_units=128 and num_layers=2; with num_layers=3, training was sometimes extremely slow and unstable.
    How you give rewards is also very important. You may need to adjust your reward design to make learning easier.
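To get a rough feel for how much smaller that network is, here is a minimal sketch that counts the weights and biases in the fully connected hidden layers only. The observation size of 200 is a hypothetical value chosen for illustration, and the count ignores output heads, the value network, and anything else ML-Agents adds, so treat the numbers as relative rather than exact.

```python
def mlp_weight_count(input_size, hidden_units, num_layers):
    """Count weights and biases in a stack of fully connected hidden layers."""
    total = 0
    previous = input_size
    for _ in range(num_layers):
        total += previous * hidden_units + hidden_units  # weights + biases
        previous = hidden_units
    return total


# Hypothetical observation vector size, not taken from the thread.
OBS_SIZE = 200

large = mlp_weight_count(OBS_SIZE, hidden_units=512, num_layers=3)
small = mlp_weight_count(OBS_SIZE, hidden_units=128, num_layers=2)

if __name__ == "__main__":
    print(f"512 x 3: {large} parameters")
    print(f"128 x 2: {small} parameters")
    print(f"ratio: {large / small:.1f}x")
```

Under these assumptions the 512 x 3 configuration has roughly an order of magnitude more hidden-layer parameters than 128 x 2, which is one plausible reason the smaller network trains faster.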
  3. Zibelas


    Aug 1, 2017
    Your mean reward is also really high. Are you rewarding the agent a lot? Are the individual rewards large (reward > 1)?
    Do you apply a small penalty on each step? Are you punishing wrong actions? Is your environment fixed or random? Why is it always called Agent and once SimpleWalker?
  4. mmmbop


    Jan 22, 2022
    I'm working on environments like Walker/Crawler (locomotion), so the farther the agent travels, the higher the reward.
    My character now walks fairly well, but it sometimes fails.
    I will try adjusting hidden_units and num_layers, and the other suggestions.

    No, my reward is designed so that it can be at most 1 at each time step. As I said, my task has no limit, just infinite walking. That was just a copy/paste error.
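With a reward capped at 1 per step, the discounted return still has a ceiling set by gamma. The sketch below works that out for gamma: 0.995 and time_horizon: 1000 from the config earlier in the thread; the function names are hypothetical and the calculation assumes the maximum reward is earned on every step.

```python
def finite_horizon_bound(max_reward, gamma, horizon):
    """Upper bound on the discounted return over `horizon` steps."""
    return max_reward * (1.0 - gamma ** horizon) / (1.0 - gamma)


def infinite_horizon_bound(max_reward, gamma):
    """Upper bound on the discounted return of an endless episode."""
    return max_reward / (1.0 - gamma)


GAMMA = 0.995        # from the posted config
TIME_HORIZON = 1000  # from the posted config
MAX_REWARD = 1.0     # per-step cap described in the post

if __name__ == "__main__":
    print(finite_horizon_bound(MAX_REWARD, GAMMA, TIME_HORIZON))
    print(infinite_horizon_bound(MAX_REWARD, GAMMA))
```

The infinite-horizon bound is 1 / (1 - 0.995) = 200, and the 1000-step bound sits just below it. Mean rewards in the hundreds are therefore plausible if the logged value is an undiscounted sum over a long episode, but that reading of the log is an assumption.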