Question: The training process gets stuck

Discussion in 'ML-Agents' started by autoli, Jul 2, 2020.

  1. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    Training progresses normally at the beginning. Then, at some step before max steps is reached, it gets stuck without any error. Most of my other training runs, with different hyperparameters and models, don't have this problem. I tried using "--resume", but training does not seem to progress: the executable environment appears and runs, but it stops moving and fails to respond after a moment. This may be related to the hyperparameter settings. This time I use PPO and SAC at the same time. I also changed the SAC buffer size and got the same result. The picture below shows where it hangs when I press Ctrl+C to stop training; the training process then takes a long time to exit.
    upload_2020-7-2_16-43-9.png
     

  2. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    Version information for my ML-Agents setup:
    ml-agents: 0.16.1,
    ml-agents-envs: 0.16.1,
    Communicator API: 1.0.0,
    TensorFlow: 1.13.1
     
  3. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    I tried using --initialize-from with the stuck model, and it trains correctly! I don't know what the problem is. I can't use --resume to continue training it, but it's important for me to keep the curve graph by using "--resume".
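For readers comparing the two approaches mentioned above, here is a minimal sketch of the two command lines. The `--run-id`, `--resume` and `--initialize-from` flags are the ones discussed in this thread. The config path and run IDs are hypothetical placeholders, and `build_learn_command` is an illustrative helper, not part of ML-Agents. The sketch only builds the argument lists and does not launch training.

```python
from typing import List, Optional


def build_learn_command(
    config_path: str,
    run_id: str,
    resume: bool = False,
    initialize_from: Optional[str] = None,
) -> List[str]:
    """Build an mlagents-learn argument list (illustrative helper, nothing is executed)."""
    cmd = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    if resume:
        # Continue the same run under the same run ID.
        cmd.append("--resume")
    if initialize_from is not None:
        # Start a new run ID from the weights of an earlier run.
        cmd.append(f"--initialize-from={initialize_from}")
    return cmd


# Hypothetical config path and run IDs, for illustration only.
resume_cmd = build_learn_command("config/sac.yaml", "police_run", resume=True)
warm_start_cmd = build_learn_command(
    "config/sac.yaml", "police_run_v2", initialize_from="police_run"
)

print(" ".join(resume_cmd))
print(" ".join(warm_start_cmd))
```

As far as I understand it, the second form writes its summaries under the new run ID, which would explain why the original curve is not continued. Treat that as an assumption to verify against your own results folder.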
     
  4. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    I turned on debug output to see what the process is doing. It turns out that SAC is updating the whole time.
    upload_2020-7-2_21-39-16.png
     
  5. TreyK-47

    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,787
    I'll flag this for the team to take a look.
     
  6. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    Thank you for your attention! The SAC update may eventually finish after many hours, but that wastes too much time. Continuing with "--resume" can take as long as a new SAC training run!
     
  7. andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Hi @autoli

    Just to clarify, is the problem that you are using --resume with the same run-id and it is not working properly? Can you copy your trainer configuration yaml here?
     
  8. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    Sorry for the late reply! Yes, I use --resume with the same run-id. The configuration YAML is as follows:
    default:
      trainer: sac
      batch_size: 128
      buffer_size: 50000
      buffer_init_steps: 0
      hidden_units: 128
      init_entcoef: 1.0
      learning_rate: 3.0e-4
      learning_rate_schedule: constant
      max_steps: 5.0e5
      memory_size: 128
      normalize: false
      steps_per_update: 10
      num_layers: 2
      time_horizon: 64
      sequence_length: 64
      summary_freq: 10000
      tau: 0.005
      use_recurrent: false
      vis_encode_type: simple
      reward_signals:
        extrinsic:
          strength: 1.0
          gamma: 0.99

    Police_cach:
      summary_freq: 30000
      time_horizon: 256
      batch_size: 256
      buffer_init_steps: 10000
      buffer_size: 500000
      hidden_units: 256
      num_layers: 2
      init_entcoef: 0.01
      max_steps: 1.0e8
      sequence_length: 16
      tau: 0.01
      use_recurrent: false
      reward_signals:
        extrinsic:
          strength: 2.0
          gamma: 0.99
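To make it easier to see which values the Police_cach behavior actually trains with, here is a small sketch that combines the two sections of the config above. It assumes the behavior section is overlaid on `default` with a shallow, key-by-key override. That merge rule is my assumption about this config format, so check it against the ML-Agents documentation for your version. The values are copied from the config in this post.

```python
# Values copied from the "default" section of the posted config.
default = {
    "trainer": "sac",
    "batch_size": 128,
    "buffer_size": 50000,
    "buffer_init_steps": 0,
    "hidden_units": 128,
    "init_entcoef": 1.0,
    "learning_rate": 3.0e-4,
    "learning_rate_schedule": "constant",
    "max_steps": 5.0e5,
    "memory_size": 128,
    "normalize": False,
    "steps_per_update": 10,
    "num_layers": 2,
    "time_horizon": 64,
    "sequence_length": 64,
    "summary_freq": 10000,
    "tau": 0.005,
    "use_recurrent": False,
    "vis_encode_type": "simple",
    "reward_signals": {"extrinsic": {"strength": 1.0, "gamma": 0.99}},
}

# Values copied from the "Police_cach" section of the posted config.
police_cach = {
    "summary_freq": 30000,
    "time_horizon": 256,
    "batch_size": 256,
    "buffer_init_steps": 10000,
    "buffer_size": 500000,
    "hidden_units": 256,
    "num_layers": 2,
    "init_entcoef": 0.01,
    "max_steps": 1.0e8,
    "sequence_length": 16,
    "tau": 0.01,
    "use_recurrent": False,
    "reward_signals": {"extrinsic": {"strength": 2.0, "gamma": 0.99}},
}

# Assumed merge rule: behavior-specific keys replace the defaults (shallow override).
effective = {**default, **police_cach}

for key in ("trainer", "batch_size", "buffer_size", "steps_per_update", "max_steps"):
    print(f"{key}: {effective[key]}")
```

Under that assumption, the behavior inherits `trainer: sac`, `steps_per_update: 10` and `learning_rate: 3.0e-4` from `default`, and uses its own larger batch and buffer sizes.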
     
  9. autoli

    Joined:
    May 25, 2020
    Posts:
    9
    What's more, SAC does keep updating after --resume, but it takes far too long! It seems the bug has been fixed in Release 3, but when I tried ML-Agents Release 3 it still needed a long wait. I don't know whether the bug is completely fixed or whether I'm using it incorrectly.
    upload_2020-7-14_11-20-42.png
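One possible explanation for the long wait, offered only as a guess that nothing in this thread confirms: if the trainer restores the environment step count on --resume but starts its own update counter from zero, a loop that keeps the ratio of steps to updates near `steps_per_update` would have to work through a large backlog before collecting new experience. The sketch below only does that arithmetic. `backlog_updates`, the restored step count and the time per update are all hypothetical. Only `steps_per_update: 10` comes from the posted config.

```python
def backlog_updates(restored_step: int, steps_per_update: int, updates_done: int = 0) -> int:
    """Updates still owed if one update is expected per `steps_per_update` steps.

    This models an assumed catch-up rule, not the actual ML-Agents trainer code.
    """
    target_updates = restored_step // steps_per_update
    return max(0, target_updates - updates_done)


STEPS_PER_UPDATE = 10          # from the posted config
RESTORED_STEP = 3_000_000      # hypothetical step count at the time of --resume
SECONDS_PER_UPDATE = 0.05      # hypothetical cost of a single SAC update

owed = backlog_updates(RESTORED_STEP, STEPS_PER_UPDATE)
hours = owed * SECONDS_PER_UPDATE / 3600

print(f"updates owed after resume: {owed}")
print(f"estimated catch-up time: {hours:.1f} h")

# A run whose step counter starts from 0 would owe nothing under this model.
print(f"updates owed at step 0: {backlog_updates(0, STEPS_PER_UPDATE)}")
```

If this guess is right, it would fit what was reported above: --resume appears to hang while SAC keeps updating, and a run started with --initialize-from trains normally. It remains an assumption until someone from the team confirms it.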