
Question: The training process gets stuck

Discussion in 'ML-Agents' started by autoli, Jul 2, 2020.

  1. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    Training runs normally and correctly at the beginning. Then it freezes, without any error, at a step count that is not the max step. Most of my other training runs, with different hyperparameters and models, don't have this problem. I tried "--resume", but the training doesn't seem to progress: the exe environment appears and runs, then stops moving and stops responding after a moment. This may be related to the hyperparameter settings. This time I am using PPO and SAC at the same time. I also changed the SAC buffer size and got the same result. The following picture shows where it hangs when I press Ctrl+C to stop the training; the training process then takes a long time to end.
    upload_2020-7-2_16-43-9.png
     


  2. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    Version information for ML-Agents:
    ml-agents: 0.16.1,
    ml-agents-envs: 0.16.1,
    Communicator API: 1.0.0,
    TensorFlow: 1.13.1
     
  3. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    I tried using --initialize-from with the stuck model, and it trains correctly and properly! I don't know what the problem is. I can't use --resume to continue training it, but it's important for me to keep the curve graph by using "--resume".
     
  4. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    I used debug output to watch the process. It turns out that SAC is updating the whole time.
    upload_2020-7-2_21-39-16.png
     
  5. TreyK-47


    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,835
    I'll flag this for the team to take a look.
     
  6. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    Thank you for your attention! SAC may eventually finish updating after many hours, but that wastes too much time. Resuming with "--resume" can take as long as a whole new SAC training run!
     
  7. andrewcoh_unity


    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Hi @autoli

    Just to clarify, is the problem that you are using --resume with the same run-id and it is not working properly? Can you copy your trainer configuration yaml here?
     
  8. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    Sorry I'm late! Yes, I use --resume with the same run-id. The configuration YAML is as follows:
    default:
        trainer: sac
        batch_size: 128
        buffer_size: 50000
        buffer_init_steps: 0
        hidden_units: 128
        init_entcoef: 1.0
        learning_rate: 3.0e-4
        learning_rate_schedule: constant
        max_steps: 5.0e5
        memory_size: 128
        normalize: false
        steps_per_update: 10
        num_layers: 2
        time_horizon: 64
        sequence_length: 64
        summary_freq: 10000
        tau: 0.005
        use_recurrent: false
        vis_encode_type: simple
        reward_signals:
            extrinsic:
                strength: 1.0
                gamma: 0.99

    Police_cach:
        summary_freq: 30000
        time_horizon: 256
        batch_size: 256
        buffer_init_steps: 10000
        buffer_size: 500000
        hidden_units: 256
        num_layers: 2
        init_entcoef: 0.01
        max_steps: 1.0e8
        sequence_length: 16
        tau: 0.01
        use_recurrent: false
        reward_signals:
            extrinsic:
                strength: 2.0
                gamma: 0.99
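
To make it easier to see which values the Police_cach behavior would actually train with, here is a minimal Python sketch that overlays the behavior section on the default section. It assumes that behavior-specific keys replace default keys one by one (a shallow merge); that merge rule and the helper name `effective_config` are assumptions for illustration, not something confirmed in this thread. Only the keys that affect how often SAC updates are copied in.

```python
# Values copied from the configuration posted above
# (only the keys relevant to SAC update frequency).
default = {
    "trainer": "sac",
    "batch_size": 128,
    "buffer_size": 50000,
    "buffer_init_steps": 0,
    "steps_per_update": 10,
    "max_steps": 5.0e5,
}
police_cach = {
    "batch_size": 256,
    "buffer_size": 500000,
    "buffer_init_steps": 10000,
    "max_steps": 1.0e8,
}


def effective_config(defaults: dict, overrides: dict) -> dict:
    """Hypothetical helper. Assumption: behavior keys replace default keys (shallow merge)."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


effective = effective_config(default, police_cach)
for key in sorted(effective):
    print(f"{key}: {effective[key]}")
```

Under that assumption, Police_cach inherits `trainer: sac` and `steps_per_update: 10` from the default section, because its own section does not set them.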
     
  9. autoli


    Joined:
    May 25, 2020
    Posts:
    9
    What's more, SAC does indeed update after --resume, but it takes too much time! It seems the bug has been fixed in Release 3. However, when I tried ML-Agents Release 3, it still seemed to need a long wait. I don't know whether the bug is completely fixed or whether I am using it incorrectly.
    upload_2020-7-14_11-20-42.png
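
One possible explanation for the long wait, sketched below in Python, is a trainer that keeps the ratio of environment steps to updates at `steps_per_update` and restarts its update counter at zero when a run is resumed. This is a simplified model under stated assumptions, not the ML-Agents source: the function names `pending_updates` and `restored_update_count` and the resumed step count are hypothetical, and only `steps_per_update: 10` comes from the posted configuration.

```python
def pending_updates(current_step: int, updates_done: int, steps_per_update: float) -> int:
    """Updates a ratio-based trainer would run before taking new environment steps."""
    target_updates = int(current_step / steps_per_update)
    return max(0, target_updates - updates_done)


def restored_update_count(current_step: int, steps_per_update: float) -> int:
    """Assumed fix: treat past steps as already updated at the correct ratio."""
    return int(max(1, current_step / steps_per_update))


steps_per_update = 10        # from the posted config
resumed_step = 2_000_000     # hypothetical step count stored in the checkpoint

# Update counter restarts at zero after the resume: a large backlog builds up.
backlog = pending_updates(resumed_step, 0, steps_per_update)
print("updates to catch up:", backlog)

# Update counter restored from the step count: nothing to catch up.
restored = restored_update_count(resumed_step, steps_per_update)
print("updates to catch up after restoring:",
      pending_updates(resumed_step, restored, steps_per_update))
```

If this model applies, the backlog grows with the step count stored in the checkpoint, which would match SAC "updating the whole time" after --resume. A run started with --initialize-from begins a new step count, so it would have no backlog.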