Question: the training process gets stuck

autoli · Jul 2, 2020

Training runs normally at the beginning. It then freezes, without any error, at a step count below the max step. Most of my other training runs, with different hyperparameters and models, don't have this problem. I tried "--resume", but training doesn't seem to progress: the environment executable launches and runs, then stops moving and becomes unresponsive after a moment. This may be related to the hyperparameter settings. This time I am using PPO and SAC at the same time. I also changed the SAC buffer size and got the same result. The attached picture shows where it hangs when I press Ctrl+C to stop the training; it takes a long time for the training process to exit.

autoli · Jul 2, 2020

Version information:
ml-agents: 0.16.1
ml-agents-envs: 0.16.1
Communicator API: 1.0.0
TensorFlow: 1.13.1

autoli · Jul 2, 2020

I tried using --initialize-from with the stuck model, and it trains correctly! I don't know what the problem is. I can't use --resume to continue training it, but it's important for me to keep the curve graph, which requires "--resume".

autoli · Jul 2, 2020

I used a debugger to look at the process. It turns out that SAC is updating the whole time.

TreyK-47 · Jul 9, 2020

I'll flag this for the team to take a look.

autoli · Jul 10, 2020

Thank you for your attention! It may eventually finish updating SAC after many hours, but this wastes too much time. Resuming with "--resume" can take as long as a new SAC training run!

andrewcoh_unity · Jul 10, 2020

Hi @autoli

Just to clarify, is the problem that you are using --resume with the same run-id and it is not working properly? Can you copy your trainer configuration yaml here?

autoli · Jul 14, 2020

Sorry I'm late! Yes, I use --resume with the same run-id. The configuration YAML is as follows:
default:
    trainer: sac
    batch_size: 128
    buffer_size: 50000
    buffer_init_steps: 0
    hidden_units: 128
    init_entcoef: 1.0
    learning_rate: 3.0e-4
    learning_rate_schedule: constant
    max_steps: 5.0e5
    memory_size: 128
    normalize: false
    steps_per_update: 10
    num_layers: 2
    time_horizon: 64
    sequence_length: 64
    summary_freq: 10000
    tau: 0.005
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99

Police_cach:
    summary_freq: 30000
    time_horizon: 256
    batch_size: 256
    buffer_init_steps: 10000
    buffer_size: 500000
    hidden_units: 256
    num_layers: 2
    init_entcoef: 0.01
    max_steps: 1.0e8
    sequence_length: 16
    tau: 0.01
    use_recurrent: false
    reward_signals:
        extrinsic:
            strength: 2.0
            gamma: 0.99
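
Editor's note: to read this configuration it helps to see which values actually apply to the Police_cach behavior. The Python sketch below assumes that the behavior section overrides the "default" section key by key (a shallow merge, so the nested reward_signals block is replaced as a whole). That merge rule is an assumption about how this config format is resolved, not something stated in the thread, and the helper name effective_config is hypothetical. Only a subset of the posted keys is copied.

```python
# Values copied from the configuration posted above (subset of keys).
default = {
    "trainer": "sac",
    "batch_size": 128,
    "buffer_size": 50000,
    "buffer_init_steps": 0,
    "max_steps": 5.0e5,
    "steps_per_update": 10,
    "tau": 0.005,
    "reward_signals": {"extrinsic": {"strength": 1.0, "gamma": 0.99}},
}

police_cach = {
    "batch_size": 256,
    "buffer_size": 500000,
    "buffer_init_steps": 10000,
    "max_steps": 1.0e8,
    "tau": 0.01,
    "reward_signals": {"extrinsic": {"strength": 2.0, "gamma": 0.99}},
}


def effective_config(defaults, overrides):
    """Hypothetical helper: shallow merge, behavior keys win over defaults.

    ASSUMPTION: this mirrors how the trainer resolves the two sections.
    """
    merged = dict(defaults)
    merged.update(overrides)
    return merged


effective = effective_config(default, police_cach)
for key in ("trainer", "batch_size", "buffer_size",
            "buffer_init_steps", "steps_per_update", "max_steps"):
    print(f"{key}: {effective[key]}")
```

Under that assumption, Police_cach trains with SAC, keeps steps_per_update at 10 from the defaults, and uses the larger buffer (500000) with a 10000-step warm-up.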

autoli · Jul 14, 2020

What's more, SAC is indeed updating after --resume, but it takes too much time! It seems the bug has been fixed in Release 3. However, when I tried ML-Agents Release 3, it still seemed to need a long wait. I don't know whether the bug is completely fixed or whether I am using it incorrectly.
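
Editor's note: one possible reading of the symptom, not confirmed in this thread, is that after --resume the trainer believes it owes one SAC update for every steps_per_update environment steps already taken, and works through that backlog before stepping the environment again. The Python sketch below only does the arithmetic for that assumed model. The function name and the resumed step count are hypothetical; buffer_init_steps and steps_per_update are the values from the posted configuration.

```python
def estimated_catch_up_updates(resumed_step, buffer_init_steps, steps_per_update):
    """Estimate the update backlog after a resume.

    ASSUMED MODEL (not taken from the ML-Agents source): one update is owed
    per `steps_per_update` environment steps taken after the buffer warm-up,
    and the update counter restarts from zero on resume.
    """
    owed_steps = max(0, resumed_step - buffer_init_steps)
    return owed_steps // steps_per_update


# Hypothetical step at which the run was resumed.
resumed_step = 2_000_000

backlog = estimated_catch_up_updates(
    resumed_step,
    buffer_init_steps=10000,  # from the Police_cach section
    steps_per_update=10,      # from the default section
)
print(f"updates owed before the environment moves again: {backlog}")
```

If that model holds, the backlog grows linearly with the step at which the run stopped, which would be consistent with a resume taking about as long as a fresh SAC run.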

Attached Files:

upload_2020-7-2_16-30-9.png

upload_2020-7-2_16-42-17.png