Dear all, I have updated my version of ml-agents to 0.26.0 and noticed some changes to the configuration file. I have a couple of questions about the differences between the first network (see image) and the extra network that appears under "reward signals". In the case of PPO, does the first network represent the "actor" (policy) network? Does the network under reward signals represent the "critic" network? If so, is the critic parametrizing the value function V(s) or the state-action value function Q(s,a)? Are there any links available for the PyTorch implementation? Should both networks have a similar design, i.e. the same number of layers and hidden units?

From my understanding, "goal_conditioning_type: hyper" generates yet another neural network. In the image below, I understand that two hyper-networks are being generated (one for each network). Is that correct? If so, what is the configuration (number of layers/hidden units) of these hyper-networks? Sorry for the long post, and many thanks!
Hi, the #1 network settings are used for both the actor and the critic. #2 is unused in the case of the extrinsic reward, because the extrinsic reward is given by the environment. Other reward signals, such as GAIL or RND, do use a neural network, and the #2 settings configure those networks. You can (and should) remove the whole #2 network_settings block, as it is not used at all in your case. I hope this clarifies it, and sorry for the confusion.
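For context, here is a sketch of a case where the reward-signal network_settings actually matter: a GAIL reward signal trains a discriminator network, and the #2-style settings configure that discriminator rather than the policy/critic. The demo_path below is a placeholder; check the ML-Agents reward-signal docs for the exact keys in your version.

Code (YAML):
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.97
  gail:
    strength: 0.5
    gamma: 0.99
    demo_path: Demos/Expert.demo   # placeholder path to a recorded demonstration
    # These settings size the GAIL discriminator network,
    # not the actor/critic configured under #1:
    network_settings:
      hidden_units: 64
      num_layers: 2

With only the extrinsic reward, no such network exists, which is why the block can simply be deleted.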
Thank you for the reply! I have tried removing the second network from the trainer_config.yaml file; however, when I start running Unity ML-Agents, the second network still appears...

trainer_config.yaml:
Code (YAML):
behaviors:
  BehaviorPPO:
    trainer_type: ppo

    hyperparameters:
      # Hyperparameters common to PPO and SAC
      batch_size: 256
      buffer_size: 4056
      learning_rate: 1.5e-4
      learning_rate_schedule: linear

      # PPO-specific hyperparameters
      beta: 4.6e-2
      epsilon: 0.2
      lambd: 0.98
      num_epoch: 6

    # Configuration of the neural network (common to PPO/SAC)
    network_settings:
      # vis_encode_type: simple
      normalize: true
      hidden_units: 48
      num_layers: 4
      goal_conditioning_type: hyper
      # memory:
      #   sequence_length: 16
      #   memory_size: 48

    # Trainer configurations common to all trainers
    max_steps: 4.0e7
    time_horizon: 54
    summary_freq: 40000
    # keep_checkpoints: 5
    # checkpoint_interval: 50000
    # threaded: true
    # init_path: null

    reward_signals:
      # environment reward (default)
      extrinsic:
        strength: 1.0
        gamma: 0.97
        # network_settings:
        #   normalize: false
        #   hidden_units: 128
        #   num_layers: 2
        #   vis_encode_type: simple
        #   memory: None
        # goal_conditioning_type: hyper

Unity ML-Agents output:
I reinstalled everything (now with Unity environment package version 2.0), but the issue persists... Not only that, but changing the number of hidden_units/num_layers for the #2 network changes my training results, which doesn't make sense since I am only using extrinsic rewards...
The #2 network settings will appear in the terminal at the start of training but will not be used. Changes to these settings should not impact training performance. Are you sure that what you are seeing is not just stochasticity in the training?
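One way to rule out run-to-run stochasticity is to fix the training seed so two runs are comparable like-for-like. If I remember correctly, the run-options YAML accepts a top-level env_settings section alongside "behaviors" (a sketch; runs may still not be bit-identical on GPU):

Code (YAML):
# Top-level section in the same YAML file as "behaviors"
env_settings:
  seed: 12345   # fixed RNG seed; train twice with different #2 settings and compare the reward curves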
My mistake! I just ran the model again, and indeed changing the settings for the #2 network does not affect training performance.