Question PPO Training Issue: Agent Receiving Same Actions Near Optimal Policy

Discussion in 'ML-Agents' started by tamastardi7, Apr 9, 2023.

  tamastardi7
    Hi everyone!

    I am currently using PPO in my project, but I have noticed an issue during training. As my agent gets closer to the optimal policy, it starts selecting the same actions every step, which results in significantly worse rewards. My observation space has 80 values (booleans, floats ranging from 0 to 1, integers, and Vector3s), and the action space has 2 discrete branches and 2 continuous actions.
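To quantify what I mean by "the same actions": on the TensorBoard entropy curve, the discrete branches collapse until nearly all probability sits on a single action. A minimal sketch of the entropy calculation I compare against that curve (the probability values here are made up for illustration):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(probs * np.log(probs)))

# Early in training: near-uniform over a 4-way discrete branch.
early = entropy([0.25, 0.25, 0.25, 0.25])   # ~1.386, i.e. log(4)

# Near the "collapsed" policy: almost all mass on one action.
late = entropy([0.97, 0.01, 0.01, 0.01])    # ~0.17

print(early, late)
```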

    I have included pictures of runs attempting to solve the same problem, with only the hyperparameters changed (such as learning rate, beta, num_layers, buffer_size, batch_size, and hidden_units).
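To make the sweep concrete, these are the kinds of fields I varied between runs (the particular numbers below are illustrative only, not taken from one specific run):

```yaml
# Illustrative sweep values only -- not from a single run
learning_rate: 0.0003   # vs 0.001
beta: 0.005             # vs 0.001
buffer_size: 20480
batch_size: 2048
num_layers: 3
hidden_units: 256
```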

    Here is an example of the hyperparameters:

    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 0.001
      beta: 0.001
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
      beta_schedule: constant
      epsilon_schedule: constant
    network_settings:
      normalize: True
      hidden_units: 248
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
        network_settings:
          normalize: False
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
          memory: None
          goal_conditioning_type: hyper
          deterministic: False
    init_path: None
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 20000000
    time_horizon: 128
    summary_freq: 1000
    threaded: True
    self_play: None
    behavioral_cloning: None

    How the rewards look:

    How the entropy changes over time:

    I have also added all of the important (mapped) actions to the tensorboard, so I can see how those change. One of those:

    Currently, I am training my agent using an executable with 15 parallel environments, and I have not found any exceptions in the logs. However, I am unsure whether this is expected behavior, or if it indicates that the agent has found the optimal policy and cannot improve any further.

    If the agent has indeed reached the optimal policy, I am wondering if there is a way to automatically stop the training process when this occurs.
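One idea I had for the automatic stop: watch the smoothed mean reward and end training once it stops improving for a while. A rough, standalone sketch of that plateau check (the class and names here are my own, not an ML-Agents API):

```python
class PlateauDetector:
    """Signals a stop when the mean reward has not improved by at least
    `min_delta` for `patience` consecutive checks.

    This is a standalone sketch of the logic, not part of ML-Agents."""

    def __init__(self, patience=10, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0

    def update(self, mean_reward):
        if mean_reward > self.best + self.min_delta:
            self.best = mean_reward
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience  # True -> stop training

detector = PlateauDetector(patience=3, min_delta=0.05)
rewards = [0.1, 0.4, 0.8, 0.81, 0.79, 0.80]  # plateaus after 0.8
stops = [detector.update(r) for r in rewards]
print(stops)  # last entry becomes True once the plateau persists
```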

    Thank you in advance for your help!

    Attached Files: