
Resolved ML Agents - Brain favors specific solution despite equal rewards

Discussion in 'ML-Agents' started by rahulchawla2801, Aug 25, 2023.

  1. rahulchawla2801


    Joined: Oct 5, 2021
    I'm facing a problem with Unity ML-Agents. Out of 200 possible solutions, 10 distinct ones all give the same +1 reward, and the other 190 give a -1 reward. Yet my trained brain oddly leans towards one of the 10 positive solutions far more than the other 9.

    • I'm working on a relatively straightforward practice problem where I apply an impulse to a rigidbody based on the input received in the OnActionReceived function. The reward is determined by where the rigidbody lands after the impulse.
    • To avoid conflicts, I've incorporated a boolean lock in the OnActionReceived function, ensuring that no new impulse is applied while the rigidbody is waiting for a reward.
    • One variable that distinguishes the positive solutions is the time interval between the impulse and the reward. However, I've attempted to mitigate this discrepancy by uniformly delaying the reward function for all positive solutions.
    • I've exhaustively experimented with a wide range of parameters, including:
      • beta: [0.001 to 0.5]
      • epsilon: [0 to 0.95]
      • lambda: [0 to 0.95]
      • strength: [0.1 to 1]
      • gamma: [0, 0.9]
    • For each parameter combination, I've conducted rigorous training with 1 million steps. However, either the cumulative reward doesn't converge or, when convergence occurs, the agent favors only one specific positive solution during validation.
    • I've implemented the action input as 3 discrete branches, each taking values in the range [0, 6], read from actions.DiscreteActions. These three discrete values are then combined to generate the Vector3 impulse that is applied to the rigidbody.

    • Since I am running around 1000 agents in parallel, I have created a simple pool of 1000 rigidbodies, which are initialized and activated via SetActive() at the start of each agent's action cycle.
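    To make the action setup concrete, here is a minimal sketch of how three discrete branches in [0, 6] might be decoded into an impulse vector. The centering around the middle value and the scale factor are my assumptions for illustration, not the actual project code:

    ```python
    # Hypothetical decoding of three discrete branches, each in [0, 6],
    # into a 3D impulse. Centering and scale are assumptions.

    IMPULSE_SCALE = 1.0  # assumed scale factor

    def decode_impulse(discrete_actions):
        """Map three discrete action values in [0, 6] to a 3D impulse.

        Each branch is centered so the middle value (3) maps to zero
        force on that axis, giving components in [-3, 3] before scaling.
        """
        assert len(discrete_actions) == 3
        assert all(0 <= a <= 6 for a in discrete_actions)
        return tuple((a - 3) * IMPULSE_SCALE for a in discrete_actions)

    # The middle action on every branch produces no impulse.
    print(decode_impulse([3, 3, 3]))  # -> (0.0, 0.0, 0.0)
    print(decode_impulse([6, 0, 3]))  # -> (3.0, -3.0, 0.0)
    ```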

    The outcome that I want

    I want the training process to ensure that all 10 solutions I've identified have nearly equal importance. This way, when I use these solutions in my game, the AI brain should randomly select any one of these 10 solutions, rather than favoring just one of them.
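    As a sketch of the end state I'm after: if the policy kept equal probability mass on the 10 good solutions (equal logits into the softmax), sampling would pick each one roughly 10% of the time instead of collapsing onto a single favorite:

    ```python
    # Desired behavior: equal logits over the 10 good solutions means the
    # softmax assigns each a probability of ~0.1, and sampling spreads
    # evenly across them.
    import math
    import random

    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    equal_logits = [2.0] * 10          # all 10 solutions valued equally
    probs = softmax(equal_logits)
    print(probs[0])                    # -> 0.1 for every solution

    random.seed(0)
    counts = [0] * 10
    for _ in range(10_000):
        counts[random.choices(range(10), weights=probs)[0]] += 1
    print(min(counts), max(counts))    # roughly 1000 each
    ```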

    One of the YAML files
    Code (YAML):
    behaviors:
      AO:
        trainer_type: ppo
        hyperparameters:
          batch_size: 128
          buffer_size: 20480
          learning_rate: 0.0003
          beta: 0.01
          epsilon: 0.8
          lambd: 0.3
          num_epoch: 3
          learning_rate_schedule: linear
        network_settings:
          normalize: true
          hidden_units: 128
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.9
            strength: 1.0
        max_steps: 5000000
        time_horizon: 1000
        summary_freq: 8000
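    One way I can quantify the collapse I'm describing (assuming the policy's entropy curve in TensorBoard is a fair proxy) is the entropy of the choice among the 10 solutions: a uniform pick over 10 options has entropy ln(10) ≈ 2.30, while a policy locked onto one solution drops toward 0. The probabilities below are made-up numbers for illustration:

    ```python
    # Entropy of the solution choice: uniform over 10 options vs. a
    # policy that has collapsed onto a single favorite (probabilities
    # here are illustrative, not measured).
    import math

    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    uniform = [0.1] * 10                     # desired behavior
    collapsed = [0.91] + [0.01] * 9          # observed behavior
    print(round(entropy(uniform), 3))        # -> 2.303
    print(round(entropy(collapsed), 3))      # much closer to 0
    ```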
    Last edited: Aug 25, 2023