# Resolved ML Agents - Brain favors specific solution despite equal rewards

Discussion in 'ML-Agents' started by rahulchawla2801, Aug 25, 2023.

1. ### rahulchawla2801

Description
I'm facing a problem with Unity ML-Agents. Out of 200 possible solutions, 10 are positive and each gives the same +1 reward; the other 190 each give a -1 reward. However, my trained brain strongly favors one of the positive solutions over the other 9.
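To make the setup concrete, here is a minimal Python sketch of the reward structure described above. Which solution indices are the positive ones is an arbitrary assumption for illustration, and `expected_reward` is a hypothetical helper, not part of ML-Agents.

```python
# Toy model of the reward structure: 200 candidate solutions,
# 10 of which give +1 while the remaining 190 give -1.
NUM_SOLUTIONS = 200
GOOD = set(range(10))  # assumption: indices 0..9 are the positive solutions


def reward(solution):
    """Return +1 for a positive solution and -1 otherwise."""
    return 1.0 if solution in GOOD else -1.0


def expected_reward(policy):
    """Expected reward of a policy given as a list of selection probabilities."""
    return sum(p * reward(i) for i, p in enumerate(policy))


# Three example policies over the 200 solutions.
uniform_all = [1.0 / NUM_SOLUTIONS] * NUM_SOLUTIONS                       # untrained guess
one_good = [1.0 if i == 0 else 0.0 for i in range(NUM_SOLUTIONS)]         # collapsed on one
spread_good = [0.1 if i in GOOD else 0.0 for i in range(NUM_SOLUTIONS)]   # even over all 10
```

Under these assumptions, `one_good` and `spread_good` have the same expected reward, so the reward alone cannot distinguish a policy that always picks one positive solution from one that spreads evenly over all ten.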

Details
• I'm working on a relatively straightforward practice problem in which I apply an impulse to a rigidbody based on the actions received in the OnActionReceived function. The reward is determined by where the rigidbody lands after the impulse.
• To avoid conflicts, I've incorporated a boolean lock in the OnActionReceived function, ensuring that no new impulse is applied while the rigidbody is waiting for a reward.
• One variable that distinguishes the positive solutions is the time interval between the impulse and the reward. However, I've attempted to mitigate this discrepancy by uniformly delaying the reward function for all positive solutions.
• I've experimented with a wide range of parameters, including:
  • beta: [0.001 to 0.5]
  • epsilon: [0 to 0.95]
  • lambda (lambd in the config): [0 to 0.95]
  • strength: [0.1 to 1]
  • gamma: [0, 0.9]
• For each parameter combination, I trained for 1 million steps. Either the cumulative reward doesn't converge or, when it does converge, the agent favors only one specific positive solution during validation.
• The action consists of 3 discrete values, each in the range [0, 6], which I read from actions.DiscreteActions. Together, these three values determine the Vector3 impulse that is applied to the rigidbody.

• Since I am running around 1000 agents in parallel, I have created a simple pool of 1000 rigidbodies which are initialized and SetActive() at the onset of the action cycle for each agent.
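The action handling described in the list above can be sketched in Python as follows. This is an illustration only: the post does not say how the three discrete values map to the impulse, so the linear mapping, `MAX_IMPULSE`, `decode_impulse`, and `LockedAgent` are all hypothetical stand-ins for the actual C# agent code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

BRANCH_SIZE = 7      # each discrete value lies in [0, 6]
MAX_IMPULSE = 5.0    # assumption: impulse scale, not given in the post


def decode_impulse(branches: Sequence[int]) -> Tuple[float, ...]:
    """Map three discrete values in [0, 6] linearly to [-MAX_IMPULSE, +MAX_IMPULSE]."""
    if len(branches) != 3 or any(not 0 <= b < BRANCH_SIZE for b in branches):
        raise ValueError("expected three values in [0, 6]")
    half = (BRANCH_SIZE - 1) / 2  # 3.0, so the value 3 maps to zero impulse
    return tuple(MAX_IMPULSE * (b - half) / half for b in branches)


@dataclass
class LockedAgent:
    """Ignores new actions while the previous impulse is still awaiting its reward."""
    waiting_for_reward: bool = False
    last_impulse: Optional[Tuple[float, ...]] = None

    def on_action_received(self, branches: Sequence[int]) -> bool:
        if self.waiting_for_reward:
            return False  # lock is held: no new impulse is applied
        self.last_impulse = decode_impulse(branches)
        self.waiting_for_reward = True
        return True

    def on_landed(self, reward: float) -> float:
        """Called when the body has landed; releases the lock."""
        self.waiting_for_reward = False
        return reward
```

With this mapping, `decode_impulse([0, 3, 6])` yields `(-5.0, 0.0, 5.0)`, and a second call to `on_action_received` before `on_landed` is rejected.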

The outcome that I want

I want the training process to ensure that all 10 solutions I've identified have nearly equal importance. This way, when I use these solutions in my game, the AI brain should randomly select any one of these 10 solutions, rather than favoring just one of them.
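One possible way to quantify "nearly equal importance" during validation is to count how often each positive solution is chosen and compute the normalized entropy of those counts. The sketch below assumes the chosen solutions can be logged as integer ids; `normalized_entropy` is a hypothetical helper, and the counts are made-up examples rather than results from the actual training runs.

```python
import math
from collections import Counter


def normalized_entropy(counts, num_options):
    """Shannon entropy of the empirical distribution, divided by log(num_options).

    Values near 1 mean all options are chosen about equally often;
    0 means a single option is always chosen.
    """
    total = sum(counts.values())
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )
    return entropy / math.log(num_options)


# Made-up validation logs over 1000 episodes (illustrative only).
balanced = Counter({solution_id: 100 for solution_id in range(10)})
collapsed = Counter({0: 1000})

print(normalized_entropy(balanced, 10))   # close to 1: all ten solutions used
print(normalized_entropy(collapsed, 10))  # zero: only one solution used
```

A score like this only measures the behavior; it does not by itself change what the trained policy prefers.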

One of the YAML files
Code (YAML):
```yaml
behaviors:
  AO:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.8
      lambd: 0.3
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.9
        strength: 1.0
    max_steps: 5000000
    time_horizon: 1000
    summary_freq: 8000
```
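A parameter sweep like the one described under Details could be generated from a base config such as the one above. The following Python sketch is an assumption about how that might be organized: the grid points are hypothetical values picked inside the ranges from the post, the base dictionary is trimmed to the keys being varied, and `make_overrides` and `apply_override` are illustrative helpers, not ML-Agents functions.

```python
import copy
import itertools

# Trimmed-down copy of the config, as a plain dict.
BASE = {
    "behaviors": {
        "AO": {
            "trainer_type": "ppo",
            "hyperparameters": {"beta": 0.01, "epsilon": 0.8, "lambd": 0.3},
            "reward_signals": {"extrinsic": {"gamma": 0.9, "strength": 1.0}},
        }
    }
}

# Assumption: example grid points inside the ranges listed in the post.
GRID = {
    "beta": [0.001, 0.01, 0.1, 0.5],
    "epsilon": [0.1, 0.2, 0.3],
    "lambd": [0.3, 0.9, 0.95],
    "gamma": [0.0, 0.9],
}


def make_overrides(grid):
    """Yield one dict per combination of grid values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def apply_override(base, override):
    """Return a deep copy of the base config with one combination applied."""
    cfg = copy.deepcopy(base)
    behavior = cfg["behaviors"]["AO"]
    for key, value in override.items():
        if key == "gamma":
            behavior["reward_signals"]["extrinsic"]["gamma"] = value
        else:
            behavior["hyperparameters"][key] = value
    return cfg


configs = [apply_override(BASE, o) for o in make_overrides(GRID)]
```

Each entry in `configs` could then be written to its own YAML file and trained under a separate run id, so results from different combinations stay comparable.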

Last edited: Aug 25, 2023