# Help Wanted How to improve the mean reward of my project (Observations, .yaml file, etc)

Discussion in 'ML-Agents' started by Apolok99, Jul 25, 2021.

1. ### Apolok99

Joined:
Oct 14, 2019
Posts:
4
Hi.
I am a Spanish university student doing my final degree project with ML-Agents, and I would like to ask for some help improving it. I am trying to use ML-Agents (release 12, package v1.7.2) to train an AI that produces a defense animation in a 3D fighting videogame. The project has the following characteristics:
1. Character defender: A fully rigged character with colliders on different body parts (head, chest, legs and arms) who holds a sword (the agent of this project). If the agent collides with one of these colliders, we set a reward of -1 and end the episode.
2. Sword defender (Agent): A sword with a rigidbody and a collider (plus all the usual agent components). Its objective is to collide with the enemy sword, imitating a "real" clash of swords and thus defending the defender. To measure this, I placed a child object at the midpoint of each sword. Using "collision.GetContact(0).point" inside "OnCollisionEnter", I calculate the distance from each sword's midpoint to the collision point. The shorter this distance, the greater the reward.
3. Enemy: A fully rigged character with a sword that has an attack animation. This sword has a collider. When the animation plays, if the sword collides with any collider of the character defender, we set a reward of -1 and end the episode.
Also, to simulate the defender moving the sword, I use inverse kinematics. If the position of the agent goes beyond certain hard-coded distance limits, we set a reward of -1 and end the episode. This also prevents the agent from flying straight at the enemy sword.

Actions
The agent can move "freely" along all axes and rotate around X and Z (not Y, since rotating around Y would only spin the sword on itself). These are discrete actions, since the defender should be able to move and rotate the sword at the same time.
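Moving and rotating simultaneously works with discrete actions because ML-Agents supports multiple independent branches per step. A minimal sketch of one possible decoding, assuming five branches (move X/Y/Z, rotate X/Z) with 3 options each; these branch sizes are an illustration, not taken from the project:

```python
# Assumed layout: 5 discrete branches, each in {0, 1, 2}
# mapped to a direction {-1, 0, +1}. Because the branches are
# independent, the agent can move and rotate on the same step.
def decode_action(branches):
    move = [branches[i] - 1 for i in range(3)]       # dx, dy, dz
    rotate = [branches[i] - 1 for i in range(3, 5)]  # rx, rz
    return move, rotate

move, rotate = decode_action([2, 1, 0, 1, 2])
print(move, rotate)  # [1, 0, -1] [0, 1]
```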

Observations
Currently, the variables observed by the agent are:
• Position, rotation and velocity of the defender's sword (the agent).
• Position of each part of the defender that has a collider.
• Position of the enemy sword.
The observation vector size is 97, and I have Stacked Vectors set to 5.
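Note that stacking multiplies what the network actually sees: with stacked vectors, the policy input is the concatenation of the last N observation vectors, which for this setup is already fairly large:

```python
# Effective network input with stacked vector observations:
obs_size = 97   # per-step observation vector
stacked = 5     # "Stacked Vectors" setting
network_input = obs_size * stacked
print(network_input)  # 485 floats fed to the policy every step
```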

Configuration (.yaml)
Code (YAML):
behaviors:
  DefendSamurai:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 640
      num_layers: 4
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 2000000
    time_horizon: 2048
    summary_freq: 50000
Training

The enemy has a list of attack animations. At the start of each episode, I check which animation comes next and move the enemy character into position to perform it. The resulting neural network therefore has to learn weights that stop the enemy sword across all animations.

Results
Obviously, if I am making this thread it is because the results are not as expected: a low mean reward, and the sword barely collides with the other sword. I have attached two TensorBoard screenshots in case they are useful for analysing what is happening.

So... What can I do to improve this?
I have several ideas that might bring me closer to a solution. Firstly, the observations: it may be that the position of the enemy sword is not enough and the agent needs more information. Secondly, the .yaml file: I have only a brief understanding of it, and maybe changing some of the configuration parameters would make it easier for the agent to learn.

If you are willing to help, do not hesitate to ask me anything about the project that is unclear or not explained in this thread.

And if you've read this far, thank you for your time and I hope you can give me a hand with this. I will be eternally grateful.

Kind regards.


2. ### Unity Technologies

Joined:
Dec 6, 2018
Posts:
150
Before changing the observation space, I'd try running it MUCH longer than 2M steps. Maybe set max_steps to 20M or more. If it becomes unstable, I'd try increasing the batch size.

Basically the learning rate is being annealed, so even though it looks like the reward has converged, it's possible that it hasn't.
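To make the annealing point concrete: with learning_rate_schedule set to linear, ML-Agents decays the learning rate roughly linearly from its initial value toward zero over max_steps, so near the end of training the policy barely updates even if it hasn't converged. A simplified sketch of that schedule (the real implementation differs in minor details):

```python
# Approximate linear learning-rate annealing, as configured in the
# first YAML above (learning_rate: 0.0003, max_steps: 2000000).
def linear_lr(step, initial_lr=3e-4, max_steps=2_000_000):
    remaining = max(1.0 - step / max_steps, 0.0)  # fraction of training left
    return initial_lr * remaining

print(linear_lr(0))          # 0.0003 at the start
print(linear_lr(1_000_000))  # 0.00015 halfway through
print(linear_lr(2_000_000))  # ~0 at max_steps: updates have stalled
```

This is why raising max_steps can help even when the reward curve looks flat: it also stretches out the learning-rate schedule.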

3. ### Apolok99

Joined:
Oct 14, 2019
Posts:
4
First of all, thank you for replying. I have been running trainings with quite a few steps and the results are as follows. Although it seems I fell short with max_steps, there appears to be something else going on, since my first tests already used 20M steps and the results were similar (the same final curve shape: improving, but plateauing around a mean reward of 0.3). I have attached the TensorBoard results to this reply.

I think I have different ways to improve this, as you will understand that an average reward of 0.2 out of 1 is very low.

Firstly, the observations. I mentioned them in my previous post because, although it looks like a complex scenario from the outside, at its core this is a problem of bringing one point on an object close to a point on another. Could the agent be missing information? Or rather, could some extra information help it?
The only restrictions the agent has are the limits on how far it can move the sword and the penalty for hitting any part of the defender's body. Could that be too restrictive, making it hard for the agent to learn? I would also like to mention the reward formula: if the swords collide -> ((max_dist - distance) / max_dist) / 2. This gives each sword a reward between 0 and 0.5. The distance is calculated with Vector3.Distance() between the midpoint of a sword and the collision point.
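The formula above can be sketched and sanity-checked in a few lines; the max_dist value of 0.6 here is an invented placeholder (in the project it would be whatever hard-coded maximum you use, e.g. roughly half the blade length):

```python
# Clash reward as described: the closer the contact point is to the
# sword's midpoint, the higher the reward, capped at 0.5 per sword.
def clash_reward(distance, max_dist):
    distance = min(distance, max_dist)  # clamp so reward never goes negative
    return ((max_dist - distance) / max_dist) / 2

print(clash_reward(0.0, 0.6))  # 0.5 -> contact exactly at the midpoint
print(clash_reward(0.3, 0.6))  # 0.25 -> contact halfway along
print(clash_reward(0.6, 0.6))  # 0.0 -> contact at the tip/hilt
```

One thing this makes visible: since each sword contributes at most 0.5, a mean reward of 0.2-0.3 means the agent is connecting, just far from the midpoints.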
I have also read about normalization of the observations, but with different attack animations that have different sword starting positions, I am not sure it applies.
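Varying starting positions are not necessarily a blocker: with normalize set to true, ML-Agents normalizes observations using running statistics gathered during training, so the normalization adapts to whatever range the observations actually cover. A simplified single-value sketch of that idea (the real implementation is vectorized and differs in detail):

```python
# Running mean/variance normalizer using Welford's online algorithm,
# similar in spirit to what "normalize: true" does per observation.
class RunningNormalizer:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's update: numerically stable running mean and variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.n if self.n > 1 else 1.0
        return (x - self.mean) / (var ** 0.5 + 1e-8)

norm = RunningNormalizer()
for obs in [1.0, 2.0, 3.0, 4.0]:  # e.g. sword positions seen so far
    norm.update(obs)
print(round(norm.normalize(2.5), 3))  # 0.0 -> 2.5 is the running mean
```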

Secondly, the YAML file. I have seen how different a training run can be when you adjust its parameters, so I would like some guidelines to tune it as well as possible. Here is the YAML file I used.
Code (YAML):
behaviors:
  DefendSamurai:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.001
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 640
      num_layers: 4
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 30000000
    time_horizon: 1000
    summary_freq: 50000
Thirdly, more steps. Obviously, this is my last option, since it took me 48 hours of continuous training to reach 30M steps.

And lastly, anything that might help the agent that I am not aware of. I am open to any ideas and willing to share any information about the project.

So, again, I ask for help from anybody who may have an idea of how to improve it. Thanks again for reading this far.

Best regards.
