Question ML Agents Training Unstable, AI not learning well

Nrike458 · Dec 20, 2022

Hey All,

I've been having some issues training my AI with ML-Agents for a car racing game. I was able to get set up and follow along with several basic tutorials, but when I put that code to use in my own game, the results weren't what I expected. I initially followed Code Monkey's AI car tutorial, linked here, and was able to get the training scene set up pretty easily, along with all the code necessary to work with my custom car physics and checkpoint system. (Quick summary: he uses both BC and GAIL for imitation learning, with light use of reinforcement learning at first, and switches to more RL later in training.)

However, when I trained using his config files and a demo I recorded on my own track, the AI took some time to figure things out, and the Mean Reward keeps bouncing around: it trends upward, but not consistently. Toward the later part of training, about 200-500k steps (as far as I wanted to go for initial testing before a "real" multi-hour run), the AI could make it around a turn or two, but very often crashed into the inside wall before turn 1. Other times it made it past turn 2 and well into the long back straight before crashing. I've also had a weird issue where the AI just doesn't move for an entire episode. He mentioned that with his code he had something decent going fairly early on, and it looked like it improved consistently and didn't bounce around as much as mine does.

I've tried tweaking some of the hyperparameters based on this guide, with some improvement, but I still can't find anything that gets the AI around the track consistently. My config file is listed below, and the linked pictures include screenshots of the demo scene, my Agent setup, the TensorBoard graphs, and the CMD prompt from my latest training run. I'd say this round had the best results (and the highest mean reward spike), but I'm not getting the results I'd hoped for yet.

For rewards, my AI gains a point for crossing the correct checkpoint and a slight bonus for its speed (to encourage movement), and loses a point for crashing into a wall, which also ends the episode. In the demo scene pictures, the green cubes represent checkpoints and the red ones represent walls.

behaviors:
  AIDriving:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 10240
      learning_rate: 1.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 0.1
      gail:
        gamma: 0.99
        strength: 1.0
        demo_path: Demos/TrackLeftTurn4.demo
    behavioral_cloning:
      demo_path: Demos/TrackLeftTurn4.demo
      strength: 1.0
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 20000

If anyone can take a look and see what I can do to get a good AI going, I would be very appreciative. Here is the link for the pictures (had to use Google Drive to upload them all).

Thank you!

Nrike458 · Dec 21, 2022

I forgot to mention a couple of things in my initial post. I've tried setting the timescale to 1 in the CLI when starting training, and that helped the AI massively, but still didn't get it to a proper level. I've done quite a few test runs on the left-turn track and tried a mix of reinforcement and imitation learning, again with little improvement. I can post the code for the AI car controller, checkpoint system, and agent if needed.

Nrike458 · Dec 21, 2022

So, I ran three more test runs this evening. I started by disabling imitation learning (both GAIL and BC) and let the AI figure things out on its own using the previous config. I also duplicated the training environment 12 times so the AI could train a bit faster. After about 500k steps, the AI didn't get very far, maybe a couple of checkpoints at best. Next, I increased the raycast sensor length from 20 to 40, and when I stopped training at about 500k steps, the average reward was a bit lower. Throughout both runs, training was still unstable and inconsistent.

I then tried a different config file that also relies on reinforcement learning alone, taken from a tutorial on ML-Agents cars found here, and that helped quite a bit: at 500k steps, the reward was noticeably higher than before. I let the AI train for about 45 minutes, reaching 2 million steps, and they were doing quite well. Training was still a bit of a roller coaster, but it seemed more stable and consistent. At this point the AI are still figuring out how to go faster and optimize their lines, as they were only making it about halfway through my test track when training stopped. Tomorrow I plan to test imitation learning some more with the new config file and let the AI train longer before I stop things. I'll report back once I've finished more training.

GamerLordMat · Dec 21, 2022

500K steps is nothing. I trained my 2D football game for 100 million steps to achieve good results. So let it train a little longer to get comparable results; sometimes an algorithm needs longer to start improving but gives better results in the end.

Nrike458 · Dec 21, 2022

Yeah, I'm starting to realize that. I got up this morning and let my reinforcement learning AI run for quite some time while I did chores and other work. I let them get up to about 7.5 million steps, and they were getting about 3/4 of the way around the track. With the same steps-per-episode value, I can complete a bit over a lap myself. Seems like I do need to train them some more and, of course, get them training on a track with both left and right turns.

For giggles, I tried enabling imitation learning at that 7.5 million step mark, and the results instantly tanked. They started having difficulty getting past the line, and that was with a strength of 0.2 on both GAIL and BC. What was odd was that GAIL errored out in the CLI and failed to load. I'm wondering whether, since the training was so far along, I couldn't activate GAIL in a resume call.

For now, I think I'm going to set up a real test track and let the AI train for at least 10 million steps, possibly more, to see if I can get the results I want. I might try increasing the per-step speed reward, since they currently get 0.0001 * their speed as a little extra, but that might not be influential enough. Thanks for the input; I'll keep messing around with this.

GamerLordMat · Dec 21, 2022

Sorry, I haven't tried imitation training yet, so I can't help you there.
But IMO the best results happen if you give just 1 point for completing the objective (or 1 - (stepsNeeded/maxSteps) if time is crucial). In your case that would be the checkpoints. Don't give points for speed, because the agent will then drive left and right to optimize speed instead of the objective. In my experience, giving points for other stuff never works as intended. As observations, give it the local velocity and local angular velocity (e.g. transform.InverseTransformVector(rb.velocity)), maybe the position, ray sensors so it knows where the walls are, and some clue about where the goal is if it isn't always in front.

Also keep in mind that you will overfit your model (make it work on just that one case) when training on only one map. Consider randomizing your maps, and maybe give the next turn/section as an observation (the map through a Unity Grid Sensor, i.e. like a minimap). It then has a way to predict what is coming (otherwise, what is the position worth?).
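A minimal sketch of the checkpoint-only reward suggested above, written in plain Python rather than Unity C# so it runs on its own. The function name, the clamping, and the example numbers are assumptions for illustration; only the formula 1 - (stepsNeeded/maxSteps) comes from the post.

```python
def checkpoint_reward(steps_needed: int, max_steps: int) -> float:
    """Reward paid once, at the moment a checkpoint is crossed."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    # Clamp the used fraction to [0, 1] so a late arrival never goes negative.
    fraction_used = min(max(steps_needed / max_steps, 0.0), 1.0)
    return 1.0 - fraction_used


# An agent that needs 300 of its 1200 steps earns more than one that needs 900.
fast = checkpoint_reward(300, 1200)
slow = checkpoint_reward(900, 1200)
```

Nothing is paid between checkpoints, so the only way to collect more reward is to reach the next checkpoint in fewer steps.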

Nrike458 · Dec 21, 2022

So, just to clarify, you would recommend subtracting a small amount of points per step to encourage the AI to get going quicker, correct? I do see how giving them points for speed could cause them to focus on the wrong objective. I noticed that mine are currently "swaying" left and right while driving. Currently, I have the ray sensors on the car that detect and pass along the walls and the checkpoints. Along with that, I give them the dot product between the current position and the next checkpoint. However, this seems to pass a distance value (e.g. 0-40 depending on distance to the checkpoint), and not a value between -1 and 1 for the direction to the checkpoint, like the documentation would suggest. Maybe that's how it's intended. Either way, they should have some idea of where the next correct checkpoint is.

For randomizing the tracks, I currently use a modular track piece kit, meaning I have set pieces for corners, straights, switchbacks, etc. Obviously, training on just an oval would mean they have zero knowledge of turning right, so I would need to do the real training on a more difficult track. I currently have about 12 agents running at a time to gather experience. Would you recommend creating one more difficult track for all agents to train on at once, or creating a few tracks with a smaller number of agents training on each?

GamerLordMat · Dec 21, 2022

"So, just to clarify, you would recommend subtracting a small amount of points per step to encourage the AI to get going quicker, correct?" No. What I meant is to give a point only ONCE, when they go through the checkpoint. You penalize slowness by giving fewer points when crossing the checkpoint (1 - (stepsNeeded/maxSteps), something like this), but again only when they arrive at the checkpoint.
"I do see how giving them points for speed could potentially cause them to focus on the wrong object."
Yes, because like you noticed, you don't want them to drive fast, but to get to the goal fast (which doesn't have to mean fast driving; maybe it is better to take a turn slowly).

Short answer: dot is only in [-1,1] if both vectors are normalized (length 1).
The dot product is basically the projection of one vector onto another. You say: the vector V1 is now a new 1D axis, a line; you then project V2 onto it (imagine the shadow of V2 on V1) to get the new value.
So if V1 is the 3D vector (1,0,0) and V2 is (1,1,1), dot gives you 1*1 + 1*0 + 1*0 = 1, which means V2 projected onto the x-axis is 1. If one is normalized and the other is not, you get the position/scale along the normalized vector.
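The difference between the two cases can be checked with a few lines of plain Python. This is only an illustration of the math; the vectors and helper names are made up, and the offset is chosen with length 50 so the numbers come out round.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalized(v):
    """Return v scaled to length 1 (a zero vector stays zero)."""
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        return tuple(0.0 for _ in v)
    return tuple(x / length for x in v)


forward = (1.0, 0.0, 0.0)          # unit-length facing direction
to_checkpoint = (30.0, 0.0, 40.0)  # offset from car to checkpoint, length 50

# Only one vector normalized: the result scales with distance.
raw = dot(forward, to_checkpoint)

# Both vectors normalized: the result is the cosine of the angle, in [-1, 1].
direction = dot(forward, normalized(to_checkpoint))
```

Here `raw` is 30.0 and grows as the checkpoint moves farther away, which matches the 0-40 style values described earlier in the thread, while `direction` stays at about 0.6 regardless of distance.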

"Would you recommend creating 1 track that's more difficult for all agents to train on at once, or go through and create a few tracks for a smaller number of agents to train on at a time?" Depends on your goal. Can they interact with each other? If not then put them all in at the same time (check off self collision in player settings). Most optimal would be to find a method where you train them in an endless loop with randomly generated maps that resemble real maps, in order to have cars that can drive in any map you throw them in.

Nrike458 · Dec 21, 2022

Ok, thank you for the responses! I'll take a look and modify my reward code and dot product code, as it is reading the second kind of value you mentioned, and try to create some random tracks for them to train on. In the final game, the goal is to have the player and 4 AI on the track at once, with collisions on. I'm assuming I would need to find a way for them to avoid hitting other cars at some point, maybe by allowing the ray sensor to see the player/AI cars and giving a slight negative reward for colliding with others. I currently have 1 car per test track, not multiple cars per track, as I couldn't figure out a way for them to not collide without breaking the physics by removing colliders. I'll look into that setting you mentioned. If I had a way to randomly generate tracks, or even choose from a random list, I'm assuming you would still try to have multiple AI on the same track at once, with collisions on, to more closely mimic the real deal, correct?

GamerLordMat · Dec 22, 2022

"I couldn't figure out a way for them to not collide without breaking the physics by removing colliders. I'll look into that setting you mentioned." Project Settings -> Physics -> Layer Collision Matrix; in the grid, uncheck the box where your layer meets itself (you may have to create the layer first).

Yes, in machine learning (and in general) I would always try to use the same test data/training method for training as for the application. So the better the training fits your end game, the better it will work.
Maybe self-play would be something for you? Check out the tennis example that is hidden somewhere in the Unity examples. It is a little more effort, but basically you set up your agent to play against past versions of itself. After you get okay results with one car (make it drive somehow) with PPO or whatever, I would definitely check it out. I just don't know how or if it works with imitation learning (which I believe is not even necessary in easier games like 2D race car games; PPO should work too).

Nrike458 · Dec 22, 2022

Ooo, I believe I saw that settings map/chart thing. That'll be nice. I guess in training there will be OnTriggerEnter calls to make the AI lose points if they crash, and I'll need to set up the ray sensor to see the other AI.

Last questions for now, unless I come up with more lol. If I do set up a random track selection for the AI to train on, what would be the best way to implement it at the start of an episode? I was thinking of having the agent make a random selection in OnEpisodeBegin(), but if I train with multiple AI on the same course, there will be about 4 of those choices made. Is there a way for 1 AI to choose the random course? Maybe just have a boolean so that 1 AI selects the track? In that case, the AI should all complete their episodes at the same time if their step count is the same, right? So 1200 steps should complete at the same time for everyone? I guess that would mean I would need to modify my code to remove the episode stop if they crash into the wall, and add in some points for staying on/in the wall collider. Does my logic seem to follow for setting up the AI for the actual training?

GamerLordMat · Dec 22, 2022

Good point, I haven't done anything similar to your project yet. If you want to train them together the way it should be in the final game, you need to think more about the logic you want to achieve. If one crashes, why should you stop the others from training? Maybe just respawn the crashed car somehow. Generally, I've noticed that having a playable game already, or at least in mind, really helps when designing the AI.

I think Academy is the class made by Unity to handle the scenario where multiple instances are linked to one playground, but your idea of having one dominant agent (just a bool) make the choice seems reasonable.
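One possible shape for the "only one chooser" idea, sketched in plain Python with a hypothetical `TrackSelector` helper (it is not an ML-Agents class). Instead of flagging one agent with a bool, every agent asks a shared selector, and only the first request in a new round triggers a random pick.

```python
import random


class TrackSelector:
    """Hypothetical shared helper: one random track per episode round."""

    def __init__(self, tracks, seed=None):
        self._tracks = list(tracks)
        self._rng = random.Random(seed)
        self._round = None
        self._current = None

    def track_for_round(self, round_index: int):
        # The first caller in a new round picks the track;
        # later callers in the same round get the same answer.
        if round_index != self._round:
            self._round = round_index
            self._current = self._rng.choice(self._tracks)
        return self._current


selector = TrackSelector(["oval", "switchback", "figure8"], seed=0)
# Four agents starting round 3 all receive the same track.
picks = [selector.track_for_round(3) for _ in range(4)]
```

This assumes all agents on a course share a round counter, which in turn assumes their episodes start and end together.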

Nrike458 · Dec 22, 2022

Yeah, I get what you are saying. I'm thinking it'll be a lose-points scenario if they collide with/enter the trigger of each other with self-collisions off. This, combined with the data on where the other AI are, should encourage them to not collide with each other. Worst case scenario, I can add in some code to auto-flip the cars, which I already have.

Just to double check: if they don't crash and nothing stops the episode early, then theoretically the episode length that is set (1200 in my case) should mean they all finish their episode at the same time and can reset at the same time, correct? If that's the case, I'm assuming I would want to increase the number of steps for a larger track to something that allows me to complete a lap of most of the tracks, right?

GamerLordMat · Dec 23, 2022

Yes. 1200 is your max step, but of course the agents can get out of sync if one of them starts earlier than the others. I would recommend using self-play with 5 different teams.

Nrike458 · Dec 23, 2022

Perfect, thank you for taking the time to go through and answer my questions on all of this. ML is pretty neat, but it definitely takes some know-how to get things going. It'll be a few days before I get everything set up and training, but I'll report back on my findings and results!

GamerLordMat · Dec 23, 2022

No problem, I hope that I could help! I am curious about your results

Nrike458 · Jan 3, 2023

Just to give you an update on things. I've got all of the tracks created/modeled, along with all of their checkpoints and wall colliders. I still need to work on the code, but I should have that done in the next day or so. I'll then begin training!

Nrike458 · Jan 14, 2023

Hmmm, initial testing/training is not going well. After around 10M steps, they still haven't gotten through a single checkpoint. I tried training again, this time initializing from the previous successful run, and still no go. As I'm typing this, I think the code has changed too much for the previous training to be of any use.

For observations, I pass along the dot product to the next checkpoint (this isn't the normalized one, just distance; I could never figure out how to get everything normalized, as .normalize isn't working), the speed of the car from the car controller (converted to mph, if that conversion is correct), and the Vector3 of the next checkpoint. I also have a raycast sensor that detects the walls, the checkpoints, the different AI tags, and the player tag. It has a layer mask for the Walls, AI, Player, and Checkpoint layers.

I have a couple of thoughts on what to try next. I think the speed variable might be discouraging the AI from moving. Maybe they think they should keep that one as low as possible? I'm also wondering if I should have 2 separate raycast sensors, one for the positive objectives (checkpoints) and another for the negative ones (walls, cars, AI, etc.). Any thoughts on that?

Nrike458 · Jan 14, 2023

Ok, I might've found a bug in my code. It seems that the very first checkpoint (crossing the finish line) doesn't give any rewards/points. Trying to fix that now.

Nrike458 · Jan 14, 2023

I tried a run of the training without speed, didn't get things going. Fixed the bug, still nothing. Hmmm, back to the drawing board.

Nrike458 · Jan 15, 2023

So, I did some more trial and error today, hoping to get my AI going properly. To start off, I added the speed observation back in and changed it to velocity and angular velocity. It seemed to help a bit, but in 500k steps they still couldn't get to the first checkpoint. I then double checked that all of the observations and code were working properly using Heuristics, and everything seemed to be working as intended. Next, I switched the next-checkpoint position observation from global position to localPosition, thinking the global value was throwing off the AI. After 2.2 million steps of training, they still didn't get to the first checkpoint. After that, I added the localPosition of the car as an observation and let it train for a bit. After 3.4 million steps, they still didn't get to the first checkpoint. I then figured out how to get the dot product observation to return the direction value (-1 to 1) instead of the distance to the next checkpoint. After 3 million steps, Unity froze up when I tried to stop the training (not sure why). They started to move a bit more frequently, but only hit the first checkpoint once or twice in all that time.

I am at a complete loss at this point as to why they haven't figured out how to hit the checkpoint and start completing laps. I think I might need to give a slight reward as they get closer to the checkpoint to encourage them to actually reach the objective (the next checkpoint), and see if that helps. Before, they got points for speed, but that was causing more harm than good, so maybe a change in distance would be better motivation?
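A rough sketch of the "change in distance" idea, in plain Python. The function name and the 0.01 scale are placeholders, not values from the thread. Note that earlier replies advised against paying for anything other than checkpoints, so this is one option to experiment with rather than a recommendation.

```python
def progress_reward(prev_distance: float, curr_distance: float,
                    scale: float = 0.01) -> float:
    """Small reward for closing the gap to the next checkpoint this step.

    Positive when the car got closer, negative when it moved away.
    """
    return scale * (prev_distance - curr_distance)


# Distances to the next checkpoint over five consecutive steps.
distances = [30.0, 29.0, 27.5, 28.0, 25.0]
step_rewards = [progress_reward(a, b) for a, b in zip(distances, distances[1:])]
total = sum(step_rewards)
```

Because each step pays only for the change in distance, the rewards sum to `scale * (start - end)`: swaying back and forth earns nothing extra, unlike a per-step speed bonus.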

hughperkins · Jan 15, 2023

You probably have some bugs in your code, so create tests for *everything*. Observations, rewards, etc.

But other than that, just start small: put the first checkpoint really near the start, so decent chance of hitting it by accident. Then gradually make the distance between checkpoints higher.
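The "start small, then widen" advice could be scheduled along these lines. This is a plain-Python sketch; the distances, the 0.8 threshold, and the function names are all assumed values for illustration.

```python
def checkpoint_distance(lesson: int, start: float = 5.0,
                        step: float = 5.0, cap: float = 30.0) -> float:
    """Distance from the start line to the first checkpoint for a lesson."""
    if lesson < 0:
        raise ValueError("lesson must be non-negative")
    # Grow linearly with the lesson index, but never beyond the real spacing.
    return min(start + step * lesson, cap)


def next_lesson(lesson: int, success_rate: float,
                threshold: float = 0.8) -> int:
    """Advance one lesson once the agent reaches the checkpoint often enough."""
    return lesson + 1 if success_rate >= threshold else lesson


schedule = [checkpoint_distance(i) for i in range(7)]
```

The early lessons put the checkpoint close enough to be hit by accident, and the cap keeps later lessons at the track's real spacing.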

NanushTol · Jan 15, 2023

I am in the process of training a racing hovercraft.
It's going fairly well so far; my advice would be:

1. Use a curriculum, and in each lesson add a reward signal that introduces another sub-task. In my case these were getting close to the target and orienting the craft, which were 2 lessons. Keep the previous lesson's reward in the next lessons so you don't lose progress and consistency.

2. When my agent hits a wall, it's a hard -1 reward set + end episode: not a negative add reward but a set. This made a big difference for me.

3. Small, frequent add rewards worked well for guiding the agent toward the desired behavior, but again, when the actual lesson goal was reached, a hard "set reward" of 1 worked best.

4. I found that working within a pre-decided value range made it easier for me to plan and design lessons, so I work between -1 and 1, where the total accumulated reward an episode can have is 0.6, 1 is for goal success, and -1 is for failure.

5. I've built my own system to manage lessons, so I only pass a single environment variable from the Python env to Unity, and there I have my setup to control the lessons.

6. The agent's learning "quality" is much higher when I randomize parts of the environment.

7. 500K is really a small number of steps. It takes me about 60M to train an agent that can fly in a complex environment, chasing targets and shooting them down (the driver only, i.e. collision avoidance, movement and orientation; targeting is controlled by another brain).

*In the attached image you can see how I lay out my lessons in terms of rewards and randomizations.
** I'm splitting the lessons into multiple stages so I can experiment and initialize more advanced stages from previously trained agents.
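To make the "set" versus "add" distinction in the list above concrete, here is a toy model in plain Python. `StepReward` is a made-up stand-in, not the ML-Agents API; it only mimics the idea that adding accumulates on top of what was already granted in the step, while setting overwrites it.

```python
class StepReward:
    """Toy model of the reward accumulated during one decision step."""

    def __init__(self) -> None:
        self.value = 0.0

    def add(self, amount: float) -> None:
        # "Add reward": accumulate on top of whatever was already granted.
        self.value += amount

    def set(self, amount: float) -> None:
        # "Set reward": discard what was granted and use this value instead.
        self.value = amount


def wall_hit_with_add(shaping: float) -> float:
    """Shaping reward first, then the wall penalty is added."""
    reward = StepReward()
    reward.add(shaping)
    reward.add(-1.0)
    return reward.value


def wall_hit_with_set(shaping: float) -> float:
    """Shaping reward first, then the wall penalty overwrites it."""
    reward = StepReward()
    reward.add(shaping)
    reward.set(-1.0)
    return reward.value
```

With a 0.05 shaping reward already granted in the step, the added penalty ends up at about -0.95, while the set penalty is exactly -1, so the failure signal is not diluted by shaping.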

Nrike458 · Jan 16, 2023

@hughperkins
I've definitely gone through and tested all of the observations and rewards in my code a few times now. To my knowledge, everything should be working as intended if a player goes through and tries to race using Heuristics. The only thing I can't really test to make sure it is giving proper info to the ML side of things is the Ray Perception Sensor that detects things like the walls, the checkpoints, and the other cars on the track.

As I'm typing this, I'm reading through some of the ML-Agents documentation and trying to differentiate between the detectable tags and the layer mask in the Perception Sensor, as the difference wasn't obvious to me at first. The detectable tags seem clear enough in meaning; they allow the sensors to detect objects. The layer mask tells the sensor to "ignore certain types of objects when casting." So, with the checkpoints, walls, etc. in the layer mask, I would be simultaneously telling the AI to both detect and ignore all of the things I want it to look for... I'm definitely going to give it a run tomorrow with that slight tweak to see if that gets things going.

I have the first checkpoint about 30m from the frontmost AI. That was about where it was when I did my initial testing/training. But I can definitely try to move it a bit closer.

@NanushTol
That's definitely an interesting way to go about training, and I can see how it would reinforce positive behavior/traits for the AI. I might need to look into that if I can't get things figured out/going in the right direction.

hughperkins · Jan 16, 2023

Ah Glad you solved the issue (i.e. removing the layers you want to detect from the layer mask)

Nrike458 · Jan 18, 2023

So, I tried a few sets of training over the past couple days, here is what I had noted for results on trying various things.

I tried removing everything from the ray perception sensor layer mask (set it to Nothing), and they were able to get through the first few checkpoints, but couldn't actually make it through the first turn (tracks are randomized; some have long straights at the start, others turn immediately). This was in a test run of about 2-3 million steps.
Curious, I looked more into the ray layer mask on the perception sensor and got mixed results in my findings. Some say that a selected layer is one the ray collides with and looks for; others say what the documentation said. I tried a run where the layer mask was set to the wall layer only, and needless to say, the AI didn't really get anywhere.
It was here that I noticed that the 4th AI (out of 4) wasn't actually getting a checkpoint count updated properly in the Checkpoint system. I added a Debug.Log to the direction dot of the checkpoints, and it would work for the first checkpoint only on the 4th AI. Once it passed the first checkpoint, it would read a negative number, indicating the next checkpoint was behind the AI. Obviously, this was not the case. I tried another AI, this time the 3rd AI, and it was grabbing the proper checkpoints for the dot product on multiple episodes and tracks, but read negative numbers if it was turning to the left (from the starting line). This is definitely not what I had expected would be returned for the dot product.
I went through and debugged my code a bit, and discovered the 4th AI bug was due to a wrong variable being entered into an if statement, which bugged out the dot product. Now, it is reading the proper positions for the next checkpoint, and is getting proper dot products, at least when going straight/right.
I experimented a bit with the dot product to see if I could get it to return a proper "1" value when looking at the checkpoint in the proper way, but I couldn't seem to get that part figured out. Currently, my direction dot looks like this "float directionDot = Vector3.Dot(transform.forward, (nextCheckpoint - this.transform.position).normalized);" with this being put onto the driver agent script, and attached directly to the AI car. I tried passing the next checkpoints forward position in the first part, but it wasn't reading the dot product correctly at all. I tried changing up the this.transform.position to localPosition, but it didn't change anything in the results. I probably do need to leave it as localPosition though, as if I have another track far out from the origin, results are going to be a bit wack.
But, when I ran a test with it set to localPosition, they started to veer straight to the left on the start. I added a public variable for the direction Dot so I could monitor them as they went, and the front ai were reading near 0 for their dot, whereas the back AI were getting their standard values... As I watched the other test track environments that were setup, they too were experiencing similar issues. On some episodes, the dot product would return totally wack values, and at other episodes, it was completely fine. There was no correlation to track that I could find. I have absolutely no idea why this is happening. They share the same code, why would it randomly bug out like that? Seems like I have some more debugging to do...
I tried another test without the dot product observation, and let it sit for about 11 million steps of training. The AI were a little bit better, but had some consistency issues past that point. They could get to the first turn sometimes, but not others.
I'm wondering if I need to do separate training for each of the tracks, or don't randomize the tracks on each episode. It seems the AI get confused and can't properly learn the tracks/environments and what they should/shouldn't do. I think that will be my next test tomorrow.

Anybody have any thoughts on that? Would physically changing the learning environment to an entirely different layout truly be causing me some issues? Would it then be best to do maybe 1 track per AI to have some consistency in its training?

hughperkins · Jan 18, 2023

The dot product is a relative thing. It doesn't matter whether you use vectors in local space or vectors in world space, as long as you are consistent. For normalized vectors, the dot product is 1 when the two vectors point in the same direction, 0 when they are at right angles, and -1 when they point in opposite directions. Imagine you draw two vectors on your screen that are pointing in the same direction, i.e. parallel lines/vectors. The dot product would be 1, since they point in the same direction. If you rotate your screen, the dot product between the two vectors continues to be 1, no matter what space you use to represent the vectors. If you rotate the screen so that both vectors point upwards, now you are in 'local space' of each of the vectors. Dot product is still 1.

Code (csharp):

float directionDot = Vector3.Dot(transform.forward, (nextCheckpoint - this.transform.position).normalized);

This looks perfectly fine to me.

Note that the dot product is symmetrical: it doesn't tell the agent whether it should turn left or right. If the agent is looking 20 degrees left of the checkpoint, it will get the same dot product as if it is looking 20 degrees right of the checkpoint. All the agent knows is that it is looking in the wrong direction, but it has no clue how to correct this.

Why not simply provide the actual vector (potentially normalized) from the agent to the target? Make sure this vector is in fact in local space.
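
To make "the vector to the target, in local space" a bit more concrete, here is a small engine-independent sketch in Python (not Unity C#). It assumes a flat track, (x, z) positions, and a yaw angle measured like a Unity Y rotation (clockwise from world +z). The function name `world_to_local_xz` is made up for this example. In Unity itself, I believe `transform.InverseTransformDirection` applied to the world-space offset gives the equivalent in full 3D.

```python
import math


def world_to_local_xz(agent_pos, agent_yaw_deg, target_pos):
    """Express the agent-to-target offset in the agent's frame.

    Returns (local_x, local_z), where local_x is "to my right" and
    local_z is "in front of me". Positions are (x, z) tuples.
    """
    dx = target_pos[0] - agent_pos[0]
    dz = target_pos[1] - agent_pos[1]
    yaw = math.radians(agent_yaw_deg)
    # For a yaw of theta: forward = (sin, cos), right = (cos, -sin).
    local_x = dx * math.cos(yaw) - dz * math.sin(yaw)  # offset . right
    local_z = dx * math.sin(yaw) + dz * math.cos(yaw)  # offset . forward
    return local_x, local_z


# Agent at the origin facing world +x (yaw 90). A target further along +x
# is straight ahead; a target along world +z is on the agent's left.
ahead = world_to_local_xz((0.0, 0.0), 90.0, (5.0, 0.0))
left = world_to_local_xz((0.0, 0.0), 90.0, (0.0, 5.0))
```

The sign of `local_x` is what the single forward dot product cannot give you: negative means the target is to the left, positive means it is to the right.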

hughperkins · Jan 18, 2023

By the way:

If you do want to use dot product, then what you can do is provide the agent with the results of *two* different dot products:
- dot product with Vector3.forward
- dot product with Vector3.right

The agent is now able to differentiate between looking too far right of a target and too far left of a target. If the agent is looking at the target, dot product with Vector3.right will be 0. If the agent is looking too far right of the target, dot product with Vector3.right will be < 0. If the agent is looking too far left of the target, dot product with Vector3.right will be > 0.

Alternatively, if you have both dot products, you can calculate the angle (0 to 360, or -180 to 180, or normalize one of these, e.g. by dividing by 360). However, this involves some trigonometry. Easiest to just feed either the two dot products to the agent, or the vector to the checkpoint, in local space.
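
For what it's worth, the trigonometry is only one `atan2` call. Below is a rough Python sketch (not Unity C#) that computes both dot products and the signed angle from them. It works on 2D (x, z) vectors, assumes `forward` and `right` are already unit length, and the name `steering_observations` is invented for this example.

```python
import math


def steering_observations(forward, right, to_target):
    """Return (dot_forward, dot_right, signed_angle_deg) for 2D (x, z) vectors.

    signed_angle_deg is in (-180, 180]; positive means the target is to
    the right of where the agent is looking, negative means to the left.
    """
    length = math.hypot(to_target[0], to_target[1])
    if length == 0.0:
        # Sitting exactly on the target: report "perfectly aligned".
        return 1.0, 0.0, 0.0
    tx, tz = to_target[0] / length, to_target[1] / length
    dot_forward = forward[0] * tx + forward[1] * tz
    dot_right = right[0] * tx + right[1] * tz
    signed_angle_deg = math.degrees(math.atan2(dot_right, dot_forward))
    return dot_forward, dot_right, signed_angle_deg


# Agent facing world +z. Two targets 45 degrees either side of its nose
# share the same forward dot, but the right dot and angle flip sign.
fwd, rgt = (0.0, 1.0), (1.0, 0.0)
target_right = steering_observations(fwd, rgt, (1.0, 1.0))
target_left = steering_observations(fwd, rgt, (-1.0, 1.0))
```

Dividing the angle by 180 would put it in the -1 to 1 range if you prefer a single normalized observation.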

Nrike458 · Jan 21, 2023

That makes sense. I see how that should be setup. For whatever reason, mine was doing funky things, so I ended up removing it for this next round of training. After 25 million steps of training, almost 7 hours of training, it seems like progress has flatlined. They start out very slow, taking almost 10 million steps to hit the first checkpoint consistently, then have a ton of progress to get to the next 7 or so checkpoints within the next 5 million steps, but from there, I see no improvement in things. They aren't able to take a left turn on the start sometimes or be able to get past the 10th or so checkpoint. Once they hit a certain point on some tracks, they lock full brakes, and come to a stop, and start backing up for whatever reason. Other times, they just start driving off into the sunset and never looking back. The cumulative reward graph shows constant spikes up and down. I am honestly stumped at this point as to what in the world they are doing.

Just to recap, for observations, I'm passing the velocity and angular velocity from the rigidbody, and am passing along the next checkpoint local position and current local position. Along with that, I have a raycast perception sensor 3d that has detectable tags for the checkpoints, walls, AI, AI2, AI3, and AI4, along with the player, and nothing is in the layer mask. The AI get a +1 reward for crossing the next checkpoint, and get -.5 for hitting the wall, and get -.25 for staying in the wall. My config file is listed below.

behaviors:
  AIDriving:
    trainer_type: ppo
    hyperparameters:
      batch_size: 120
      buffer_size: 12000
      learning_rate: 0.0003
      beta: 0.001
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000000
    time_horizon: 1000
    summary_freq: 12000
    threaded: true

I'm not sure what to test and try at this point. I feel that at 25 million steps, the AI should've been getting through the tracks pretty decently, but still no dice yet. I've confirmed at this point that the observations are all passing the info along properly. Do you have any other suggestions of things that I should try?

hughperkins · Jan 21, 2023

Question, what is 'local position'? Won't that always be (0, 0, 0)? I think you can remove that.

If you make the checkpoints close enough together, like join-the-dots, you don't need the ray sensors for now. You can remove that for now.

With one single value: the dot product of the normalized vector to the next checkpoint, with Vector3.right, the agent should be able to navigate between checkpoints.

Remove the walls, to simplify.

This task should be easy.

Then gradually increase distance between checkpoints. The task should still be straightforward, learnable in 100,000 steps or so. Shouldn't take millions I reckon. Make sure the first few checkpoints are quite near each other.

Then you can start to think about adding in walls and ray sensors, at that point, I reckon. You only need this once the distance between the checkpoints is long enough that the agent can't just drive between the checkpoints in a straight line any more.
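
If you would rather not switch stages by hand, one possible way to automate "start easy, then gradually make it harder" is sketched below in Python (not Unity C#). Everything here is hypothetical: the `Curriculum` class, the 80% promotion threshold and the 100-episode window are placeholder choices for illustration, not values from ML-Agents or from this thread.

```python
from collections import deque


class Curriculum:
    """Move to the next stage once the recent success rate is high enough."""

    def __init__(self, stages, promote_at=0.8, window=100):
        self.stages = list(stages)
        self.promote_at = promote_at
        self.results = deque(maxlen=window)  # most recent episode outcomes
        self.index = 0

    @property
    def stage(self):
        return self.stages[self.index]

    def record(self, success):
        """Log one episode outcome; return True if this call advanced the stage."""
        self.results.append(1.0 if success else 0.0)
        window_full = len(self.results) == self.results.maxlen
        on_last_stage = self.index == len(self.stages) - 1
        if window_full and not on_last_stage:
            if sum(self.results) / len(self.results) >= self.promote_at:
                self.index += 1
                self.results.clear()  # judge the new stage from scratch
                return True
        return False


# Stages here are checkpoint spacings in metres (made-up numbers).
curriculum = Curriculum([2.0, 5.0, 10.0], promote_at=0.75, window=4)
advanced = [curriculum.record(True) for _ in range(4)]
```

Clearing the window on promotion means the agent has to earn the next step on the harder stage rather than coasting on old results.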

hughperkins · Jan 21, 2023

Oh... you might want to randomize the starting location, to somewhere near the first checkpoint, but always in a different direction from the checkpoint, so that it has a chance of hitting the checkpoint randomly sometimes. And make sure to time out the episode if the checkpoint isn't hit within a few seconds or so.
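
A minimal Python sketch of that idea (not Unity C#), assuming a flat track with (x, z) positions. The function names, the 2 to 6 metre spawn ring and the 250-step timeout are all placeholder assumptions you would tune for your own scene.

```python
import math
import random


def sample_start(checkpoint, min_dist=2.0, max_dist=6.0, rng=random):
    """Pick a start pose near `checkpoint`.

    The spawn point lies at a random bearing and distance from the
    checkpoint, and the heading is random too, so the agent sometimes
    has to turn around to reach it.
    """
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(min_dist, max_dist)
    x = checkpoint[0] + dist * math.sin(bearing)
    z = checkpoint[1] + dist * math.cos(bearing)
    yaw_deg = rng.uniform(0.0, 360.0)
    return (x, z), yaw_deg


def timed_out(steps_since_checkpoint, max_steps_without_checkpoint=250):
    """True once the agent has gone too long without hitting a checkpoint."""
    return steps_since_checkpoint >= max_steps_without_checkpoint


position, yaw = sample_start((10.0, 20.0), rng=random.Random(0))
```

In the agent you would reset the step counter every time a checkpoint is hit and end the episode when `timed_out` returns True.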

Nrike458 · Jan 25, 2023

Alright, so I tried some things out over the past few days, and I'm definitely starting to get some results that I'm happy with. To start, I fixed the bugs with the dot product, and added in the 2 different dot products like you mentioned, and removed the local position. I did a basic training with a straight line course, but I kept in the walls that end the episode, and left the raycast sensors on the car. After about 600k steps, the AI were getting to the end checkpoint fairly consistently. I then lengthened the distance between checkpoints, and was able to get them through that course in about 100k steps. Next up, I did a right hand turn training, and they got that in about 2.5 million steps. I then did a left hand turn training, and they got through that course in about 1.2 million steps. I then let the AI train on the real deal: 12 tracks that are randomized at the start of each episode. For this training, I just let 1 AI be on the test environment at a time, instead of all 4 of them from earlier. After 2 million steps, they were doing quite well, making it through most of the courses without issues. I let them train to 5.2 million steps, and they were getting a bit farther than earlier. To compare, when I ran the tracks myself, I was getting to about the same points that the AI were. However, I still have a couple tweaks I need to do. Some of the tracks have some switchback/chicane style turns, and the AI definitely didn't do those well. I might try and force some training on that, or modify how the checkpoint structure is set up for those specific instances. I also don't know how the AI will react with others. I'm assuming they will be able to detect each other, but will probably collide with each other. Once I get them fully figured out by themselves, I might need to try some training of them with each other to get an idea of how they will react, as I definitely don't want them to take each other out in the final game.
Thanks for your help in figuring all of this out! I'll be sure to post another update once I get some more training done!

hughperkins · Jan 25, 2023

> I then lengthened the distance between checkpoints, and was able to get them through that course in about 100k steps

Nice! 100k sounds like about what I would expect

> After 2 million steps, they were doing quite well, making it through most of the courses without issues

Nice!

> However, I still have a couple tweaks I need to do. Some of the tracks have some switchback/chicane style turns, and the AI definitely didn't do those well

One thing you could do is make your agent start at random positions around the track, instead of always starting at the beginning. This guy found that worked pretty well https://youtube.com/watch?v=SX08NT55YhA&si=EnSIkaIECMiOmarE&t=574
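
Roughly, the idea could look like the Python sketch below (not Unity C#). It assumes the track is a closed loop stored as an ordered list of (x, z) checkpoint positions, so the index wraps around; for a point-to-point track you would leave out the last checkpoint as a spawn option instead. The name `random_track_start` is invented for this example.

```python
import math
import random


def random_track_start(checkpoints, rng=random):
    """Spawn on a random checkpoint, facing the one after it.

    Returns (spawn_position, yaw_deg, next_checkpoint_index). Yaw is
    measured clockwise from world +z, in degrees, in the range [0, 360).
    """
    start_index = rng.randrange(len(checkpoints))
    next_index = (start_index + 1) % len(checkpoints)  # wrap for a loop
    sx, sz = checkpoints[start_index]
    nx, nz = checkpoints[next_index]
    yaw_deg = math.degrees(math.atan2(nx - sx, nz - sz)) % 360.0
    return (sx, sz), yaw_deg, next_index


# A made-up square circuit with four checkpoints.
square = [(0.0, 0.0), (0.0, 10.0), (10.0, 10.0), (10.0, 0.0)]
spawn, heading, next_cp = random_track_start(square, rng=random.Random(1))
```

The checkpoint system also needs to be told which checkpoint comes next (`next_cp` here), otherwise the agent will be rewarded against the wrong one.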

Nrike458 · Jan 25, 2023

Oooooo. I do like the looks of that. I'm definitely going to give that a try, as that makes sense in my head as well. Also going to watch that entire video, as I could probably find a few extra pointers in there as well! Thanks for the suggestion!

hughperkins · Jan 25, 2023

Nrike458 · Feb 1, 2023

Sorry about the wait. Had some personal stuff come up, and had to fix quite a few bugs that arose from setting this all up. It seems to help in training quite a bit, and they seem to be doing quite a bit better. I did a training run for about 7 million steps, and they were doing pretty well, until they hit a tight 90 degree or 180 degree style corner. Larger, sweeping turns they did pretty well in, along with long straights, but they couldn't quite get those tighter turns down. When I drive these tracks, I'm definitely able to predict/see into the corner, which provides me a huge advantage over what these AI can "see." I think I'm going to look into adding in observations of the positions of the next few checkpoints from where the car is currently at to see if that improves anything. Thoughts on that?

hughperkins · Feb 1, 2023

Ok. Sounds like you've got things working in the absence of momentum, i.e. if the car just always went at a fixed speed, then things would work fine. So that's great!

So, then the next thing you need to add in, I reckon, is the ability for the AI to know how fast it is travelling. If you just report to the AI its position and the location of the checkpoint, then the AI doesn't actually know if it's moving at 3mph or 300mph.

The general solution to give the agent knowledge of speed is to pass in the last few frames of observations, not just the observation from the current time-step. So, then it can see that last frame it was 50 yards from the corner, and now it is 40 yards from the corner, so it can get a feel for how fast it is travelling.

Then, based on that, it might decide to slow down etc.
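
A bare-bones Python sketch of stacking the last few observations (not Unity C#). The `ObservationStack` class is made up for illustration, and padding with zeros at the start of an episode is just one possible choice. As far as I know, the Stacked Vectors setting on the Behavior Parameters component does this kind of stacking for you in ML-Agents, so you may not need to write it yourself.

```python
from collections import deque


class ObservationStack:
    """Keep the last `num_frames` observations and return them flattened."""

    def __init__(self, obs_size, num_frames=3):
        self.obs_size = obs_size
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)
        self.reset()

    def reset(self):
        """Call at the start of each episode; old frames become zeros."""
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append([0.0] * self.obs_size)

    def push(self, obs):
        """Add the newest observation; return all frames, oldest first."""
        if len(obs) != self.obs_size:
            raise ValueError("observation has the wrong size")
        self.frames.append(list(obs))
        return [value for frame in self.frames for value in frame]


# Distance to the corner and speed, observed over three steps.
stack = ObservationStack(obs_size=2, num_frames=3)
stack.push([50.0, 9.0])
stack.push([40.0, 10.0])
stacked = stack.push([30.0, 10.0])
```

With the frames side by side, the network can see the distance dropping from 50 to 40 to 30 and infer how quickly the corner is approaching.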

Nrike458 · Feb 2, 2023

I think I get what you are saying. Currently, I already have observations for the current time-step velocity and angular velocity. I should add some storage of the previous velocities, and pass those along to the AI, correct?

hughperkins · Feb 2, 2023

[deleted. accidentally thought I was in a different thread lol]

Nrike458 · Feb 2, 2023

That makes sense. Definitely would want to know the "best" position/outcomes, which should be provided by the dot product and the raycast sensor, if that's working as it should. I think the previous velocity would be of use, as when I race RC cars and other racing games, having a sense of acceleration is pretty important. I also think I should add in the positions for the next checkpoints, as I myself need to know where I need to be in the next few corners to get the best line through a turn. I'll start working on adding that and doing some more training.

hughperkins · Feb 2, 2023

Oh whoops, I thought this was a different thread lol. Hence my reference to 'tetris'. Let me re-read.

hughperkins · Feb 2, 2023

Your first response was exactly correct, yes, lol. This:

> Currently, I already have observations for the current time-step velocity and angular velocity. I should add some storage of the previous velocities, and pass those along to the AI, correct?

Nrike458 · Feb 2, 2023

Hahaha, no worries! Been there, done that, but hey, that definitely is a good explanation for things! lol

hughperkins · Feb 2, 2023

(sorry for the confusion I'll delete my spurious post)

Nrike458 · Feb 2, 2023

Just to check, would I need to change any config parameters to account for an increase in observations passed to the Agent? I'm thinking that I might try at least 2 frames of "stored" data of velocity, and pass along maybe the next 3-5 checkpoints positions. That would be almost double the amount of observations currently passed to the Agent.

hughperkins · Feb 2, 2023

I guess that you might not need so many future checkpoint positions. Maybe 1 extra is enough? But yeah, you are doubling or tripling the amount of input data each frame, right. How big your network should be is an empirical question really. If you're only passing in a few frames of positions and rotations though, that's only about (3 frames * 2 vectors * 3 floats) = 18 floats in total (plus a few more for the checkpoints), which is still pretty small. But yeah, you could try different sizes of network, for sure.
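
As a quick sanity check on the count, here is the arithmetic as a tiny Python helper (the function name and its parameters are invented for this example). It assumes every observed quantity is a 3-float vector and that checkpoint positions are only sent for the current frame.

```python
def vector_observation_size(frames, vectors_per_frame, extra_checkpoints,
                            floats_per_vector=3):
    """Total number of floats the agent receives per decision."""
    per_frame = vectors_per_frame * floats_per_vector
    return frames * per_frame + extra_checkpoints * floats_per_vector


three_frames = vector_observation_size(3, 2, 0)    # 3 * 2 * 3 = 18
two_frames = vector_observation_size(2, 2, 0)      # 2 * 2 * 3 = 12
with_checkpoints = vector_observation_size(3, 2, 2)  # 18 + 2 * 3 = 24
```

Whatever total you end up with has to match the observation size set on the agent in the editor, otherwise training will not start cleanly.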

Nrike458 · Feb 3, 2023

I might give it a try with where the settings are at for now, with maybe a frame of previous velocities, and maybe 1 or 2 checkpoints, and see how things turn out.

hughperkins · Feb 3, 2023

Sounds great!

Nrike458 · Feb 20, 2023

Hey,

Sorry about the delays on my part. I had a very very busy personal life these past couple weeks, and I'm just getting back into the swing of things. I modified my code to add in observations of the previous velocity of 1 step, and then added in 2 more observations for the next and next next checkpoints. I went and tried a training, but forgot to update my observation size in the editor for my agents. I then adjusted my observation size, but initializing training from a previous run without those added observations throws a runtime error and freezes Unity. I tried a run without initializing things, and that error didn't occur. However, it obviously didn't have any training. Looks like I need to re-train my AI from scratch again using the new code, on the straight, longer straight, then left, then right, then finally the real deal, to get the proper results for the newly added observations. Might be another couple days before I have that training finished.

Thank you again for all of your help!

Nrike458 · Mar 14, 2023

I went through and tried some more training, and I'm pretty happy with the results that I am getting! I think I'm going to rock with them for now.

A quick summary for those who stumble upon this thread in the future with similar issues. Verify your code works properly. Spend a lot of time testing this, or training will pretty much be useless. I know I had to try things again a bunch of times due to bugs. Start small in training, and gradually work your way through more difficult training environments. You'll get the best results starting with something relatively easy, then working your way up. For observations, think about what you would need to get to the objective, and program that. The AI doesn't know what to look for and what it can see; only you can tell it what to look for. Finally, for parameter tuning, check some of the official documentation, and ask around in the forums. You'll be sure to find what you are looking for.

Thanks to everyone who helped me with the issues that I was experiencing! I've learned quite a bit about Machine Learning using ML Agents over these past few months, and can't wait to continue using it in the future!
