Validation of what I learned so far;

Discussion in 'ML-Agents' started by Soulkata, Aug 23, 2022.

  1. Soulkata

    Joined: Nov 15, 2015
    Posts: 6
    I am studying ML-Agents and making an F1 race-themed game...

    In my training, I always start with 20 cars and deactivate them as they make dumb choices, like colliding or going too far off track... But as soon as I end one episode, the framework starts a new one, and that is a problem for me, because I want to start the race with all 20 cars... So each car has 2 separate objects, one for the car itself and one for the car agent... I only disable the car itself and stop requesting decisions and actions from the agent... Once all the cars have finished, I properly end the episode... I could just disable the whole car, but then the episode end reason would always be "disabled", and that is not true... Sometimes, when only a couple of cars remain, I find it more useful to report that max step was reached instead of waiting a long time for them to finish...
    1) Is the above logic right?
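
    In code, the split looks roughly like this (a simplified sketch, not my exact implementation; CarAgent, carBody, and FinishRace are placeholder names). EpisodeInterrupted() reports an episode that was cut short, while EndEpisode() marks a true terminal state:

    Code (CSharp):
    using Unity.MLAgents;
    using UnityEngine;

    public class CarAgent : Agent
    {
        // The physical car is a separate object from the agent.
        public GameObject carBody;

        bool crashed;

        public override void OnEpisodeBegin()
        {
            crashed = false;
            carBody.SetActive(true);
        }

        // Called by my race logic when the car collides or leaves the
        // track: hide the body and stop asking for decisions, but do
        // NOT end the episode yet.
        public void Deactivate()
        {
            crashed = true;
            carBody.SetActive(false);
        }

        void FixedUpdate()
        {
            if (!crashed)
                RequestDecision(); // only live cars keep acting
        }

        // Called once by a race manager when the whole race is over.
        public void FinishRace(bool cutShortByTimeLimit)
        {
            if (cutShortByTimeLimit)
                EpisodeInterrupted(); // "max step reached" semantics
            else
                EndEpisode();         // a real terminal state
        }
    }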

    From my experience, if I try to teach the agent too many things at the same time, it won't learn as fast... So I divided my training into 3 steps (a sketch of how the stages can be switched follows the list):
    * All cars are virtual, and they only have to complete the racetrack; note that the cars see each other, but they don't collide yet;
    * Cars are physical, they collide with each other, and they need to complete a lap without colliding with anyone... (This takes considerably more time than step one.)
    * The cars are physical and they are competing with each other.
    2) Is this staged method better / recommended?
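
    One way to switch stages without rebuilding the scene is to read an environment parameter at the start of each episode (a sketch under my own assumptions; the race_stage name and the trigger-collider trick are mine):

    Code (CSharp):
    using Unity.MLAgents;
    using UnityEngine;

    public class StageController : MonoBehaviour
    {
        public Collider[] carColliders;

        void OnEnable()
        {
            // race_stage would be set from the trainer config:
            // 1 = virtual cars, 2 = collisions on, 3 = full race.
            float stage = Academy.Instance.EnvironmentParameters
                .GetWithDefault("race_stage", 1f);

            // In stage 1 the colliders become triggers, so cars can
            // still be seen by raycasts but pass through each other.
            bool physical = stage >= 2f;
            foreach (var col in carColliders)
                col.isTrigger = !physical;
        }
    }

    The race_stage value can then be advanced from the environment_parameters / curriculum section of the trainer YAML, if I understand the docs right.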

    It is very tricky to make one car want to overtake another, because it is a risky process, and one collision will just take both out of the race... And since one car gains a bonus while the other loses it, the overtaking process isn't easily interpreted by a standard PPO scheme. Instead I opted to use self-play, but self-play by definition is used in 1 vs 1 games, like chess... So what I ended up doing is the following: I establish a nemesis for each car on the grid, forming 10 nemesis pairs, and at the end of each episode I compare each pair's scores (which are collected in a separate place) and assign victory to the one with more points. I know this isn't ideal, because, for example, the car in first place can't overtake anyone, but this is the best that I could figure out...
    3) Is this acceptable / recommended?
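
    The settling step at the end of an episode looks roughly like this (a sketch; SettlePair and the +1 / -1 convention are my simplification of the win/loss rewards the self-play docs describe):

    Code (CSharp):
    using Unity.MLAgents;

    public static class NemesisPairing
    {
        // Compare the two scores gathered during the episode and hand
        // out a zero-sum result, the shape self-play expects from a
        // 1 vs 1 game. SetReward replaces any reward already
        // accumulated this step.
        public static void SettlePair(Agent a, float scoreA,
                                      Agent b, float scoreB)
        {
            if (scoreA > scoreB)      { a.SetReward(1f);  b.SetReward(-1f); }
            else if (scoreB > scoreA) { a.SetReward(-1f); b.SetReward(1f);  }
            else                      { a.SetReward(0f);  b.SetReward(0f);  }

            a.EndEpisode();
            b.EndEpisode();
        }
    }

    For ML-Agents self-play to treat the pair as opponents, the two cars would also need different TeamId values on their BehaviorParameters.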
     
  2. i-make-robots

    Joined: Aug 27, 2017
    Posts: 17
    I would not disable cars that are learning. Let the student make mistakes and get a bad score; don't suspend them from school.

    I would add (see the sketch after the list):
    - a negative reward when a car touches anything except the road,
    - extra points for every meter of the course they complete, and maybe
    - extra points based on their rank at any given moment in the race, maybe 1-(rank/20).
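
    Something like this, as a rough sketch (the "Road" tag, the component layout, and all the reward magnitudes are guesses to be tuned, not tested values):

    Code (CSharp):
    using Unity.MLAgents;
    using UnityEngine;

    public class CarRewards : MonoBehaviour
    {
        public Agent agent;       // the car's agent
        public int totalCars = 20;

        // Assumes this sits on the car body with a Rigidbody, and the
        // road pieces are tagged "Road".
        void OnCollisionEnter(Collision other)
        {
            if (!other.gameObject.CompareTag("Road"))
                agent.AddReward(-0.5f); // touched something off-road
        }

        // Call once per decision step from the progress tracking.
        public void ReportProgress(float metersSinceLastStep, int rank)
        {
            agent.AddReward(0.01f * metersSinceLastStep);             // per-meter bonus
            agent.AddReward(0.001f * (1f - rank / (float)totalCars)); // 1-(rank/20) bonus
        }
    }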
     
  3. Soulkata

    Joined: Nov 15, 2015
    Posts: 6
    I have a "training track" that has samples of the things the agents might encounter in the races, in increasing difficulty: sharp corners, yellow flags, pit lane entrances, etc... A lap around this track usually takes too many steps, around 6k... And an F1 car, when it hits something, breaks and can't drive anymore!
    So I opted to end the car's episode early to save time...

    In this scenario, where cars can kill themselves, it is never a good idea to give punishments, because they can choose to kill themselves, or stay idle, rather than keep running and take the penalty... I only give a very small fixed time penalty each decision (sketched in the code below), to force them to go forward, and everything else is rewards.
    Penalties are given in 2 ways:
    * Lack of a possible reward;
    * Ending the episode (this way they can't get more rewards)
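
    The per-decision penalty is the only negative number the agent ever sees (a minimal sketch; the magnitude is a placeholder I tune by hand):

    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    public class PenalizedCarAgent : Agent
    {
        const float TimePenalty = -0.0005f; // placeholder magnitude

        public override void OnActionReceived(ActionBuffers actions)
        {
            // A tiny existential cost per decision, so standing still
            // is never the best policy; everything else is a reward.
            AddReward(TimePenalty);
            // ... apply steering / throttle from actions here ...
        }
    }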

    Already done (the points per meter), but only when on track (not on the grass), and when below the speed limit in the pit boxes!

    This one (the rank bonus) isn't a good thing, because PPO tends to use the higher rewards as a guideline, so even if a car at the back does things better than the ones in front, PPO would still discard it and say that it drove badly...
     
  4. i-make-robots

    Joined: Aug 27, 2017
    Posts: 17
    I didn't know PPO reacted so ...negatively... to negative rewards. Interesting!