
Trouble with learning via self-play in a simple 1v1 FPS setting

Discussion in 'ML-Agents' started by APDev, May 19, 2020.

  1. APDev

Joined: May 19, 2020
Posts: 3
    Hello.

First of all, let me thank the creators of ML-Agents; it appears to be a truly empowering and user-friendly toolkit. Personally, though, I haven't been able to make use of it yet, as I've struggled to get my first project working.

Context: For a project in my master's AI course, I've decided to try training an agent in a simple FPS setting. I took the Pyramids area and modified it as follows:

    dcareas.PNG

The area is inhabited by two agents, who share the same behaviour (a minimal action-handling sketch follows the list):
    • they can move forward/still/backward (action 2),
• strafe right/still/left (action 3),
    • rotate right/still/left (action 4),
    • pull/don't pull the trigger (action 1, shooting is further restricted by the fire rate), and
    • apply/don't apply a precision factor that reduces the move and rotation speeds (action 0).
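For clarity, this is roughly how the five branches are read (simplified sketch only; the exact OnActionReceived signature depends on the ML-Agents version, the index-to-direction mapping is up to the implementation, and MoveAgent/RotateAgent are placeholders for the actual movement code):
Code (CSharp):
// Inside the DCAgent class (which derives from Agent).
public override void OnActionReceived(float[] vectorAction)
{
    // Branch 0: apply / don't apply the precision factor.
    bool precision = Mathf.FloorToInt(vectorAction[0]) == 1;

    // Branch 1: pull / don't pull the trigger (actual shooting is rate-limited).
    int triggerAction = Mathf.FloorToInt(vectorAction[1]);

    // Branches 2-4: forward/still/backward, strafe, rotate.
    int moveAction = Mathf.FloorToInt(vectorAction[2]);
    int strafeAction = Mathf.FloorToInt(vectorAction[3]);
    int rotateAction = Mathf.FloorToInt(vectorAction[4]);

    // Placeholder helpers that translate the indices into motion,
    // scaled down when the precision factor is active.
    MoveAgent(moveAction, strafeAction, precision);
    RotateAgent(rotateAction, precision);
}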

    dcinspector.PNG

They use a camera sensor with a 108x60 resolution and collect no other observations. The camera view also includes a crosshair that turns red when the agent is pointing at the other agent. This is what they (should) see:

    dcsight.PNG
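A check like the one behind the crosshair colour (and the HasTargetsInSight() call in the reward code below) can be done with a forward raycast from the agent's camera; a minimal sketch (agentCamera, sightRange and the tag name are placeholders for whatever the project actually uses):
Code (CSharp):
// Inside the DCAgent class: true if the camera's forward ray hits the other agent.
private bool HasTargetsInSight()
{
    RaycastHit hit;
    if (Physics.Raycast(agentCamera.transform.position,
                        agentCamera.transform.forward,
                        out hit, sightRange))
    {
        // Whatever the ray hit counts as a target if it carries the agent tag.
        return hit.collider.CompareTag("DCAgent");
    }
    return false;
}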

Throughout the past week, I've tried a number of configurations. In the next comment (due to the limit of 5 images per comment), I will show TensorBoard graphs for the following:

    dcconfigs.PNG

Besides the configuration files, the runs also differ in their rewards. Below, the commented-out parts of the reward were used by the runs "dcultimate" (obviously not so ultimate...), "dcg" and "dcsac", while the uncommented parts were used by the "dcx" run.

    Per-step reward:
    Code (CSharp):
// Encourage seeking / staying on targets
// if (HasTargetsInSight()) AddReward(1f / MaxStep);

// Pull trigger
if (triggerAction == 1)
{
    // Discourage wastefulness
    // AddReward(-1f / MaxStep);

    // Shoot
    if (Time.time >= nextTimeToFire)
    {
        nextTimeToFire = Time.time + 1f / fireRate;
        ShootWeapon();
    }
}

// Could ELO be falling due to registering -1f/MaxStep as a loss instead of a draw if the episode ends without a victor?
AddReward(0f);
    Final reward:
    Code (CSharp):
public void ResolveHit(DCAgent winnerAgent, DCAgent loserAgent, float stepRatio)
{
    // winnerAgent.AddReward(2f - stepRatio);
    // loserAgent.AddReward(-2f + stepRatio);
    winnerAgent.SetReward(1f);
    loserAgent.SetReward(-1f);
    winnerAgent.EndEpisode();
    loserAgent.EndEpisode();
}
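On the ELO question in the comment above: an explicit way to score a timeout as a draw might look like this (sketch only; ResolveTimeout is a hypothetical counterpart to ResolveHit that the area controller would call when the episode ends without a victor):
Code (CSharp):
// Hypothetical draw handler: a final reward of 0 for both agents, so that
// self-play ELO records the episode as a draw rather than a loss.
public void ResolveTimeout(DCAgent agentA, DCAgent agentB)
{
    agentA.SetReward(0f);
    agentB.SetReward(0f);
    agentA.EndEpisode();
    agentB.EndEpisode();
}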
For reference, this is how my demo's metadata looks:

    dcdemo.PNG

I realise that, by encouraging "target in sight" behaviour and by adding other per-step rewards/penalties, I'm imposing a bias on the agent, but in this case I thought it was necessary: after millions of training steps, the agents still didn't seem to recognise each other. However, that remained the case even after adding this incentive, hence my persisting problem...

    Part 1/2
     
    Last edited: May 19, 2020
  2. APDev

Joined: May 19, 2020
Posts: 3
    PPO graphs:
    dcgraphs_0.PNG dcgraphs_1.PNG

    SAC graphs:
    dcgraphs_2.PNG dcgraphs_3.PNG

    ELO graph:
    dcgraphs_4.PNG

    Notes:
• I don't have an Nvidia GPU, so I'm forced to train on the CPU, which is why I've never come close to the full 50M steps. The longest run took me 2 days.
• I'm not sure what to think about SAC; it was hard to set up in the first place (it would regularly freeze the simulation for many minutes).

    I've had a few thoughts about what I could be doing wrong:
• Are my configurations or rewards simply inappropriate? Or is there some other mistake I've missed?
• Should I just have more faith that training will eventually start going in the right direction? The thing is, I haven't seen even a glimpse of promising behaviour after 6M steps...
• Is the environment too complex? Should it be made simpler?
• Is the problem itself too hard? Maybe the win condition (see the target and shoot) is too close to the loss condition (be seen by the target and shot).
Anyway, I thought it was about time to consult some people more knowledgeable and experienced than me.

    I thank you all in advance for your time and thoughts.

    Part 2/2
     
    Last edited: May 19, 2020
  3. awjuliani

Unity Technologies
Joined: Mar 1, 2017
Posts: 69
    Hi APDev,

Learning from raw images can take much more data than learning from vector observations, so 6 million steps from eight concurrent agents may be too few. Denser rewards will also help training, as long as they are properly formatted. On top of this, you have an adversarial setup, which makes training more complex. I would perhaps try getting a simple single-player version working with dense rewards, then work your way toward the more complex behavior.
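For example (just a sketch, not anything specific to the ML-Agents API), a denser shaping term for a single-player version could reward the agent each step for aiming at a fixed target; target is a placeholder Transform, and the 1f / MaxStep scale keeps the shaping small relative to the final reward:
Code (CSharp):
// Inside OnActionReceived of a single-player variant of the agent:
// reward alignment between the agent's facing direction and the target
// (dot product is 1 when aiming straight at it, negative when facing away).
Vector3 toTarget = (target.position - transform.position).normalized;
float alignment = Vector3.Dot(transform.forward, toTarget);
AddReward(alignment * (1f / MaxStep));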
     
    APDev likes this.
  4. APDev

Joined: May 19, 2020
Posts: 3