When training Match3, the reward in the built-in Unity ML-Agents algorithm is different from the reward in gym

Discussion in 'ML-Agents' started by i1baranov9, Apr 24, 2021.

  1. i1baranov9

    i1baranov9

    Joined:
    Mar 1, 2021
    Posts:
    4
  2. TreyK-47

    TreyK-47

    Unity Technologies

    Joined:
    Oct 22, 2019
    Posts:
    1,820
    I'll bounce this off the team for some guidance!
     
  3. i1baranov9

    i1baranov9

    Joined:
    Mar 1, 2021
    Posts:
    4
    Thank you! Waiting for an answer.
     
  4. celion_unity

    celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    Hi,
    Are you using the example scene here? I think what's happening is:
    1. gym has no concept of "masked" discrete actions. We use these a lot for the match3 integration, because many moves are not valid.
    2. The example scene ignores any moves that are not valid. The code for this is here. Since it ignores the move, no points are awarded.
    If you needed this to work in gym, you'd need to convert the action mask to an observation, and probably need to apply a penalty when trying to make an invalid move; the agent should eventually learn what the action mask observation means and start avoiding those moves, but it will be harder for it to learn.
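    To illustrate the idea, here is a minimal sketch of an environment that exposes the action mask as part of the observation and applies a penalty for invalid moves. This is a toy stand-in, not the real ML-Agents or gym API; the class name, board size, action count, and `INVALID_PENALTY` value are all assumptions for illustration.

    ```python
    import numpy as np

    # Assumed penalty for attempting a masked (invalid) move.
    INVALID_PENALTY = -0.1

    class MaskedMatch3Env:
        """Toy sketch: observation = board state concatenated with the
        action mask, and step() penalizes invalid actions instead of
        silently ignoring them, so a gym-trained agent gets a signal."""

        def __init__(self, n_actions=8, rng_seed=0):
            self.n_actions = n_actions
            self.rng = np.random.default_rng(rng_seed)
            self.board = np.zeros(16, dtype=np.float32)
            self.mask = np.ones(n_actions, dtype=np.float32)

        def _observation(self):
            # Append the mask to the observation so the agent can
            # eventually learn which actions are valid.
            return np.concatenate([self.board, self.mask])

        def reset(self):
            self.board = self.rng.random(16).astype(np.float32)
            self.mask = (self.rng.random(self.n_actions) > 0.5).astype(np.float32)
            if self.mask.sum() == 0:      # keep at least one valid move
                self.mask[0] = 1.0
            return self._observation()

        def step(self, action):
            if self.mask[action] == 0:
                # Invalid move: penalize rather than skip silently.
                return self._observation(), INVALID_PENALTY, False, {}
            reward = 1.0                  # placeholder for match points
            return self._observation(), reward, False, {}
    ```

    The key point is that the mask both appears in the observation and drives the penalty; the agent still has to learn the association, which is why this is slower than native action masking.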
     
  5. i1baranov9

    i1baranov9

    Joined:
    Mar 1, 2021
    Posts:
    4
    I don't quite understand where exactly the action mask needs to be changed.
    Should I just remove the return from the code and give the agent a negative reward instead?
     
  6. celion_unity

    celion_unity

    Joined:
    Jun 12, 2019
    Posts:
    289
    You don't need to change the action mask, but I don't think there's any way for gym to use it, so you should provide it as an observation instead.

    Giving a negative reward instead will still be very hard to learn from, since the agent has to basically learn the rules of what's a valid move, instead of just being told what's valid or not.
     
  7. i1baranov9

    i1baranov9

    Joined:
    Mar 1, 2021
    Posts:
    4
    Good day! I added the action mask to the observation in the Write method of the Match3Sensor class, as shown below. I also added a negative reward when the agent makes an invalid move, but it didn't work. What's my mistake? Am I training for too few steps? My timesteps = 300000. Thank you very much for your help!
    Image: https://ibb.co/bv0xBy4

    The code is available here:
    https://pastebin.com/A4WnW7cc