
Resolved OnActionReceived() called before SetMask()

Discussion in 'ML-Agents' started by KevinOfCathay, Jul 9, 2020.

  1. KevinOfCathay

    KevinOfCathay

    Joined:
    Dec 21, 2019
    Posts:
    7
    I'm making a chess AI through self-play.
    There are two players, A and B, and A plays first.
    When A plays, everything works normally. The program calls SetMask(), and then OnActionReceived().
    [Attached image: Screenshot - 7_9_2020 , 12_55_47 AM.png]
    When B plays, it calls OnActionReceived() before SetMask() (the 3rd and 5th lines in the screenshot below).
    [Attached image: Screenshot - 7_9_2020 , 12_56_01 AM.png]
    Even weirder, when player B makes his first move, the agent uses player A's mask. After that first move, player B still calls OnActionReceived() before SetMask(), but uses the mask from the previous round.
    Here are my agent settings:
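    For context, the flow I expect per decision looks roughly like this (a simplified sketch using the ML-Agents 1.x API, not my actual code; GetIllegalMoveIndices() and ApplyMove() are placeholders for the chess logic). The mask should be collected in CollectDiscreteActionMasks() before OnActionReceived() runs:

    Code (CSharp):
    using System.Collections.Generic;
    using Unity.MLAgents;

    // Simplified sketch of the agent callbacks, not the real agent.
    public class ChessAgent : Agent
    {
        public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
        {
            // Should run before OnActionReceived() for the same decision,
            // so the policy only samples legal moves.
            actionMasker.SetMask(0, GetIllegalMoveIndices());
        }

        public override void OnActionReceived(float[] vectorAction)
        {
            ApplyMove((int)vectorAction[0]);
        }

        // Placeholders for the actual chess rules.
        IEnumerable<int> GetIllegalMoveIndices() { return new int[0]; }
        void ApplyMove(int moveIndex) { }
    }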
    Player A:
    [Attached image: Screenshot - 7_9_2020 , 1_07_03 AM.png]
    Player B:
    [Attached image: Screenshot - 7_9_2020 , 1_06_53 AM.png]
     
    Last edited: Jul 9, 2020
  2. KevinOfCathay

    KevinOfCathay

    Joined:
    Dec 21, 2019
    Posts:
    7
    Well, I spent a few hours and finally found the problem: I need to yield
    WaitForFixedUpdate
    before requesting the agent's decision.
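    In case anyone else hits this, here is roughly what I mean (a minimal sketch; the TurnManager class and its member names are placeholders, not my actual code):

    Code (CSharp):
    using System.Collections;
    using UnityEngine;
    using Unity.MLAgents;

    // Rough sketch of the fix: wait one physics step before asking the next
    // agent for a decision, so its mask is collected before OnActionReceived().
    public class TurnManager : MonoBehaviour
    {
        public Agent playerA;
        public Agent playerB;

        IEnumerator RequestMove(Agent player)
        {
            // Let the previous agent's step finish first.
            yield return new WaitForFixedUpdate();
            player.RequestDecision();
        }

        // Example usage: after A's move is applied, request B's decision,
        // and vice versa.
        public void OnMoveApplied(bool aJustMoved)
        {
            StartCoroutine(RequestMove(aJustMoved ? playerB : playerA));
        }
    }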
     
  3. andrewcoh_unity

    andrewcoh_unity

    Unity Technologies

    Joined:
    Sep 5, 2019
    Posts:
    162
    Glad you were able to solve this. Out of curiosity, how come you are using stacked observations for a chess AI?

    In general, I'd be really interested to see how this turns out.
     
  4. KevinOfCathay

    KevinOfCathay

    Joined:
    Dec 21, 2019
    Posts:
    7
    My observation is simply the whole chess board. It's like an image, where each pixel represents a single cell of the board. The reason the number of observations is 100 instead of 64 is that I'm using a 10*10 chess board :p
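    Roughly like this (simplified sketch; 'board' is a stand-in for my actual board representation):

    Code (CSharp):
    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;

    // Simplified sketch: one float per cell of the 10x10 board, 100 values total.
    public class ChessBoardAgent : Agent
    {
        // Placeholder for the real board state (e.g. piece type / owner per cell).
        float[,] board = new float[10, 10];

        public override void CollectObservations(VectorSensor sensor)
        {
            for (int x = 0; x < 10; x++)
                for (int y = 0; y < 10; y++)
                    sensor.AddObservation(board[x, y]);
            // The "Stacked Vectors" setting in Behavior Parameters then stacks
            // the last few of these 100-value observations automatically.
        }
    }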
    The stacked-observation idea comes from AlphaGo's architecture: AlphaGo used 17 stacked observations (the current state plus the past few moves) for training. So here I also use stacked observations, just a relatively small number of them.
    [Attached image: 20071016-02051.png]
    This is a screenshot I took from a YouTube video (AlphaGo Zero Tutorial Part 3 - Neural Network Architecture), and I believe it is accurate.

    As for training performance, I don't really know whether I benefit from stacked observations. I tried different settings, like a stack of 3, a stack of 4, etc., but there isn't much difference between the stack sizes.
    The real problem for me is that at a certain point (say, after 1 million steps) both players stagnate: they stop exploring new policies and more often just repeat a certain (crappy) strategy. So it's hard to tell whether training with stacked observations gives me a better policy.
     
    Last edited: Jul 10, 2020