Question Turn-based RPG Card Game, different monsters.

Discussion in 'ML-Agents' started by Denaton, Dec 30, 2020.

  1. Denaton

    Denaton

    Joined:
    Apr 6, 2011
    Posts:
    16
    Since the summer I have been trying to train an AI to play the enemies in my game.
    To understand how the combat in my game works, take a look here; it's easier to show than to explain, but basically "what if Zelda was a card game".

    The battlefield is a grid, and the enemies have their own decks with different cards depending on their type.
    This means they don't all have the same chance of winning: Larva can basically only push others, while Goblins can shoot with a bow.
    I have been using this agent file
    This is my Agent script. It's not very clean, since I have tried a lot of different settings, and right now I am trying to hard-set the reward at 1, 0, -1. (I have a setup script where I set the winners, give a reward and loop it around.)

    The Cumulative Reward goes up and then peaks, and when it peaks the AI still plays random cards.
     
  2. Luke-Houlihan

    Luke-Houlihan

    Joined:
    Jun 26, 2007
    Posts:
    303
    Your rewards are too sparse. You need to add small rewards for correct actions in context. For this game I would add a small reward every time the agent plays a move card when not near the enemy (probably even a bigger one if the movement causes them to get closer instead of further away) and another small reward when the agent does an attack that lands and deals damage. You can experiment with negative rewards for doing the opposite but be careful not to over punish or the agent will become suicidal.
     
  3. Denaton

    I did this and trained it for about 4 days non-stop; it still didn't work. I only added the bigger reward today, after I read on the wiki "we encourage users to begin with the simplest possible reward function (+1 winning, -1 losing)".

    In the code I shared, I have commented out rewards that give +0.1 and -0.1 for dealing and taking damage, +1 if you killed someone, +0.5 if a teammate killed someone, -1 if you died and -0.5 if a teammate died.
    I have been tweaking the values for half a year now, setting them both high and low, disabling the negative ones and enabling them again.
    I run the training for about 3-4 days non-stop (restarting the editor every morning and night due to a small memory leak after 8+ hours of running), and when the Cumulative Reward curve flattens the agents still seem really random and just pick cards at random.

    I also have a reward on the movement card if they move closer. In some cases that might be a dumb thing to do: for example, the Skeleton Mage only has spell attacks while the Water Snake has a strong melee attack. The Skeleton Mage should take the Water Snake with ease, but it dies because it runs into the snake to push the reward value higher before dying.
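    The two reward schemes discussed in this post can be written down side by side. This is only a sketch in plain Python, not the poster's Agent script: the event names are made up for illustration, and it assumes +0.1 is for dealing damage and -0.1 for taking it.

```python
# Per-event shaping rewards, using the values quoted in the post.
# The event names are hypothetical labels, not identifiers from the real script.
EVENT_REWARDS = {
    "dealt_damage": 0.1,
    "took_damage": -0.1,
    "killed_enemy": 1.0,
    "teammate_killed_enemy": 0.5,
    "died": -1.0,
    "teammate_died": -0.5,
}


def shaped_reward(events):
    """Sum the shaping rewards for every event that happened this step."""
    return sum(EVENT_REWARDS[event] for event in events)


def sparse_reward(won, lost):
    """The simplest scheme: +1 for a win, -1 for a loss, 0 otherwise."""
    if won:
        return 1.0
    if lost:
        return -1.0
    return 0.0
```

    With values like these, thirty damage events at 0.1 each add up to 3, which is more than the reward for a kill or a win, so the shaping terms may end up dominating what the agent optimizes for.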
     
  4. ervteng_unity

    ervteng_unity

    Unity Technologies

    Joined:
    Dec 6, 2018
    Posts:
    150
    Hi Denaton, have you tried training the AI of just one unit type? The agent might be having issues learning a policy that works for all enemy types, and you might need to train a separate Behavior for each one.
    Also it'd be super helpful to post some of your TensorBoard curves, maybe it's a simple issue like having to increase the size of the neural network.
    Related: we're working on a feature that should help with varying numbers of agents/cards in the observation space. Stay tuned!
     
  5. Denaton

    When I started training, I first made a special enemy called ML-Agent with some basic cards like move, turn and strike. I also trained it with additional random cards added; same outcome.

    This is the latest. I have had so many runs that I have deleted the old ones...
    upload_2020-12-30_19-3-17.png

    This is the one I am running now, with the rewards set as in the code I shared.
    upload_2020-12-30_19-7-3.png

    I had this one a few weeks ago too, wanted to share it because it looks funny, no clue what happened here..
    upload_2020-12-30_19-5-15.png
     
  6. Luke-Houlihan

    Ah good point, I hadn't noticed the commented out rewards.

    I took a deeper look into your agent script and I think your training issue may come down to shifting in both your observation and action space. Please correct me if I misunderstood any of your code below; my LINQ method syntax experience is a few years behind me and the statements can be a little dense.

    Action Space -
    The logic here appears to be: 1) Get a list of the cards that can be played. 2) If there are playable priority cards, only these are added to the list; otherwise every playable card is added. 3) If the first vector action isn't 0 (I'm assuming that's for no action) and there are 1 or more playable cards, scale the vector action to the size of available cards (?? I'm not sure I understand this part: line 117 gets the product of vectorAction[0] and the number of available cards, which for any vector action value other than 1 will be out of the array bounds. If the action vector value is always 1 then the agent can only ever select the last card ??). 4) That card is played.

    This is problematic because the played card has a very loose relevance to the value being returned by the policy. For example the agent can choose vectorAction[0] = 1 at timestep 5 and "Momentum" gets played moving the agent, then at timestep 6 it may choose vectorAction[0] = 1 again and get "Turn Right". Now given enough timesteps (potentially on the order of hundreds of millions) the agent should pick up the connection between the observed hand and the played card.
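    The shifting can be reproduced in a few lines. The sketch below is plain Python, not the poster's C# script. It assumes values at or below 0 mean "no card" and that values in (0, 1] are scaled to an index into the playable list and clamped to the last slot. The two hands are invented for illustration.

```python
def card_from_continuous(action, playable):
    """Map one continuous action value onto the current list of playable cards.

    Assumed mapping (not taken from the real script): action <= 0 plays
    nothing; otherwise the value is scaled by the list length and clamped
    to the last valid index.
    """
    if action <= 0 or not playable:
        return None
    index = min(int(action * len(playable)), len(playable) - 1)
    return playable[index]


# Hypothetical hands at two consecutive timesteps.
hand_t5 = ["Strike", "Turn Left", "Momentum"]
hand_t6 = ["Strike", "Turn Right"]

played_t5 = card_from_continuous(1.0, hand_t5)  # last card of a 3-card hand
played_t6 = card_from_continuous(1.0, hand_t6)  # last card of a 2-card hand
```

    The policy output is identical in both steps, yet `played_t5` is "Momentum" and `played_t6` is "Turn Right": the meaning of the number depends entirely on what happens to be in the list.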

    Observation Space -
    I believe you're inputting all the necessary values to observe in the state; the shifting happens when you gather a list of the local units and order them by proximity to the agent. Units will shift in this list as some get close to the agent and others get further away. If there are more than 10 units, new entities entirely could be introduced as they move closer.

    Again given enough time an agent can probably learn to adjust to these values being ephemeral over the episode.
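    A small sketch of that effect, again in plain Python rather than the original script. Manhattan distance on the grid and the unit positions are assumptions made for the example.

```python
def observe_nearest(agent_pos, units, max_units=10):
    """Return unit names ordered by distance to the agent, nearest first.

    `units` is a list of (name, (x, y)) pairs. Manhattan distance is an
    assumption; the real script may measure proximity differently.
    """
    def distance(unit):
        _, (x, y) = unit
        return abs(x - agent_pos[0]) + abs(y - agent_pos[1])

    nearest = sorted(units, key=distance)[:max_units]
    return [name for name, _ in nearest]


agent = (0, 0)
# The same two units on consecutive turns, after both have moved.
turn_1 = [("Goblin", (1, 0)), ("Larva", (3, 0))]
turn_2 = [("Goblin", (4, 0)), ("Larva", (2, 0))]

slots_turn_1 = observe_nearest(agent, turn_1)
slots_turn_2 = observe_nearest(agent, turn_2)
```

    Observation slot 0 holds the Goblin on the first turn and the Larva on the second, so the network cannot treat "slot 0" as a fixed entity from one step to the next.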

    It was very clever to take the card description/unit names and feed it in as a unique int value btw.

    Either one of those shifting states could probably be overcome alone in training however the more you add, the more abstract relationships the policy will have to infer, which requires a more complicated network and many more training iterations.

    Hope this helps.
     
  7. Denaton

    Thanks a lot for the feedback. Do you think the policies will be better if I have a fixed hand size to select from in the action space? I have limited the hand to 15 in the observation, so maybe I should use 15 as the default size to select from?
    When I made the card selection, I made it so that below 0 you skip your turn (even if you can play more), and above 0f up to 1f (Continuous) is the percentage of which card you want to play. They have never had more than 10 cards in their hand at the same time though, so the overflow handling on that observation has never affected them, yet.
    But you gave me a few new ideas I can try.

    I am really new to machine learning, so I have no clue what I am doing.
    Also, sorry for the messy code, I will refactor it when it works...
     
  8. Luke-Houlihan

    Absolutely. In fact, I don't think you should be using continuous actions at all: picking a card from a subset of all cards is a discrete problem space because we know about all possible cards. And there's no logical way to play half of one card and half of another; each card is indivisible. I would recommend trying a discrete branch with a size of all cards. Then just use the action masker to mask (remove as an option to the agent) all cards that aren't in the agent's hand.

    This way each action chosen by the agent maps directly to an unchanging (no shifts!) card played. Now the agent can "focus" on learning the relationship between the observed hand and the available options with no shifting action values. The one downside is that this will only allow 1 card selection per turn, but it's a start.
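    The idea behind masking can be sketched independently of ML-Agents. The code below is plain Python and not the ML-Agents action masker API. The card list, the hand, and the convention that index 0 means "End Turn" are assumptions for the example, and the uniform scores stand in for whatever the policy network would output.

```python
import math
import random


def masked_sample(logits, allowed, rng=random):
    """Sample an action index from softmax(logits), limited to `allowed`.

    Masked indices get zero weight, so they can never be chosen.
    """
    if not allowed:
        raise ValueError("at least one action must stay unmasked")
    top = max(logits[i] for i in allowed)  # subtract the max for numerical stability
    weights = [
        math.exp(logits[i] - top) if i in allowed else 0.0
        for i in range(len(logits))
    ]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# One fixed index per card in the game; index 0 is assumed to mean "End Turn".
ALL_CARDS = ["End Turn", "Strike", "Turn Left", "Turn Right", "Momentum"]
hand = {"Strike", "Momentum"}

# Only "End Turn" and the cards currently in hand remain selectable.
allowed = {i for i, card in enumerate(ALL_CARDS) if card == "End Turn" or card in hand}

scores = [0.0] * len(ALL_CARDS)  # placeholder policy output
choice = ALL_CARDS[masked_sample(scores, allowed)]
```

    Because index 4 always means "Momentum" regardless of what else is in the hand, the action values no longer shift between turns; only the mask changes.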

    That makes more sense than my previous understanding, for some reason I thought you were using discrete actions and the fact that you were using continuous actions didn't occur to me.

    Well, you're doing great! The problem you're solving here is an interesting one and definitely not trivial. The only advice I'd give you is to try to put yourself in the agent's position and see if you could learn when designing things like the action space. Strip away the card game part (which you intrinsically understand from previous experience) and pretend I just handed you the action space and said "Pick". You'd probably pick something like 0.5 and I would turn your character left. Next turn you select 0.5 again and I teleport you 3 spaces backwards. Next turn you choose 0.1 and I turn your character left again. You'd probably be pretty confused, and you would be forced to spend the next however many turns taking careful notes to figure out why the numbers don't seem to mean the same thing every turn. This isn't even getting into how the choices are infinite in a continuous space, so what's the difference between 0.002335 and 0.002339? When you pick those two numbers in different turns they produce different results, so the choices must be incredibly granular and numerous, right? I think you get where I'm going with this.

    It always astounds me that my agents can learn anything at all through some of the complex crap I throw at them :eek:.

    No worries! The code is great, you should see some of the snarls I've written...
     
  9. Denaton

    I started converting to discrete and was reading about Masking Discrete Actions. So I was thinking: instead of using the capped hand size as the branch size, I set it to the total number of cards and mask the cards the agent can't play. But then I need to change the size of the branch every time I add a new card, and retrain it, right?
    Or is there another way to do it for the full card list?
     
  10. Luke-Houlihan

    The action space cannot change once training has begun, so there will be no way to add new cards to a previously trained policy. This is a common unsolved problem in machine learning referred to as "Generalization". This limitation applies to any implementation you attempt because it's a side effect of the training algorithm (or method) and not something that can be addressed in the environment or reward structure.

    For example, let's pretend the implementation you posted worked and a good policy was trained from it. Even though in that case your environment allowed the policy to play a newly added card, the policy would never have any idea (and would never learn; remember, we are inferencing now) how the new card worked or when it should play it. Essentially, if the new card was ever played it would only be because there is still some randomness in the policy and not for any tactical reason like we would want.

    You will always have to retrain when the action or observation spaces change.
     
    Last edited: Jan 6, 2021
  11. Denaton

    I can't get Discrete to work; it only returns 0 for all actions.
    I have 3 branches. The first one is for which card to play (I set it to 1000); it picks from the card database and masks a card if it's not playable. 0 is End Turn, so the agent always ends its turn immediately, since it only ever picks 0 as an action.

    I have debugged it, and it should be able to pick a card.
    I have removed code to see if it is something I have done (I removed the masking), but it always picks 0 anyway.
    It even picks 0 if I mask 0...

    The new agent

    Edit: Got it working. I had to uninstall the mlagents Python package and install it again to get the new version. For some reason the update command didn't work, so I thought I had the latest version.
     
    Last edited: Jan 7, 2021
  12. Luke-Houlihan

    Great! Let me know how training goes.
     
  13. Denaton

    I don't know how to read ELO, but it's diving to the bottom right away.
    I set the TeamId at runtime based on what faction they get (I mix the factions, e.g. Goblins and Undead, for diversity), but it seems like they are all on the same team anyway? (TeamId 0 is the default on all of them.)
    Is it even possible to change TeamId at runtime and still make it work?

    upload_2021-1-8_10-35-49.png
    upload_2021-1-8_10-35-17.png
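
    For readers who are also unsure how to read an ELO curve, here is a minimal sketch of the standard Elo rating update. It assumes the ELO reported during training follows the usual formula. The K-factor of 16 and the starting ratings of 1200 are illustrative values, not settings taken from this thread.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=16.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw and 0.0 for a loss.
    The update is zero-sum: whatever A gains, B loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated players; A loses the game.
new_a, new_b = update_elo(1200.0, 1200.0, score_a=0.0)
```

    Under this formula a rating only falls when the player scores worse than its current rating predicts, so a curve that dives immediately suggests the learning agent is being recorded as losing most of its games against its opponents.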