
Is there a full walkthrough or tutorial for a custom Python algorithm?

Discussion in 'ML-Agents' started by schrodinger8834, Dec 2, 2020.

  1. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    Hello, I want to apply my own custom Python RL algorithm to a Unity environment that I also built myself.

    I want to learn how to write my own Python algorithm for ML-Agents and how to apply it to a custom Unity environment.

    I tried to find a good tutorial for that, but there doesn't seem to be a good one for an up-to-date version of ML-Agents.

    I would really appreciate any advice or links to tutorials like that.
     
  2. vincentpierre

    vincentpierre

    Unity Technologies

    Joined:
    May 5, 2017
    Posts:
    160
    schrodinger8834 likes this.
  3. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    Really appreciate the advice.

    Actually, what I want to know is how I should change my own Python algorithm when there are multiple training prefabs in Unity.

    I saw in Unity's docs that I can duplicate the training prefab, which is called parallel training, but the docs only describe this for the built-in algorithms in ML-Agents.

    Could you give me any tutorials or advice on this?
     
  4. vincentpierre

    vincentpierre

    Unity Technologies

    Joined:
    May 5, 2017
    Posts:
    160
    In the GridWorld environment this example uses, there are 9 "prefabs" corresponding to 9 agents. This is illustrated by the number of elements in the DecisionSteps and TerminalSteps being variable. The Trainer class in the tutorial deals with multiple Agents. To set up your Unity scene with multiple prefab Agents, you do not need to do anything different from what the provided algorithms require: having multiple Agents with the same Behavior Name will cause all of them to send and receive data to and from Python.
    The three Colab notebooks I sent are the only ones available at the moment for learning how to use the low-level Python API. Do you have specific questions on how this API works?
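    For reference, here is a minimal sketch of what this looks like on the Python side (it assumes the mlagents_envs package and an environment with several copies of the same prefab; file_name=None simply connects to the running Editor):

    Code (Python):
    from mlagents_envs.environment import UnityEnvironment

    env = UnityEnvironment(file_name=None)  # None = connect to the running Editor
    env.reset()

    behavior_name = list(env.behavior_specs)[0]
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    # One call returns a batch covering every Agent with this Behavior Name
    # that requested a decision (or terminated) during this step.
    print(len(decision_steps), "agents requested a decision:", decision_steps.agent_id)
    print(len(terminal_steps), "agents terminated:", terminal_steps.agent_id)
    print(decision_steps.obs[0].shape)  # (number_of_agents, obs_size) for the first observation

    env.close()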
     
  5. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    Well, first of all, I really want to say thank you, these are exactly the kinds of tutorials I need.

    I don't know why I couldn't find this one, I actually tried pretty hard to find tutorials like this.

    And your Colab code is really helpful.

    Here are some questions:

    1. How should I change the code if the Unity environment has a decision period of 5, like in your
    ML-Agents Q-Learning with GridWorld?

    2. Related to the above question, should I widen (increase) the decision period if my Unity environment needs to calculate a lot of physics (e.g. collisions)?
     
  6. vincentpierre

    vincentpierre

    Unity Technologies

    Joined:
    May 5, 2017
    Posts:
    160
    1. If the decision period is changed for all Agents, nothing should change. Unity would simply send messages less frequently, but Python will receive the observations in a batch. If the period changes differently for different Agents, then you will receive observations whenever Agents request decisions (make sure to look at what the AgentIds are in the DecisionSteps!). The code in the Colab should be able to handle a different decision period, if my memory is correct.

    2. Picking the right decision period can be challenging. If the period is too small, each decision the algorithm makes will have "less" importance and it might be a lot harder for the Agent to assign credit to good and bad decisions. On the other hand, if the period is too large, the Agent might not be able to control the game effectively since the actions would be very sticky. It depends on the game; I usually set this value to around 5 when physics is important. Unity does a fixed update 50 times per second (every 0.02 seconds) unless specified otherwise, so a period of 5 corresponds to 10 decisions per second (every 0.1 seconds).
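    As a small sketch of the "look at the AgentIds" point (last_obs here is just an assumed dictionary in your own training loop, not part of the API):

    Code (Python):
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    # Only the agents listed here requested a decision this step; with different
    # decision periods this set can change from one step to the next.
    for agent_id in decision_steps.agent_id:
        obs = decision_steps[agent_id].obs[0]  # observations for this single agent
        last_obs[agent_id] = obs               # last_obs: an assumed dict keyed by AgentId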
     
  7. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10

    + I tried to modify my code and could run it with Unity, but somehow,

    after a few random episodes,

    the error below occurs.

    https://imgur.com/a/QqPt8si

    And here's my code as well, in case it helps.

    https://github.com/schrodinger8834/custom_ddpg_mlagents/blob/main/test_ddpg.py

    Can I find an example somewhere for an algorithm like DDPG, which is what I'm trying now?

    Because I've been stuck here for a few weeks..

    Can I get some advice on my code and the error above?
     
  8. vincentpierre

    vincentpierre

    Unity Technologies

    Joined:
    May 5, 2017
    Posts:
    160
    This error occurs because the number of decisions requested and the number of decisions sent did not match. Unity was expecting 0 decisions but 9 were sent. The reason 0 decisions were requested is most likely that some agent(s) terminated in between decisions: you had 0 DecisionSteps but some TerminalSteps. You need to make sure that the shape of the action you send is always (number_of_agents, decision_size), and number_of_agents can be 0 if no decisions were requested last step.
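    In code, that means sizing the action batch to len(decision_steps) on every single step, for example (a sketch; action_size is assumed to match your Behavior's continuous action size, and depending on your ML-Agents version you may need to wrap the array in an ActionTuple instead of passing it directly):

    Code (Python):
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    n_agents = len(decision_steps)  # can be 0 if only terminal steps arrived this step
    if n_agents > 0:
        # actions must have shape (n_agents, action_size), in the same order as decision_steps
        actions = agent.get_action(decision_steps.obs[0])
        env.set_actions(behavior_name, actions)

    # stepping without setting actions is fine when nothing requested a decision
    env.step()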
     
  9. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    So should I only put trajectories with DecisionSteps > 0 into the replay memory, and ignore the steps where I get 0 DecisionSteps and some TerminalSteps?

    The reason I think this is that, with code like:

    Code (Python):
    # Main
    if __name__ == '__main__':
        # set env
        env = UnityEnvironment(file_name=None, seed=1, side_channels=[])

        # channel.set_configuration_parameters()

        env.reset()

        print('env.behavior_specs : {}\n'.format(list(env.behavior_specs)))

        behavior_name = list(env.behavior_specs)[0]
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        spec = env.behavior_specs[behavior_name]

        tracked_agent = -1

        # DDPGAgent
        agent = DDPGAgent()
        rewards = deque(maxlen=print_interval)
        success_cnt = 0
        step = 0

        myobs = deque()
        myaction = deque()
        myreward = deque()
        myobs2 = deque()
        mydone = deque()

        # save into replay memory
        for episode in range(run_episode + test_episode):
            if episode == run_episode:
                train_mode = False

            if tracked_agent == -1 and len(decision_steps) >= 1:
                tracked_agent = decision_steps.agent_id[0]

            env.reset()
            decision_steps, terminal_steps = env.get_steps(behavior_name)

            state = decision_steps.obs[0]

            episode_rewards = 0
            done = False

            while not done:
                step += 1

                action = agent.get_action(state)

                env.set_actions(behavior_name, action)
                env.step()

                decision_steps, terminal_steps = env.get_steps(behavior_name)

                myobs.append(state)
                myaction.append(action)

                for agent_id_terminated in terminal_steps:

                    myreward.append(terminal_steps[agent_id_terminated].reward)

                    myobs2.append(terminal_steps[agent_id_terminated].obs[0])
                    mydone.append(not terminal_steps[agent_id_terminated].interrupted)

                    if tracked_agent == agent_id_terminated:
                        episode_rewards += terminal_steps[tracked_agent].reward
                        done = True

                for agent_id_decisions in decision_steps:

                    myreward.append(decision_steps[agent_id_decisions].reward)

                    myobs2.append(decision_steps[agent_id_decisions].obs[0])
                    mydone.append(False)

                    if tracked_agent == agent_id_decisions:
                        episode_rewards += decision_steps[tracked_agent].reward

                # state <- next_state to generate new trajectory
                state = (np.asarray(decision_steps.obs[0])).reshape((-1, 9))

                if train_mode:

                    myobs_ = (np.asarray(myobs)).reshape((-1, 9))
                    myaction_ = (np.asarray(myaction)).reshape((-1, 3))
                    myreward_ = (np.asarray(myreward)).reshape((-1, 1))
                    myobs2_ = (np.asarray(myobs2)).reshape((-1, 9))
                    mydone_ = (np.asarray(mydone)).reshape((-1, 1))

                    agent.append_sample(myobs_, myaction_, myreward_, myobs2_, mydone_)

                    myobs.clear()
                    myaction.clear()
                    myreward.clear()
                    myobs2.clear()
                    mydone.clear()
    then decision_steps.obs[0] must be an empty list at

    state = (np.asarray(decision_steps.obs[0])).reshape((-1, 9)) here.

    So, in short:

    1. Should I skip generating that particular trajectory whenever there is no decision_steps.obs[0] at some timestep?

    2. If not, how should I handle this problem?
     
  10. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    + Edit: please check this image as well.

    https://imgur.com/a/V6EYhGb
     
  11. vincentpierre

    vincentpierre

    Unity Technologies

    Joined:
    May 5, 2017
    Posts:
    160
    If no agents requested a decision, then they do not need an action.
    I think storing the whole batch of actions (for 9 agents) in myaction is a terrible idea, because the order of the AgentIds is not guaranteed. I think the best way to store trajectories is to keep individual trajectories for each Agent (like what is done here: https://colab.research.google.com/d...GnynYdL_LRsHG#forceEdit=true&sandboxMode=true). This way your code will be a lot more robust to terminal steps and to Agents skipping decisions.
    If you really want to store trajectories as batched trajectories (which I do not recommend), you will have to only store the actions taken when all 9 agents requested a decision, while somehow managing the observations and rewards received in the terminal steps.

    In ML-Agents, Agents can request decisions and terminate at any time. This is to allow more flexibility to the C# developer. This means that the data received by Python is unordered and has no guarantee of always containing the same number of Agents (a step can even have 0 agents requesting decisions, to allow C# to signal that an Agent terminated). This means that Python must keep track of individual Agents when storing trajectories.
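    As a rough sketch of that per-agent bookkeeping (store_experience is a hypothetical helper standing in for your replay-buffer append, policy stands in for your actor, and env / behavior_name come from the setup above):

    Code (Python):
    # Keep the last observation and last action separately for every AgentId.
    last_obs = {}     # AgentId -> last observation
    last_action = {}  # AgentId -> last action sent

    decision_steps, terminal_steps = env.get_steps(behavior_name)

    # Terminated agents complete their pending transition and are dropped.
    for agent_id in terminal_steps:
        if agent_id in last_obs:  # may be missing if the agent terminated before ever acting
            store_experience(last_obs.pop(agent_id), last_action.pop(agent_id),
                             terminal_steps[agent_id].reward,
                             terminal_steps[agent_id].obs[0], done=True)

    # Deciding agents close their previous transition and open a new one.
    for agent_id in decision_steps:
        if agent_id in last_obs:
            store_experience(last_obs[agent_id], last_action[agent_id],
                             decision_steps[agent_id].reward,
                             decision_steps[agent_id].obs[0], done=False)
        last_obs[agent_id] = decision_steps[agent_id].obs[0]
        last_action[agent_id] = policy(last_obs[agent_id])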
     
  12. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    Thank you, I just finished implementing it.

    Is there any possibility of a "KeyError" occurring
    in the "generate_trajectories" function? Because that error occurs randomly after a few episodes.

    Specifically, at
    obs=dict_last_obs_from_agent[agent_id_terminated].copy() in the
    for agent_id_terminated in terminal_steps:
    loop.
     
  13. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    To be more specific, after a few random episodes,

    Code (Python):
    def generate_trajectories(self, env, buffer_size):

        # Create an empty Buffer
        buffer: Buffer = []

        env.reset()

        behavior_name = list(env.behavior_specs)[0]
        spec = env.behavior_specs[behavior_name]

        # Create a Mapping from AgentId to Trajectories.
        # This will help us create trajectories for each Agents.
        dict_trajectories_from_agent: Dict[int, Trajectory] = {}
        dict_last_obs_from_agent: Dict[int, np.ndarray] = {}
        dict_last_action_from_agent: Dict[int, np.ndarray] = {}
        # Only for reporting
        dict_cumulative_reward_from_agent: Dict[int, float] = {}

        cumulative_rewards: List[float] = []

        # while not enough data in the buffer
        while len(buffer) < buffer_size:

            # Get the Decision Steps and Terminal Steps of the Agents
            decision_steps, terminal_steps = env.get_steps(behavior_name)

            # For all Agents with a Terminal Step:
            for agent_id_terminated in terminal_steps:
                # Create its last experience (is last because the Agent terminated)
                last_experience = Experience(
                    obs=dict_last_obs_from_agent[agent_id_terminated].copy(),
                    reward=terminal_steps[agent_id_terminated].reward,
                    done=not terminal_steps[agent_id_terminated].interrupted,
                    action=dict_last_action_from_agent[agent_id_terminated].copy(),
                    next_obs=terminal_steps[agent_id_terminated].obs[0]
                )
    at the beginning of generate_trajectories, on the first loop iteration,

    decision_steps has [0, 1, 2, 3, 4, 5, 6, 7, 8] and terminal_steps has [3], while dict_last_obs_from_agent is initialized as an empty dictionary.

    So a KeyError occurs at obs=dict_last_obs_from_agent[agent_id_terminated].copy() because dict_last_obs_from_agent is empty.

    It seems like there wouldn't be any problem if there were only decision_steps and empty terminal_steps on the very first loop iteration, but I got a KeyError because there are also terminal_steps on the very first iteration.

    How can I solve this problem?
     
  14. vincentpierre

    vincentpierre

    Unity Technologies

    Joined:
    May 5, 2017
    Posts:
    160
    Does this happen with GridWorld? I never observed that before and was able to train GridWorld correctly. I think your guess is right: there is something in the environment that terminates an Agent right after the first decision (which should not happen in GridWorld). I would recommend ignoring it (if agent_id_terminated is not in dict_last_obs_from_agent, do nothing).
     
  15. schrodinger8834

    schrodinger8834

    Joined:
    Oct 13, 2020
    Posts:
    10
    How can I implement this "ignoring" in the code?

    As far as I know, when an agent_id is in decision_steps it needs an "action" to proceed, which means that doing nothing wouldn't let the agent step().

    Could you explain more specifically in terms of writing the code? I have no idea how to "do nothing" in code.

    (And this happens in my own environment, not in GridWorld.)

    Edit: I just added simply:

    if agent_id_terminated not in dict_last_obs_from_agent:
        break

    right after

    decision_steps, terminal_steps = env.get_steps(behavior_name)

    in the while len(buffer) < buffer_size loop.

    Is there a better option?
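    A sketch of what the recommended skip could look like instead, using a continue at the top of the terminal-steps loop rather than a break (written against the Colab-style generate_trajectories code above):

    Code (Python):
    for agent_id_terminated in terminal_steps:
        # This Agent terminated before we ever stored an observation/action for it,
        # so there is no pending experience to complete: just skip it.
        if agent_id_terminated not in dict_last_obs_from_agent:
            continue
        last_experience = Experience(
            obs=dict_last_obs_from_agent[agent_id_terminated].copy(),
            reward=terminal_steps[agent_id_terminated].reward,
            done=not terminal_steps[agent_id_terminated].interrupted,
            action=dict_last_action_from_agent[agent_id_terminated].copy(),
            next_obs=terminal_steps[agent_id_terminated].obs[0]
        )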
     
    Last edited: Dec 10, 2020