Question Hybrid Control (Discrete + Continuous actions)

Discussion in 'ML-Agents' started by TulioMMo, Apr 30, 2021.

1. TulioMMo

Joined:
Dec 30, 2020
Posts:
29
Dear all,

I am currently working on a project where an agent has to perform 5 discrete actions and 2 continuous actions.

Thankfully, in the latest implementation of Unity ML-Agents, it seems that hybrid control is a possibility, since we can implement discrete and continuous actions simultaneously.

I am curious to know how this was implemented in Unity ML-Agents. I have found a couple of papers online about hybrid control, such as http://proceedings.mlr.press/v100/neunert20a/neunert20a.pdf from DeepMind, but I haven't figured out which method is being applied in Unity ML-Agents.

Does anyone know which method for hybrid control is being utilized in Unity ML-Agents? If so, are there any papers I could read to understand more of the method? I am using ML-Agents version 0.26.0.

Many thanks!

2. Unity Technologies

Joined:
May 5, 2017
Posts:
160
Which algorithm are you using?
In the case of PPO, the action probabilities for continuous and discrete actions can be multiplied together to give the joint probability (we assume the discrete and continuous actions are independent).
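Under that independence assumption, the joint log-probability can be sketched in plain Python (a toy illustration with invented function names, not ML-Agents' actual trainer code):

```python
import math

def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian, evaluated per dimension and summed."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for xi, m, s in zip(x, mean, std)
    )

def joint_log_prob(cont_action, cont_mean, cont_std, disc_probs, disc_action):
    """log pi(a|s) = log pi(a_cont|s) + log pi(a_disc|s), assuming independence."""
    log_p_cont = gaussian_log_prob(cont_action, cont_mean, cont_std)
    log_p_disc = math.log(disc_probs[disc_action])
    return log_p_cont + log_p_disc

# Example: 2 continuous dims, one 5-way discrete branch with uniform probabilities
lp = joint_log_prob([0.1, -0.3], [0.0, 0.0], [1.0, 1.0], [0.2] * 5, 3)
```

Working in log space turns the product of probabilities into a sum, which is how policy-gradient implementations usually compute it.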
In the case of SAC, we use multiple value heads for each discrete action and use the regular continuous SAC action as input to the value and Q networks. In this case the continuous actions are "selected first" and the discrete actions are conditioned on the continuous actions.
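A toy sketch of that idea (invented names and linear "networks" for brevity; the real trainer uses learned torch models): the continuous action is an input to the Q network, which outputs one value head per discrete action.

```python
import math

def q_values(state, cont_action, heads):
    """Toy Q-network: Q(s, a_cont) -> one Q value per discrete action.

    The continuous action is concatenated with the state as an *input*,
    so the per-discrete-action values are conditioned on it.
    """
    features = state + cont_action
    return [sum(w * f for w, f in zip(head, features)) for head in heads]

def discrete_probs_from_q(qs, temperature=1.0):
    """Softmax over the per-branch Q values gives a discrete policy
    conditioned on the already-selected continuous action."""
    m = max(qs)
    exps = [math.exp((q - m) / temperature) for q in qs]
    total = sum(exps)
    return [e / total for e in exps]
```

This mirrors the "continuous first, discrete conditioned on it" ordering described above, with everything else simplified away.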
I will ask around for references.

3. TulioMMo

Thank you for the reply! I am using the PPO algorithm.

Regarding the joint probabilities, how can I specify which discrete and continuous actions are multiplied against each other?

4. Unity Technologies

The SAC implementation is inspired by this paper: https://arxiv.org/pdf/1912.11077.pdf

I do not understand what you mean by "specify which discrete and continuous actions are multiplied against each other". All action probabilities are multiplied together. To update PPO, you need \pi(a | s) = \pi(a_continuous | s) * \pi(a_discrete | s)

5. TulioMMo

Thank you for the paper! I understand that ML-Agents currently multiplies the probabilities of all actions (discrete and continuous) taken in the same time-step. My real issue is that I have a continuous action that depends on a discrete action (sorry for not clarifying this before). I saw that I have the option of conditioning a discrete action on the state (action masking), but not of conditioning a continuous action on a discrete action or state... Would that be possible? Thank you for all the help!

6. Unity Technologies

Currently not possible; it would require some trainer code changes, and I am not sure it would work as we expect. You could condition on the previous discrete action by feeding the last action as an observation, but that is not the same as taking the actions "simultaneously". When you say "conditioning a continuous action on a state", that is already the case: the action already depends on the state/observations provided to the agent, so I am not sure what you mean there. What is the use case for having a continuous action depend on a discrete action (just curious)? It might be enough to have both discrete and continuous actions conditioned on the state for your use case.
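The workaround of feeding the last action back as an observation can be sketched like this (a toy Python version with made-up names; in a Unity agent this logic would live in `CollectObservations`):

```python
def one_hot(index, size):
    """One-hot encode a discrete action index."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def build_observation(sensor_readings, last_discrete_action, num_actions):
    """Append the previous discrete action (one-hot) to the regular observations,
    so the policy can condition on what it chose at the previous step."""
    return sensor_readings + one_hot(last_discrete_action, num_actions)
```

As noted above, this conditions on the *previous* step's discrete action, not on one selected simultaneously.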

7. TulioMMo

I am developing a research project that simulates the operation of hydraulic bulb turbines under an ocean signal, where the goal is to maximize the total energy attained by the system. The turbines can be set to several "modes" of operation (Power Generation, Offline, Pump Mode, Idling Mode). While setting the mode of operation corresponds to a discrete action, the continuous action would be inputting power Pin (with possible values within [0, MaxPin]) when in "Pump Mode" (in any other mode of operation it would make sense that Pin = 0). That's why I thought of conditioning a continuous action on a discrete action or state. I also tried performing hybrid control by penalizing the agent when Pin != 0 in modes other than Pump Mode, or by simply setting Pin = 0 regardless of the agent's "Pin output" when in those modes, but in this scenario the agent would only ever choose Pin = 0 or Pin = MaxPin.

I have obtained some success by changing Pin output to discrete action (i.e. discretizing the range [0, MaxPin]). With this I can use action masking and the agent is utilizing the pump... But still, I think a continuous input for the pump would be ideal.

8. Unity Technologies

For continuous actions, it will be impossible for the agent to select the action 0 exactly. This is because continuous control samples an action from a mean and a standard deviation, and the entropy regularization prevents the standard deviation from being 0. I believe it is fine to ignore the agent's continuous Pin action when a mode other than Pump is selected. Even if you were to condition the continuous action on the discrete action, the continuous action would never be exactly 0. I am not sure this project is best solved with RL; maybe a planner or a supervised learning approach would give better results?

9. TulioMMo

Hmm, maybe I am doing something wrong... Right now I am using

powerInputPumping = Mathf.Clamp(actionBuffers.ContinuousActions[0], 0f, 1f) * PinMax

as the continuous action output for the agent. Using Debug.Log() during training I see exact output values of "0", "PinMax", and values in between. After training, the agent outputs exactly "0" or "PinMax".

From what I understand, actionBuffers.ContinuousActions[0] outputs values in [-1, 1], so maybe Mathf.Clamp is allowing the exact extreme values?

Thanks again for all the help

10. Unity Technologies

If a continuous action is clamped, it is very hard for the agent to learn that exact threshold. You are effectively discretizing the continuous action with this Clamp (it is equivalent to a max, and it introduces a discontinuity/threshold in the continuous action). That makes it rather hard to learn, because if the agent selects -0.1 or -0.9 there is no difference, so the agent does not get a strong learning signal. As an alternative, I would try
powerInputPumping = (actionBuffers.ContinuousActions[0] + 1f) / 2f * PinMax
and set powerInputPumping to zero whenever the discrete action requires it.
I am not sure it will learn better, but I think it is worth trying out.
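To see the difference between the two mappings, here is a toy Python version of both (function names invented for illustration):

```python
def clamped_pin(raw, pin_max):
    """Clamp-based mapping: the whole [-1, 0] range collapses onto exactly 0,
    so the policy gets no gradient signal for distinguishing negative outputs."""
    return max(0.0, min(1.0, raw)) * pin_max

def remapped_pin(raw, pin_max):
    """Affine mapping [-1, 1] -> [0, pin_max]: no dead zone, every raw output
    maps to a distinct power level."""
    return (raw + 1.0) / 2.0 * pin_max
```

With the clamp, raw outputs of -0.1 and -0.9 are indistinguishable (both yield 0); with the affine remap they map to different power levels.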

11. TulioMMo

Thank you very much!! I'll try one more time with your suggestion

12. gvkcps

Joined:
May 27, 2021
Posts:
1
How can we multiply the probabilities of the discrete and continuous actions? The discrete action can be assigned an actual probability (i.e., prob. mass fcn.), but in the case of the continuous action we can only calculate its probability density function. Do you mean we could simply multiply the discrete PMF by the continuous PDF? Isn't this wrong from a theoretical viewpoint?


13. Unity Technologies

You are right that the two are not the same kind of quantity (a probability mass versus a probability density), but PPO optimizes a ratio of the current policy over the old policy. The maximization roughly becomes

(current_policy_continuous_probability * current_policy_discrete_probability) / (old_policy_continuous_probability * old_policy_discrete_probability) * advantage
= (current_policy_continuous_probability / old_policy_continuous_probability) * (current_policy_discrete_probability / old_policy_discrete_probability) * advantage

Each factor is a ratio of like quantities (density over density, mass over mass), so the units cancel and the product is dimensionless. Because of this ratio, it becomes reasonable to do the multiplication. I hope this makes sense.
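A quick numeric check of the factorization (toy, made-up numbers, just to verify the algebra):

```python
# Made-up values for one transition: continuous densities and discrete
# probabilities under the new and old policies, plus an advantage estimate.
new_cont, old_cont = 0.30, 0.25
new_disc, old_disc = 0.60, 0.50
advantage = 2.0

# Joint ratio vs. product of per-branch ratios: identical by algebra.
joint_ratio = (new_cont * new_disc) / (old_cont * old_disc)
factored = (new_cont / old_cont) * (new_disc / old_disc)
assert abs(joint_ratio - factored) < 1e-12

# The (unclipped) PPO surrogate objective for this sample.
objective = joint_ratio * advantage
```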

14. TulioMMo

Sorry for resuscitating this post after so long.

I wanted to know if the final layer of the actor neural network in PPO follows the same logic as SAC (shown in the attached figure), where we have both a softmax and the moments of the Gaussian distribution. I am also wondering if the implementation is as straightforward as multi-output models for combined classification and regression (https://machinelearningmastery.com/neural-network-models-for-combined-classification-and-regression/).

I am asking this since I recently found another paper where, for Hybrid-PPO, two independent actor neural networks (one for the discrete and one for the continuous action) are used (https://www.ijcai.org/proceedings/2019/0316.pdf), sharing only the first few layers to encode the state information.

Many thanks!

15. TulioMMo

Well, I think I got it:

Since the probability ratios of the continuous and discrete actions are multiplied against each other, that is probably done inside the loss function. Since the gradient is obtained by differentiating the loss function, both discrete and continuous actions are parametrised by the same weights, i.e. they are output by the same neural network.

I guess this could also be done with two neural networks (one for continuous and one for discrete actions), but then you would need partial derivatives with respect to each network's weights.
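That single-network reasoning can be sketched as a shared trunk with two output heads. This is a deliberately tiny, hypothetical plain-Python version (linear layers only, no learning), not ML-Agents' actual architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def policy(observation, trunk_w, disc_w, mean_w, log_std):
    """One network, shared trunk, two output heads (toy version).

    trunk_w encodes the state; disc_w produces discrete logits; mean_w
    produces the Gaussian mean; log_std is a separate learned parameter.
    Both heads share the trunk weights, so one gradient updates both.
    """
    # Shared encoding of the state (a single linear layer here for brevity).
    h = [sum(w * o for w, o in zip(row, observation)) for row in trunk_w]
    # Discrete head: logits -> softmax probabilities.
    disc_probs = softmax([sum(w * x for w, x in zip(row, h)) for row in disc_w])
    # Continuous head: Gaussian mean; std comes from the log-std parameter.
    cont_mean = [sum(w * x for w, x in zip(row, h)) for row in mean_w]
    cont_std = [math.exp(s) for s in log_std]
    return disc_probs, cont_mean, cont_std
```

Because the trunk is shared, differentiating a loss that multiplies the discrete and continuous probability ratios updates one set of weights, matching the reasoning above; the two-network variant from the Hybrid-PPO paper would instead keep separate weights past the shared encoder.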

Last edited: Jun 22, 2022