Question Hybrid Control (Discrete + Continuous actions)

Discussion in 'ML-Agents' started by TulioMMo, Apr 30, 2021.

1. TulioMMo

Joined:
Dec 30, 2020
Posts:
29
Dear all,

I am currently working on a project where an agent has to perform 5 discrete actions and 2 continuous actions.

Thankfully, in the latest implementation of Unity ML-Agents, it seems that hybrid control is a possibility, since we can implement discrete and continuous actions simultaneously.

I am curious to know how this was implemented in Unity ML-Agents. I have found a couple of papers online about hybrid control, such as http://proceedings.mlr.press/v100/neunert20a/neunert20a.pdf from DeepMind, but I haven't figured out which method is being applied in Unity ML-Agents.

Does anyone know which method for hybrid control is being utilized in Unity ML-Agents? If so, are there any papers I could read to understand more of the method? I am using ML-Agents version 0.26.0.

Many thanks!

2. Unity Technologies

Joined:
May 5, 2017
Posts:
160
Which algorithm are you using?
In the case of PPO, the action probabilities for continuous and discrete actions can be multiplied together to give the joint probability (we assume the discrete and continuous actions are independent).
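Under that independence assumption, the joint log-probability can be sketched in plain Python (a toy illustration with invented function names, not ML-Agents' actual trainer code):

```python
import math

def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian, evaluated per dimension and summed."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for xi, m, s in zip(x, mean, std)
    )

def joint_log_prob(cont_action, cont_mean, cont_std, disc_probs, disc_action):
    """log pi(a|s) = log pi(a_cont|s) + log pi(a_disc|s), assuming independence."""
    log_p_cont = gaussian_log_prob(cont_action, cont_mean, cont_std)
    log_p_disc = math.log(disc_probs[disc_action])
    return log_p_cont + log_p_disc

# Example: 2 continuous dims, one 5-way discrete branch with uniform probabilities
lp = joint_log_prob([0.1, -0.3], [0.0, 0.0], [1.0, 1.0], [0.2] * 5, 3)
```

Working in log space turns the product of probabilities into a sum, which is how policy-gradient implementations usually compute it.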
In the case of SAC, we use multiple value heads for each discrete action and use the regular continuous SAC action as input to the value and Q networks. In this case the continuous actions are "selected first" and the discrete actions are conditioned on the continuous actions.
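A toy sketch of that idea (invented names and linear "networks" for brevity; the real trainer uses learned torch models): the continuous action is an input to the Q network, which outputs one value head per discrete action.

```python
import math

def q_values(state, cont_action, heads):
    """Toy Q-network: Q(s, a_cont) -> one Q value per discrete action.

    The continuous action is concatenated with the state as an *input*,
    so the per-discrete-action values are conditioned on it.
    """
    features = state + cont_action
    return [sum(w * f for w, f in zip(head, features)) for head in heads]

def discrete_probs_from_q(qs, temperature=1.0):
    """Softmax over the per-branch Q values gives a discrete policy
    conditioned on the already-selected continuous action."""
    m = max(qs)
    exps = [math.exp((q - m) / temperature) for q in qs]
    total = sum(exps)
    return [e / total for e in exps]
```

This mirrors the "continuous first, discrete conditioned on it" ordering described above, with everything else simplified away.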
I will ask around for references.

3. TulioMMo

Thank you for the reply! I am using the PPO algorithm.

Regarding the joint probabilities, how can I specify which discrete and continuous actions are multiplied against each other?

4. Unity Technologies

The SAC implementation is inspired by this paper: https://arxiv.org/pdf/1912.11077.pdf

I do not understand what you mean by "specify which discrete and continuous actions are multiplied against each other". All action probabilities are multiplied together. To update PPO, you need \pi(a | s) = \pi(a_continuous | s) * \pi(a_discrete | s)

5. TulioMMo

Thank you for the paper! I understand that ML-Agents currently multiplies the probabilities of all actions (discrete and continuous) taken in the same time-step. My real issue is that I have a continuous action that depends on a discrete action (sorry for not clarifying this before). I saw that I have the option of conditioning a discrete action on the state (action masking), but not of conditioning a continuous action on a discrete action or state... Would that be possible? Thank you for all the help!

6. Unity Technologies

Currently not possible; it would require some trainer code changes, and I am not sure it would work as we expect. You could condition on the previous discrete action by feeding the last action as an observation, but that is not the same as taking the actions "simultaneously". When you say "conditioning a continuous action on a state", that is already the case: the action already depends on the state/observations provided to the agent, so I am not sure what you mean there. What is the use case for having a continuous action depend on a discrete action (just curious)? It might be enough to have both discrete and continuous actions conditioned on the state for your use case.
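The workaround of feeding the last action back as an observation can be sketched like this (a toy Python version with made-up names; in a Unity agent this logic would live in `CollectObservations`):

```python
def one_hot(index, size):
    """One-hot encode a discrete action index."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def build_observation(sensor_readings, last_discrete_action, num_actions):
    """Append the previous discrete action (one-hot) to the regular observations,
    so the policy can condition on what it chose at the previous step."""
    return sensor_readings + one_hot(last_discrete_action, num_actions)
```

As noted above, this conditions on the *previous* step's discrete action, not on one selected simultaneously.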

7. TulioMMo

I am developing a research project that simulates the operation of hydraulic bulb turbines under an ocean signal, where the goal is to maximize the total energy attained by the system. The turbines can be set to several "modes" of operation (Power Generation, Offline, Pump Mode, Idling Mode). While setting the mode of operation corresponds to a discrete action, the continuous action would be inputting power Pin (with possible values within [0, MaxPin]) when in "Pump Mode" (in any other mode of operation it would make sense that Pin = 0). That's why I thought of conditioning a continuous action on a discrete action or state. I also tried performing hybrid control by penalizing the agent when Pin != 0 in modes other than Pump Mode, or by simply setting Pin = 0 regardless of the agent's "Pin output" when in those modes, but in this scenario the agent would only ever choose Pin = 0 or Pin = MaxPin.

I have obtained some success by changing Pin output to discrete action (i.e. discretizing the range [0, MaxPin]). With this I can use action masking and the agent is utilizing the pump... But still, I think a continuous input for the pump would be ideal.

8. Unity Technologies

For continuous actions, it will be impossible for the agent to select the action 0 exactly. This is because continuous control samples an action from a mean and a standard deviation, and the entropy regularization prevents the standard deviation from being 0. I believe it is fine to ignore the agent's continuous Pin action when a mode other than Pump is selected. Even if you were to condition the continuous action on the discrete action, the continuous action would never be exactly 0. I am not sure this project is best solved with RL; maybe a planner or a supervised learning approach would give better results?

9. TulioMMo

Hmm, maybe I am doing something wrong... Right now I am using

powerInputPumping = Mathf.Clamp(actionBuffers.ContinuousActions[0], 0f, 1f) * PinMax

as the continuous action output for the agent. Using Debug.Log() during training I see exact output values of "0", "PinMax", and values in between. After training, the agent outputs exactly "0" or "PinMax".

From what I understand, actionBuffers.ContinuousActions[0] outputs values in [-1, 1], so maybe Mathf.Clamp is allowing the exact extreme values?

Thanks again for all the help

10. Unity Technologies

If a continuous action is clamped, it is very hard for the agent to learn that exact threshold. You are effectively discretizing the continuous action with this Clamp (it is equivalent to a max, and it introduces a discontinuity/threshold in the continuous action). That makes it rather hard to learn, because if the agent selects -0.1 or -0.9 there is no difference, so the agent does not get a strong learning signal. As an alternative, I would try
powerInputPumping = (actionBuffers.ContinuousActions[0] + 1f) / 2f * PinMax
and set powerInputPumping to zero whenever the discrete action requires it.
I am not sure it will learn better, but I think it is worth trying out.
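To see the difference between the two mappings, here is a toy Python version of both (function names invented for illustration):

```python
def clamped_pin(raw, pin_max):
    """Clamp-based mapping: the whole [-1, 0] range collapses onto exactly 0,
    so the policy gets no gradient signal for distinguishing negative outputs."""
    return max(0.0, min(1.0, raw)) * pin_max

def remapped_pin(raw, pin_max):
    """Affine mapping [-1, 1] -> [0, pin_max]: no dead zone, every raw output
    maps to a distinct power level."""
    return (raw + 1.0) / 2.0 * pin_max
```

With the clamp, raw outputs of -0.1 and -0.9 are indistinguishable (both yield 0); with the affine remap they map to different power levels.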

11. TulioMMo

Thank you very much!! I'll try one more time with your suggestion

12. gvkcps

Joined:
May 27, 2021
Posts:
1
How can we multiply the probabilities of the discrete and continuous actions? The discrete action can be assigned an actual probability (i.e., prob. mass fcn.), but in the case of the continuous action we can only calculate its probability density function. Do you mean we could simply multiply the discrete PMF by the continuous PDF? Isn't this wrong from a theoretical viewpoint?


13. Unity Technologies

You are right that the two are not the same kind of quantity (a probability mass versus a probability density), but PPO optimizes a ratio of the current policy over the old policy. The maximization roughly becomes

(current_policy_continuous_probability * current_policy_discrete_probability) / (old_policy_continuous_probability * old_policy_discrete_probability) * advantage
= (current_policy_continuous_probability / old_policy_continuous_probability) * (current_policy_discrete_probability / old_policy_discrete_probability) * advantage

Each factor is a ratio of like quantities (density over density, mass over mass), so the units cancel and the product is dimensionless. Because of this ratio, it becomes reasonable to do the multiplication. I hope this makes sense.
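A quick numeric check of the factorization (toy, made-up numbers, just to verify the algebra):

```python
# Made-up values for one transition: continuous densities and discrete
# probabilities under the new and old policies, plus an advantage estimate.
new_cont, old_cont = 0.30, 0.25
new_disc, old_disc = 0.60, 0.50
advantage = 2.0

# Joint ratio vs. product of per-branch ratios: identical by algebra.
joint_ratio = (new_cont * new_disc) / (old_cont * old_disc)
factored = (new_cont / old_cont) * (new_disc / old_disc)
assert abs(joint_ratio - factored) < 1e-12

# The (unclipped) PPO surrogate objective for this sample.
objective = joint_ratio * advantage
```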

14. TulioMMo

Sorry for resuscitating this post after so long.

I wanted to know if the final layer of the actor neural network in PPO follows the same logic as SAC (shown in the attached figure), where we have both a softmax and the moments of the Gaussian distribution. I am also wondering if the implementation is as straightforward as multi-output models for combined classification and regression (https://machinelearningmastery.com/neural-network-models-for-combined-classification-and-regression/).

I am asking this since I recently found another paper where, for Hybrid-PPO, two independent actor neural networks (one for the discrete and one for the continuous action) are used (https://www.ijcai.org/proceedings/2019/0316.pdf), sharing only the first few layers to encode the state information.

Many thanks!

15. TulioMMo

Well, I think I got it:

Since the probability ratios of the continuous and discrete actions are multiplied against each other, that is probably done inside the loss function. Since the gradient is obtained by differentiating the loss function, both discrete and continuous actions are parametrised by the same weights, i.e. they are output by the same neural network.

I guess this could also be done with two neural networks (one for continuous and one for discrete actions), but then you would need partial derivatives with respect to each network's weights.
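That single-network reasoning can be sketched as a shared trunk with two output heads. This is a deliberately tiny, hypothetical plain-Python version (linear layers only, no learning), not ML-Agents' actual architecture:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def policy(observation, trunk_w, disc_w, mean_w, log_std):
    """One network, shared trunk, two output heads (toy version).

    trunk_w encodes the state; disc_w produces discrete logits; mean_w
    produces the Gaussian mean; log_std is a separate learned parameter.
    Both heads share the trunk weights, so one gradient updates both.
    """
    # Shared encoding of the state (a single linear layer here for brevity).
    h = [sum(w * o for w, o in zip(row, observation)) for row in trunk_w]
    # Discrete head: logits -> softmax probabilities.
    disc_probs = softmax([sum(w * x for w, x in zip(row, h)) for row in disc_w])
    # Continuous head: Gaussian mean; std comes from the log-std parameter.
    cont_mean = [sum(w * x for w, x in zip(row, h)) for row in mean_w]
    cont_std = [math.exp(s) for s in log_std]
    return disc_probs, cont_mean, cont_std
```

Because the trunk is shared, differentiating a loss that multiplies the discrete and continuous probability ratios updates one set of weights, matching the reasoning above; the two-network variant from the Hybrid-PPO paper would instead keep separate weights past the shared encoder.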

Last edited: Jun 22, 2022