
Resolved: What output activation function does the Reacher Agent use?

Discussion in 'ML-Agents' started by weight_theta, May 10, 2021.

  1. weight_theta

     Joined: Aug 23, 2020
     Posts: 65

    Hi everyone,

    I've been curious about which activation function the Reacher agent uses. The output of the neural network controls the torques (twisting forces) to be applied to the actuators. Would Swish still be applied to the output layer, as ML-Agents usually does, or would it be a linear activation function?
     
  2. andrewcoh_unity (Unity Technologies)

     Joined: Sep 5, 2019
     Posts: 162

    The output layer of our policies uses a Swish activation, and that output is then used to parameterize the action distribution. Reacher uses continuous actions, so the action distribution is represented by a Gaussian: the output layer produces a vector (after Swish) that is used as the mean and standard deviation of that Gaussian.

    In the case of continuous actions, the sampled action is also clipped/squashed to between -1 and 1.
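
    A minimal PyTorch sketch of that flow (the class and layer names here are illustrative, not the actual ML-Agents code):

        import torch
        import torch.nn as nn

        class GaussianPolicyHead(nn.Module):
            """Illustrative continuous-action head: Swish on the output
            layer, whose result parameterizes a Gaussian over actions."""

            def __init__(self, hidden_size: int, action_size: int):
                super().__init__()
                # One output vector, later split into mean and log-std halves.
                self.output_layer = nn.Linear(hidden_size, 2 * action_size)

            def forward(self, hidden: torch.Tensor) -> torch.Tensor:
                out = self.output_layer(hidden)
                out = out * torch.sigmoid(out)  # Swish: x * sigmoid(x)
                mean, log_std = torch.chunk(out, 2, dim=-1)
                dist = torch.distributions.Normal(mean, torch.exp(log_std))
                action = dist.sample()
                # Continuous actions are clipped/squashed into [-1, 1].
                return torch.clamp(action, -1.0, 1.0)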

    Hope this helps.
     
  3. weight_theta

     Joined: Aug 23, 2020
     Posts: 65

    The beta parameter in the Swish activation function is either (1) a constant or (2) a trainable parameter. Do you know which is used under the hood? If a constant is used, what is its value?
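
    For reference, the two variants look like this in PyTorch (a general sketch, not ML-Agents' code; which one ML-Agents uses is exactly the question):

        import torch
        import torch.nn as nn

        def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
            # Variant 1: beta is a fixed constant. With beta = 1 this
            # reduces to SiLU, i.e. x * sigmoid(x).
            return x * torch.sigmoid(beta * x)

        class TrainableSwish(nn.Module):
            # Variant 2: beta is a learnable parameter updated by backprop.
            def __init__(self):
                super().__init__()
                self.beta = nn.Parameter(torch.ones(1))

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return x * torch.sigmoid(self.beta * x)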
     
  4. andrewcoh_unity (Unity Technologies)

     Joined: Sep 5, 2019
     Posts: 162
  5. weight_theta

     Joined: Aug 23, 2020
     Posts: 65
  6. ervteng_unity (Unity Technologies)

     Joined: Dec 6, 2018
     Posts: 150
    The agent uses the normal PPO and SAC loss functions from those respective algorithms. They're not exactly MSE, but a similar idea: the policy is trained against the advantage, and the critic regresses toward the returns.
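
    As a rough sketch of what those standard losses look like (illustrative, not the exact ML-Agents implementation, which differs in details such as value clipping):

        import torch

        def ppo_losses(log_probs_new, log_probs_old, advantages,
                       values, returns, eps=0.2):
            # Policy: PPO's clipped surrogate objective, driven by advantages.
            ratio = torch.exp(log_probs_new - log_probs_old)
            unclipped = ratio * advantages
            clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
            policy_loss = -torch.min(unclipped, clipped).mean()
            # Critic: regression of predicted values toward empirical
            # returns (MSE-style, the "similar idea" mentioned above).
            value_loss = ((values - returns) ** 2).mean()
            return policy_loss, value_loss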