
Question SAC Training: Rewards Dropping Off

Discussion in 'ML-Agents' started by Aleksander_Wit, Mar 31, 2022.

  1. Aleksander_Wit

    Joined: Dec 13, 2020
    Posts: 19
    Hello. I have been attempting to transition from PPO to SAC as my trainer of choice. SAC seems very promising due to its potentially more general solutions, as well as its potentially higher sample efficiency if used right.

    However, so far SAC training has largely been a failure for me compared to PPO. Training is much slower and more unstable, and it often seems to collapse and flatline by the end. I was hoping someone could help me understand what I am doing wrong?

    [Attached images (training curves): upload_2022-3-31_20-21-52.png, upload_2022-3-31_20-22-18.png]

    Without the red run, which had an extreme starting entropy coefficient:
    [Attached images (training curves): upload_2022-3-31_20-23-14.png, upload_2022-3-31_20-23-39.png]

    Curiosity is unused here. Pink has only been training for a very small number of steps and uses PPO, unlike the others, which use SAC. Even at this minuscule number of steps, in the grand scheme of things, it is already outperforming the solutions that have trained for 50M-100M steps.
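    For reference, my trainer config has roughly this shape (the values and behavior name below are illustrative placeholders rather than my exact settings; init_entcoef is the starting entropy coefficient mentioned above):

        behaviors:
          MyAgent:                      # placeholder behavior name
            trainer_type: sac
            hyperparameters:
              learning_rate: 0.0003
              batch_size: 128
              buffer_size: 500000       # SAC replay buffer; far larger than PPO's rollout buffer
              buffer_init_steps: 10000  # experiences collected before policy updates begin
              tau: 0.005                # soft-update rate for the target network
              steps_per_update: 1       # agent steps per policy update
              init_entcoef: 0.5         # starting entropy coefficient; very high values keep the policy near-random for a long time
            network_settings:
              hidden_units: 256
              num_layers: 2
            reward_signals:
              extrinsic:
                gamma: 0.99
                strength: 1.0
            max_steps: 100000000        # in the range of the 50M-100M step runs above
            time_horizon: 64
            summary_freq: 50000

    As far as I understand, init_entcoef only sets the starting point; ML-Agents then auto-tunes the entropy coefficient toward a target entropy during training.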
     
  2. Aleksander_Wit

    Joined: Dec 13, 2020
    Posts: 19
    Pink PPO finished training (the big temporary drop is from me messing around and stopping the training many times). As you can see, it greatly excels in an identical environment and even takes less time per step :/

    [Attached image (training curve): upload_2022-4-1_0-37-44.png]
     
  3. Aleksander_Wit

    Joined: Dec 13, 2020
    Posts: 19
    More data: the green/turquoise run has fewer resources available to live on in the environment, and as such performs a bit worse. Higher lifetime is better; 1200 is the maximum lifetime.

    [Attached image (lifetime curves): upload_2022-4-1_12-27-18.png]