
Question Why am I getting the same int numbers instead of random float ones on the ContinuousActions vectors?

Discussion in 'ML-Agents' started by Joseguepardo, May 25, 2023.

  1. Joseguepardo

    Joined: Mar 3, 2015 | Posts: 8
    Hello everyone, I was wondering if anyone has run into this situation... I am trying to do some training with the ML-Agents Unity package, and I went through all of the guides. I set up a simple scenario in which I request 3 Continuous actions and no Discrete ones, as you can see here:

    upload_2023-5-24_23-8-55.png

    But for some reason that I don't understand, instead of getting 'random' values between -1 and 1, I keep getting 0, 1, and -1 99% of the time, from the moment the game starts. Here are my console logs (please note that the Collapse option is enabled, so you can see how many times each number is repeated) and the snippet of code I'm using for this simple test:

    upload_2023-5-24_23-9-26.png
    upload_2023-5-24_23-9-36.png

    Yes, I do get some random values in the first element (index 0) of the ActionBuffers; however, it only happens sometimes, and for my particular scenario it doesn't really help, as the agent would basically never get to the point where it can collect a positive reward.

    I've been trying to figure this one out for hours without success... I did another test based on a video I found on YouTube, trying to get a cube to move toward a specific position. For this other test I needed 2 values, and there, every single time, I do get different 'random' values, as I would expect... However, I am not doing anything differently! So how is this case different from the other one? Just for reference, this is how my other test (where I was simply trying out the library) looks:
    upload_2023-5-24_23-15-18.png upload_2023-5-24_23-15-35.png

    And as I mentioned before, this one does return values between -1 and 1, and therefore the agent can actually learn how to get to the reward.

    It got to a point in which I even considered my PC was somehow faulty or damaged, so I formatted my whole system and did some maintenance on it, but it keeps giving me the same single 0, 1, and -1 values all the time. I also created a build and tried out that method, same exact results. I then tried the build on a different PC, same exact results!

    I would highly appreciate anyone's help! Honestly, at this point, I don't really know what else to try... It just is not making any sense to me :(
     
  2. Luke-Houlihan

    Joined: Jun 26, 2007 | Posts: 303
    Is this happening after training for some steps or is it immediately outputting the whole numbers?

    My initial guess would be some un-normalized observations blowing up the decision space.
     
  3. Joseguepardo

    It happens immediately. Your guess is actually right! Although I still don't understand why... I was able to track the issue back to the observations. I'm currently sending 5 values: the object's position and 2 velocity values, which are 50 and 300. If I send 0.005 and 0.3 for those 2 values instead of the whole numbers, it actually works. That doesn't make any sense to me, because I don't think the documentation specifies that these values have to be normalized. And even if that's the case, why don't the position values break this rule? There I have values way bigger than 1, but with those it does work; it seems to start failing only when there are values bigger than 1 in the fourth observation and onwards.
     
  4. Luke-Houlihan

    Nice, glad you could get it working!

    Check out the docs here - https://github.com/Unity-Technologi...ng-Environment-Design-Agents.md#normalization to read a bit about it.
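As a rough illustration of the min-max normalization those docs describe, (value - min) / (max - min), here is a minimal sketch in Python. The arithmetic is language-agnostic; in the agent it would be applied in C# before the values are added as observations. The ranges `MAX_SPEED_A` and `MAX_SPEED_B` are hypothetical, chosen only so that the 50 and 300 from this thread fall inside them; use whatever bounds your own environment actually has.

```python
def normalize(value, min_value, max_value):
    """Min-max scale `value` into [0, 1], clamping anything outside the range."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    scaled = (value - min_value) / (max_value - min_value)
    return max(0.0, min(1.0, scaled))


# Hypothetical upper bounds for the two velocity observations (assumptions,
# not values taken from the thread).
MAX_SPEED_A = 100.0
MAX_SPEED_B = 500.0

raw_observations = [50.0, 300.0]
normalized_observations = [
    normalize(raw_observations[0], 0.0, MAX_SPEED_A),
    normalize(raw_observations[1], 0.0, MAX_SPEED_B),
]
```

With these assumed bounds, both observations end up inside [0, 1] rather than being tens or hundreds of times larger than the other inputs.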

    An easy intuition for this is to (inaccurately) view the policy model as a signal multiplier: if your observation signal is 300, how many times can it be multiplied by random weights before one of the float values goes full NaN on you? With a huge range of input values, the network produces a huge range of outputs, which are then clamped to [-1, 1], meaning you will always get -1 or 1. My guess is the 0's are NaNs.
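The clamping part of that intuition can be sketched with a toy model. This is not the ML-Agents network: `toy_policy` is a hypothetical single linear unit with random weights in [-0.5, 0.5], followed by a clamp to [-1, 1]. It only shows how raw inputs like 50 and 300 push almost every output onto the clamp limits, while the scaled inputs from this thread do not. It does not model the NaN guess.

```python
import random


def clamp(x, lo=-1.0, hi=1.0):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def toy_policy(observations, weights):
    """Hypothetical stand-in for a policy: one linear unit, then a clamp."""
    return clamp(sum(w * o for w, o in zip(weights, observations)))


def saturated_fraction(observations, trials=1000, seed=0):
    """Fraction of random weight draws whose output lands exactly on -1 or 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        weights = [rng.uniform(-0.5, 0.5) for _ in observations]
        if abs(toy_policy(observations, weights)) == 1.0:
            hits += 1
    return hits / trials


raw_fraction = saturated_fraction([50.0, 300.0])      # un-normalized velocities
scaled_fraction = saturated_fraction([0.005, 0.3])    # scaled values from the thread
```

In this toy setup the scaled inputs can never reach the clamp limits, because the weighted sum is at most 0.5 * 0.305 in magnitude, whereas the raw inputs saturate for nearly every draw of the weights.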