Question Average rewards start high and then decrease

Discussion in 'ML-Agents' started by Rafmoc, Sep 13, 2022.

  1. Rafmoc


    Sep 13, 2022

    I'm trying to make simple ML-AI Traders for my space game.
    My problem is that every time I start training, the agents do better at the beginning than at the end.
    I don't understand what I'm doing wrong.

    First, what the agents "see" (observations):
    Code (CSharp):
    cyclesEnd - cyclesCount
    StartingCredits
    Credits
    buyingTurn  // bool; it alternates all the time, where true is buying and false is selling
    planet      // where the agent currently is
    // price and quantity of the 2 goods on the planet
    // quantity and average price of the goods already bought
    Then what they can do (actions):
    Code (CSharp):
    planet      // 0-6
    firstGood   // 0-20
    secondGood  // 0-20
    So an agent can travel from planet to planet and can buy or sell 0-20 units of either of its goods.
    It can't fail, because I'm masking all invalid transactions.
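
For readers following along, the masking described above could look roughly like the sketch below. It is plain Python rather than the poster's actual ML-Agents C# code, and every name and rule in it (`allowed_quantities`, `price`, `stock`, `cargo`, the affordability check) is an assumption made up for illustration; the real game rules may differ.

```python
def allowed_quantities(buying_turn, credits, price, stock, cargo, max_qty=20):
    """Return a list of booleans, one per quantity 0..max_qty.

    True means the quantity is a valid choice for this good on this turn.
    Hypothetical rules: when buying, the agent needs enough credits and the
    planet needs enough stock; when selling, the agent needs enough cargo.
    """
    mask = []
    for qty in range(max_qty + 1):
        if buying_turn:
            ok = qty <= stock and qty * price <= credits
        else:
            ok = qty <= cargo
        mask.append(ok)
    return mask


# Buying turn: 100 credits, price 30, planet stocks 5 units.
buy_mask = allowed_quantities(True, credits=100, price=30, stock=5, cargo=0)
# Selling turn: 4 units in cargo.
sell_mask = allowed_quantities(False, credits=100, price=30, stock=5, cargo=4)
print(sum(buy_mask), sum(sell_mask))  # number of valid quantities per case
```

In this sketch quantity 0 is always valid, so a branch never ends up with every option masked.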

    Third, rewards. I'm giving a reward:
    - every time it sells something and has more credits than at the start (small fixed reward)
    - if it has more credits at the end of its life than it started with (small fixed reward)
    - at the end of its life, depending on how much it lost or earned.
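
As a rough illustration of the three reward terms listed above, here is a minimal Python sketch. The magnitudes (`bonus=0.01`) and the choice to scale the final term by the starting credits are assumptions, since the post gives neither; the function names are hypothetical too.

```python
def sale_reward(credits, starting_credits, bonus=0.01):
    """Small fixed reward for a sale made while ahead of the starting balance."""
    return bonus if credits > starting_credits else 0.0


def end_of_life_reward(credits, starting_credits, bonus=0.01):
    """Fixed bonus for finishing ahead, plus a term proportional to profit or loss."""
    fixed = bonus if credits > starting_credits else 0.0
    # Relative profit: +0.5 means the agent ended with 50% more than it started with.
    relative_profit = (credits - starting_credits) / starting_credits
    return fixed + relative_profit


print(sale_reward(1200, 1000))         # ahead of the start: bonus paid
print(sale_reward(900, 1000))          # behind the start: no bonus
print(end_of_life_reward(1500, 1000))  # finished with a profit
print(end_of_life_reward(500, 1000))   # finished with a loss
```

Written out like this, it is easier to see that the per-sale bonus is paid on every qualifying sale, so its total grows with the number of sales rather than with the profit.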

    My parameters:
    Code (CSharp):
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.001
      beta: 0.01
      epsilon: 0.4
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: linear
      epsilon_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
      memory: null
      goal_conditioning_type: hyper
      deterministic: false
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    init_path: null
    keep_checkpoints: 40
    checkpoint_interval: 500000
    # max_steps: I tried both large and small values, both resuming and training in one run.
    time_horizon: 64
    summary_freq: 10000
    threaded: false
    self_play: null
    behavioral_cloning: null
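
To make the `linear` schedules in this config more concrete, here is a small Python sketch of a linear decay. It assumes the documented ML-Agents behavior, where a linearly scheduled value shrinks toward zero as training approaches `max_steps`; the `floor` parameter and the `max_steps` value are hypothetical, and the real trainer may clamp each value at a different minimum. The last lines work out how many minibatch updates the `buffer_size`, `batch_size` and `num_epoch` values imply, assuming each epoch walks the whole buffer once.

```python
def linear_decay(initial, step, max_steps, floor=0.0):
    """Linearly interpolate from `initial` at step 0 down to `floor` at max_steps."""
    progress = min(max(step / max_steps, 0.0), 1.0)
    return initial + (floor - initial) * progress


max_steps = 1_000_000  # hypothetical; the post does not give a value
for step in (0, 250_000, 500_000, 1_000_000):
    lr = linear_decay(0.001, step, max_steps)
    print(step, lr)

# Minibatch updates per filled buffer, from the values in the config.
buffer_size, batch_size, num_epoch = 20480, 2048, 3
minibatches_per_epoch = buffer_size // batch_size
gradient_steps_per_update = minibatches_per_epoch * num_epoch
print(minibatches_per_epoch, gradient_steps_per_update)
```

If the schedule behaves this way, the learning rate near the end of a run is close to zero, so late training changes the policy very little. Whether that explains the curves in this thread is not something the sketch can show.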

    In theory it works perfectly at the start:

    But then it doesn't get any better, right up to the end:

    And when I start a new training run, it is better at the start than at the end:

    It was fairly good when it started, but it keeps getting worse.
    Then at the end it is as bad as it was at the end of the previous training run:

    I don't understand why it isn't improving at all.
    I also tried shorter training runs and it is still the same.

    Am I missing something in my parameters?
    Because it is clear that it can do better.
    At this point, at the end of training, it loses twice as much as it earns.
  2. Rafmoc


    Sep 13, 2022
    Also graphs:
  3. Qacona


    Apr 16, 2022
  4. Rafmoc


    Sep 13, 2022
    Thank you, that's something.
    But could you give me an example of how it can be done for ML-Agents?
    I need to get the context right. How would you try to fix it?
    With more layers? More hidden units? Could some parameters be better? Or am I missing something code-wise?
  5. Qacona


    Apr 16, 2022
    It could be both. I only understand hidden units and layers anecdotally so hopefully someone can come along and give you better advice.

    But the way I understand it is that you need more units if your model isn't smart enough to understand all the data you give it (e.g. think about a 2 layer model with 16 hidden units when you're giving it 128 bits of data a tick: it's never going to be able to understand how to process that much data with only 16 units).

    If the relationships between your input and output data are indirect, you need more layers (e.g. if you're asking your AI to process a letter and return a different letter, say turning the letter A into the letter B, that's a simple task, so you only need a small number of layers; but if you're giving it 3 dimensional coordinates and asking it to figure out the correct pitch, yaw and roll to orient towards a target, the relationship is very, very indirect and will require far more layers).

    By way of example, my AI learned how to fly a plane with a realistic physics model (using pitch yaw and roll) through randomly placed rings and it required 1024 hidden units and 10 layers (which took 3 days to train if you're curious).
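
To put rough numbers on the network sizes mentioned in this thread, the following Python sketch counts the weights and biases of a plain fully connected network. The layout is an assumption (hidden layers of equal width followed by one linear output layer), as are the 13 inputs and 49 outputs, which are guessed from the observation and action lists in the first post (7 + 21 + 21 discrete choices). The real ML-Agents network has extra parts, such as a value head, so treat the results as ballpark figures only.

```python
def mlp_parameter_count(num_inputs, hidden_units, num_layers, num_outputs):
    """Count weights and biases in a fully connected network.

    Layout assumed: num_layers hidden layers of hidden_units each,
    followed by one linear output layer.
    """
    total = 0
    fan_in = num_inputs
    for _ in range(num_layers):
        total += fan_in * hidden_units + hidden_units  # weights + biases
        fan_in = hidden_units
    total += fan_in * num_outputs + num_outputs  # output layer
    return total


# Hypothetical sizes: 13 observation values, 49 action logits.
trader_net = mlp_parameter_count(13, hidden_units=512, num_layers=3, num_outputs=49)
plane_sized_net = mlp_parameter_count(13, hidden_units=1024, num_layers=10, num_outputs=49)
print(trader_net, plane_sized_net)
```

Most of the parameters sit in the hidden-to-hidden layers, so both widening and deepening the network make it grow quickly, which is consistent with the long training time reported for the larger model.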
  6. Rafmoc


    Sep 13, 2022
    Thanks for your reply; however, it looks like that is not the case.
    I more than doubled the layers and hidden units, but it trains in the same bad way:
    average rewards start high and then decrease.
    But I'm still training, so maybe something will improve later.