Question: Agent performing worse during inference

Discussion in 'ML-Agents' started by gblawrence03, Jan 1, 2024.

  1. gblawrence03

     Joined: Oct 8, 2021
     Posts: 1
    Hi everyone, I'm having a really frustrating problem: my agent performs much worse during inference than it did during training. I'll try to include as much detail as I can:

    Environment:
    The agent acts in a 3D environment, controlling a rocket booster that starts in the air a hundred metres or so above the landing pad. Its goal is to land softly. The agent has individual control over the booster's nine engines.

    Actions:
    The agent takes both discrete and continuous actions:
    • It has 9 discrete actions, one for each engine, with 5 choices each: no action, start up engine, shut down engine, throttle up, throttle down.
    • It has 18 continuous actions: the ability to gimbal each engine in the x and y direction.
    • Action masking is implemented to prevent the agent from, for example, starting up an engine that is already running.
    • A decision is requested every 5 steps (the ML-Agents default). Actions are taken between decisions.
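    For illustration, the masking rule described above can be sketched like this (the engine-state logic, choice ordering, and helper name are my assumptions, not the actual project code):

```python
# Sketch of per-engine discrete action masking: each of the 9 engine
# branches has 5 choices, and choices that don't apply to the engine's
# current state are disabled before a decision is requested.
NO_ACTION, START_UP, SHUT_DOWN, THROTTLE_UP, THROTTLE_DOWN = range(5)

def invalid_choices(engine_running: bool) -> set:
    """Return the discrete choices to mask out for one engine branch."""
    if engine_running:
        # can't start an engine that is already running
        return {START_UP}
    # a stopped engine can only be started (or left alone)
    return {SHUT_DOWN, THROTTLE_UP, THROTTLE_DOWN}
```

    In ML-Agents the same decision would be applied per branch inside the agent's action-mask callback; the sketch only captures the validity logic itself.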
    Observations:
    The agent receives 23 observations, including:
    • Position relative to pad
    • Rotation
    • Velocity
    • Angular velocity
    • Fuel remaining
    • Current engine throttles
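    Those items are consistent with a 23-element vector if rotation is a quaternion: 3 (position) + 4 (rotation) + 3 (velocity) + 3 (angular velocity) + 1 (fuel) + 9 (throttles) = 23. A sketch of the assembly (the exact breakdown is my assumption, not confirmed by the post):

```python
# Hypothetical layout of the 23 observations listed above.
def build_observations(pos, rot, vel, ang_vel, fuel, throttles):
    # pos: 3, rot: 4 (quaternion), vel: 3, ang_vel: 3, fuel: 1, throttles: 9
    obs = [*pos, *rot, *vel, *ang_vel, fuel, *throttles]
    assert len(obs) == 23, "must match the agent's vector observation size"
    return obs
```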
    Rewards:
    • The agent receives a penalty each timestep based on its angular velocity, to encourage stable flight.
    • It receives no reward at the end of the episode unless it hits the pad, in which case it receives a positive reward that is higher for a good landing (determined by landing speed, distance to the centre of the pad, etc.).
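    In outline, that reward scheme might look like the following (the constants and the exact shaping function are illustrative guesses, not the real values):

```python
# Sketch of the described rewards: a small per-step penalty proportional
# to angular speed, plus a terminal reward only when the pad is hit.
def step_reward(angular_speed: float, k: float = 0.01) -> float:
    return -k * angular_speed  # discourages spinning / unstable flight

def terminal_reward(hit_pad: bool, landing_speed: float,
                    dist_to_centre: float) -> float:
    if not hit_pad:
        return 0.0  # no end-of-episode reward unless the pad is hit
    # softer, more central landings score higher (illustrative shaping)
    return 1.0 / (1.0 + landing_speed + dist_to_centre)
```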
    Training:
    • Training takes place on a remote Linux machine that is provided by my university. I use GitHub to transfer the built environment to the remote computer and the .onnx models back to mine for testing.
    • I run 8 instances of the environment, each of which contains 8 training areas.
    • Training occurs in real time (1x timescale)
    • My config file is below:
    Code (yaml):
    behaviors:
        BoosterLanding:
            trainer_type: ppo
            max_steps: 1.0e9
            time_horizon: 128
            summary_freq: 100000
            hyperparameters:
                batch_size: 2048
                beta: 2.5e-3
                buffer_size: 81920
                epsilon: 0.2
                num_epoch: 3
                lambd: 0.95
                learning_rate: 2.0e-4
                learning_rate_schedule: linear
            network_settings:
                memory:
                    memory_size: 128
                    sequence_length: 64
                hidden_units: 128
                num_layers: 3
                vis_encode_type: simple
                normalize: true
            # use_recurrent: false
            reward_signals:
                extrinsic:
                    strength: 1.0
                    gamma: 0.99
    According to the TensorBoard statistics, training was nearly perfect: the agent was consistently receiving close to the theoretical maximum reward. However, when I transferred the model back to my computer, it clearly performed worse than the statistics suggested. The statistics showed it consistently landing at below 1 m/s, but in inference it was landing at over 20 m/s.

    If anyone has any suggestions as to what could be going wrong or tests I could carry out to determine the problem, I'd really appreciate it. Thanks.
     
  2. smallg2023

     Joined: Sep 2, 2018
     Posts: 144
    So is it actually getting a lower reward than training reported, or is it just finding another way to get the "max" reward?
    If you're not happy with the landing, you might just need to increase the reward it gets for landing well. Now that it has a good baseline to train from, it shouldn't take much to improve the landing (assuming the flight is OK; that's not clear from your post).