Buffer Size Parameter - Clarification

Discussion in 'ML-Agents' started by BotAcademy, Jul 18, 2020.

  1. BotAcademy

    Joined:
    May 15, 2020
    Posts:
    32
    Hey!

    I was just wondering whether I've understood the buffer_size parameter correctly; the documentation confused me a bit.

    Documentation

    (default = 10240 for PPO and 50000 for SAC) Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. This should be multiple times larger than batch_size. Typically a larger buffer_size corresponds to more stable training updates. In SAC, the max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences.

    Typical range: PPO: 2048 - 409600; SAC: 50000 - 1000000

    My understanding is the following:
    PPO: Policy updates occur every time the number of experiences defined by this parameter has been collected. If we set this value to 3k, it will take 3k agent steps to collect 3k experiences. Those experiences are then split into batches of our defined batch_size and fed into the neural network to perform the weight updates (rough sketch below). This then happens every 3k agent steps.

    SAC: Defines the size of the experience replay buffer, not the update frequency, which is specified by a separate parameter named steps_per_update.
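
    To make sure we're talking about the same thing, this is roughly the PPO update schedule I have in mind (only a sketch of my mental model, with made-up names and numbers, not the actual trainer code):

    import random

    buffer_size = 3000   # experiences to collect between policy updates
    batch_size = 300     # minibatch size fed to the network per gradient step

    buffer = []          # experience tuples (obs, action, reward, ...)

    def on_experience(experience):
        # called once per collected experience
        buffer.append(experience)
        if len(buffer) >= buffer_size:
            update_policy()
            buffer.clear()   # PPO is on-policy, so the buffer is emptied after each update

    def update_policy():
        # split the full buffer into minibatches of batch_size and do one
        # gradient step per minibatch
        random.shuffle(buffer)
        for start in range(0, len(buffer), batch_size):
            minibatch = buffer[start:start + batch_size]
            gradient_step(minibatch)

    def gradient_step(minibatch):
        pass   # stand-in for the actual loss computation + optimizer step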

    If my understanding is correct (please let me know if I got something wrong), I'd rephrase the documentation text to something like this:

    Updated Documentation

    (default = 10240 for PPO and 50000 for SAC) - different behavior for PPO and SAC!
    PPO: Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. This should be multiple times larger than batch_size. Typically a larger buffer_size corresponds to more stable training updates.
    SAC: max size of the experience buffer - on the order of thousands of times greater than the episode length so that SAC can learn from old as well as new experiences.

    Typical range: PPO: 2048 - 409600; SAC: 50000 - 1000000


    If I have understood it correctly, please let me know whether you prefer the updated text, so that I can make a pull request on GitHub.
     
  2. vincentpierre

    Joined:
    May 5, 2017
    Posts:
    160
    Hi,
    Your understanding is correct and I think this change would clarify our documentation. Do you want to make a PR or do you want me to take care of it?
     
  3. BotAcademy

    Joined:
    May 15, 2020
    Posts:
    32
    Okay :) I can make a PR - I will post the PR link here within the next hour.
     
  4. BotAcademy

    Joined:
    May 15, 2020
    Posts:
    32
  5. An-u-rag

    Joined:
    Nov 23, 2020
    Posts:
    6
    Hi, I came across this thread while trying to find an even more detailed explanation of what exactly buffer_size is. But now I think I am unclear about experiences themselves, especially with regard to PPO.

    From my understanding, an experience is a step that involves a query to the policy, not just a regular agent step. What I mean is that if the "Max Step" parameter in the Agent's script parameters is set to, say, 2500, this refers to the number of actions the agent performs per episode, analogous to how many times FixedUpdate is called. However, with the introduction of the "Decision Period" parameter in the Decision Requester component, things get a little more confusing. Suppose Decision Period is set to 5; this means that only every 5 actions is the policy actually queried to get a policy action and, consequently, a reward. This makes it so that 2500/5 = 500 steps are the "true" steps that are actually "experiences". Am I right in thinking this?

    So assuming I am right, we have 500 experiences per episode. Out of these 500 experiences per episode, only the most recent "time_horizon" of them contribute to the outcome of that episode. Am I right about the time_horizon part?

    Now, after every "buffer_size" experiences, the policy is actually updated. So if my buffer_size is 30000, then after 30000 experiences, or 30000/500 = 60 episodes, the policy is updated. I am assuming this buffer is what is called the experience buffer.
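
    Putting that arithmetic together, under my assumption that only decision steps count as experiences (the variable names are just mine, for illustration):

    max_step = 2500        # Agent "Max Step" per episode
    decision_period = 5    # Decision Requester "Decision Period"
    buffer_size = 30000    # trainer buffer_size

    experiences_per_episode = max_step // decision_period                # 500
    episodes_per_policy_update = buffer_size // experiences_per_episode  # 60
    print(experiences_per_episode, episodes_per_policy_update)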

    Next, "batch_size" is another challenging thing to understand. We split our "buffer_size" experiences into n "batch_size" batches of experiences to "update" the policy n times. So if the batch_size is 300 and buffer_size is 30000, we do 30000/300 = 100 "iterations" of gradient descent. Now I don't understand where "num_epoch" comes into play here or what it's purpose is.

    The next question I have is: how do I calculate the memory footprint of my experience buffer? I keep getting errors related to "sequence_length invalid", "broken pipe", or "UnityEnvironment worker 0: environment raised an unexpected exception." when I try to increase my buffer_size to >= 8192. I know increasing the buffer size can lead to more memory consumption (RAM? VRAM?), but I believe this is a relatively small buffer size and I should not be getting these errors. I will post the error logs below, but before that I want to clarify the memory calculation.

    Memory(Experience buffer) = Memory(Observations + Actions + Reward) * buffer_size

    Is this correct?

    In my scenario, I just want a car to put a ball in the goal.

    Car has 2 continuous input actions:
    1. Throttle - Forward / Backward Acceleration - Float
    2. Steering Direction - Float
    Mem(Actions) = 4 + 4 = 8 Bytes

    Observations:
    1. 32 x 32 Grayscale FPS Visual Observation
    2. Single Raycast Distance - Float
    3. Current Steering Direction - Float
    4. Current Throttle - Float
    Mem(Observation) = (32 x 32) + 4 + 4 + 4 = 1024 + 4 + 4 + 4 = 1036 Bytes

    Rewards:
    1. Discrete reward when Car makes contact with Ball
    2. Discrete reward when Ball makes contact with Goal
    3. Inverse Distance Squared from Car to Ball (cutoff after contact with ball)
    4. Inverse Distance Squared from Ball to Goal (starts after contact with ball)
    Mem(Rewards) = 4 (since all are added together to one float)

    Taking these into the equation:
    Mem(Experience Buffer) = (8 + 1036 + 4) * 8192 = 8,585,216 bytes = 8.5 MB
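
    As a sanity check, here is the same calculation written out, plus a variant assuming the visual observation is stored as float32 (4 bytes per value), which I suspect is closer to what actually happens; either way it stays in the tens of megabytes, and I am ignoring any per-experience bookkeeping (log-probs, value estimates, etc.):

    buffer_size = 8192

    # per-experience sizes in bytes, assuming 1 byte per grayscale pixel
    obs_bytes    = 32 * 32 + 4 + 4 + 4   # visual obs + raycast + steering + throttle = 1036
    action_bytes = 4 + 4                 # throttle + steering floats = 8
    reward_bytes = 4                     # single summed float

    per_experience = obs_bytes + action_bytes + reward_bytes            # 1048
    print(per_experience * buffer_size)                                 # 8,585,216 bytes ~ 8.6 MB

    # variant: 32x32 observation stored as float32 instead of 1 byte per pixel
    obs_bytes_f32 = 32 * 32 * 4 + 4 + 4 + 4                             # 4108
    print((obs_bytes_f32 + action_bytes + reward_bytes) * buffer_size)  # ~ 33.8 MB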

    If this is true, then I should have no problem with my 16 GB of RAM and my Nvidia 3070 Ti with 8 GB of dedicated VRAM. I am stating both because I still fail to understand how to properly utilize the GPU during ml-agents training, due to the poor documentation on this subject. The only thing I am doing to utilize my GPU right now is adding --torch-device=cuda to my mlagents-learn command. I have, of course, installed the PyTorch build with CUDA support and made sure to get the corresponding CUDA Toolkit version. I have no idea where this experience buffer is being stored; I checked Task Manager and that was pretty unhelpful too.
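
    For what it's worth, the quick check I know of for confirming that PyTorch can at least see the GPU is just plain torch calls (nothing ml-agents specific):

    import torch

    print(torch.__version__)          # should report a +cuXXX build
    print(torch.cuda.is_available())  # True if the CUDA build found a usable GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # e.g. the 3070 Ti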

    I would really appreciate it if someone could clarify these for me.


    Error logs from my latest run, with batch_size 1024 and buffer_size 10240:
    (mlagents) C:\Users\Anurag\ml-agents-latest_release>mlagents-learn config/Car2Ball_visual_curiosity_config_v3.yaml --run-id=test3_1024_10240 --torch-device=cuda --resume


    Version information:
    ml-agents: 1.0.0,
    ml-agents-envs: 1.0.0,
    Communicator API: 1.5.0,
    PyTorch: 1.13.1+cu117
    [WARNING] Training status file not found. Not all functions will resume properly.
    [INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
    [INFO] Connected to Unity environment with package version 3.0.0-exp.1 and communication version 1.5.0
    [INFO] Connected new brain: Car2Ball?team=0
    [INFO] Hyperparameters for behavior name Car2Ball:
      trainer_type: ppo
      hyperparameters:
        batch_size: 1024
        buffer_size: 10240
        learning_rate: 0.0003
        beta: 0.005
        epsilon: 0.2
        lambd: 0.95
        num_epoch: 3
        shared_critic: True
        learning_rate_schedule: linear
        beta_schedule: constant
        epsilon_schedule: linear
      checkpoint_interval: 500000
      network_settings:
        normalize: False
        hidden_units: 128
        num_layers: 2
        vis_encode_type: simple
        memory: None
        goal_conditioning_type: hyper
        deterministic: False
      reward_signals:
        extrinsic:
          gamma: 0.99
          strength: 1.0
          network_settings:
            normalize: False
            hidden_units: 128
            num_layers: 2
            vis_encode_type: simple
            memory: None
            goal_conditioning_type: hyper
            deterministic: False
        curiosity:
          gamma: 0.99
          strength: 0.02
          network_settings:
            normalize: False
            hidden_units: 128
            num_layers: 2
            vis_encode_type: simple
            memory: None
            goal_conditioning_type: hyper
            deterministic: False
          learning_rate: 0.003
          encoding_size: None
      init_path: None
      keep_checkpoints: 5
      even_checkpoints: False
      max_steps: 30000000
      time_horizon: 128
      summary_freq: 50000
      threaded: True
      self_play: None
      behavioral_cloning: None
    [INFO] Resuming from results\test3_1024_10240\Car2Ball.
    [INFO] Resuming training from step 499978.
    [INFO] Car2Ball. Step: 500000. Time Elapsed: 6.339 s. No episode was completed since last summary. Training.
    [INFO] Exported results\test3_1024_10240\Car2Ball\Car2Ball-499978.onnx
    [INFO] Car2Ball. Step: 550000. Time Elapsed: 46.842 s. Mean Reward: 166.833. Std of Reward: 85.181. Training.
    [INFO] Car2Ball. Step: 600000. Time Elapsed: 89.331 s. Mean Reward: 154.484. Std of Reward: 67.336. Training.
    [INFO] Car2Ball. Step: 650000. Time Elapsed: 131.203 s. Mean Reward: 140.996. Std of Reward: 85.288. Training.
    [INFO] Car2Ball. Step: 700000. Time Elapsed: 173.653 s. Mean Reward: 152.901. Std of Reward: 66.126. Training.
    [INFO] Car2Ball. Step: 750000. Time Elapsed: 217.055 s. Mean Reward: 146.363. Std of Reward: 74.872. Training.
    [INFO] Car2Ball. Step: 800000. Time Elapsed: 256.871 s. Mean Reward: 148.012. Std of Reward: 72.254. Training.
    [INFO] Car2Ball. Step: 850000. Time Elapsed: 298.509 s. Mean Reward: 152.311. Std of Reward: 93.526. Training.
    [INFO] Car2Ball. Step: 900000. Time Elapsed: 340.512 s. Mean Reward: 147.693. Std of Reward: 92.437. Training.
    [INFO] Car2Ball. Step: 950000. Time Elapsed: 382.421 s. Mean Reward: 152.774. Std of Reward: 66.762. Training.
    Exception in thread Thread-2 (trainer_update_func):
    Traceback (most recent call last):
    [ERROR] UnityEnvironment worker 0: environment raised an unexpected exception.
    Traceback (most recent call last):
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
    BrokenPipeError: [WinError 109] The pipe has been ended

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
    EOFError
    Process Process-1:
    Traceback (most recent call last):
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
    BrokenPipeError: [WinError 109] The pipe has been ended

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
    EOFError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 235, in worker
    _send_response(EnvironmentCommand.ENV_EXITED, ex)
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 150, in _send_response
    parent_conn.send(EnvironmentResponse(cmd_name, worker_id, payload))
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
    File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
    BrokenPipeError: [WinError 232] The pipe is being closed


    Thanks
    Anurag