In the code, I see this snippet and its comment in `rl_trainer.py` at line 133:

```python
with hierarchical_timer("process_trajectory"):
    for traj_queue in self.trajectory_queues:
        # We grab at most the maximum length of the queue.
        # This ensures that even if the queue is being filled faster than it is
        # being emptied, the trajectories in the queue are on-policy.
        _queried = False
        for _ in range(traj_queue.qsize()):
            _queried = True
            try:
                t = traj_queue.get_nowait()
                self._process_trajectory(t)
            except AgentManagerQueue.Empty:
                break
```

I can't understand why the trajectories in the queue are on-policy: a trajectory consumed later, but generated before the policy was updated, seems to come from the old policy.
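To make my concern concrete, here is a hypothetical sketch (my own toy code, not ML-Agents) of what I think can happen when the trainer updates while trajectories are still queued:

```python
# Hypothetical illustration of my concern (not ML-Agents code): a trajectory
# sampled by policy version 0 is still in the queue when the trainer has
# already moved on to version 1, so consuming it later would be off-policy.
from queue import Queue

traj_queue: Queue = Queue()

policy_version = 0
traj_queue.put(("trajectory A", policy_version))  # sampled by version 0

policy_version += 1  # the trainer updates the policy to version 1

traj_queue.put(("trajectory B", policy_version))  # sampled by version 1

while not traj_queue.empty():
    traj, sampled_by = traj_queue.get_nowait()
    # "trajectory A" was sampled by v0 but would now be used to train v1.
    print(f"training policy v{policy_version} on {traj} (sampled by v{sampled_by})")
```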
"On-policy" is a reinforcement learning term meaning that a policy update should only be computed from trajectories sampled by that same policy. Does this answer your question, or are you concerned with the implementation?
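To illustrate the pattern in the snippet you quoted, here is a minimal sketch under my own assumptions (not the actual ML-Agents trainer; `train_step`, `buffer`, and the batch threshold are all hypothetical):

```python
# Minimal sketch of a drain-then-update loop: the trainer empties the queue
# (at most qsize() items per pass, so a fast producer cannot stall it) before
# every policy update, so each update only sees trajectories sampled since the
# previous update, i.e. by the policy that is being updated.
from queue import Empty, Queue
from typing import Any, List

def train_step(traj_queue: "Queue[Any]", buffer: List[Any]) -> None:
    for _ in range(traj_queue.qsize()):  # bounded drain, as in the quoted code
        try:
            buffer.append(traj_queue.get_nowait())
        except Empty:
            break
    if len(buffer) >= 4:  # hypothetical stand-in for a batch-size threshold
        print(f"updating the policy on {len(buffer)} on-policy trajectories")
        buffer.clear()  # discard used data so the next update is on-policy too
```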
I am concerned with the implementation, because I don't see how training stays on-policy with a queue. GA3C and IMPALA both use queues to communicate between the trainer and the actors, and both are off-policy. Does ML-Agents stop stepping and train the policy on all collected trajectories, even when the number of experience steps hasn't reached the maximum?