This module contains useful interaction loops for different types of RL agents and will be updated over time.

polgrad_interaction_loop[source]

polgrad_interaction_loop(env:Env, agent:Module, buffer:PGBuffer, num_interactions:int=4000, horizon:int=1000)

Interaction loop for actor-critic policy gradient agent.

This loop does not handle converting between PyTorch Tensors and NumPy arrays, so either your env should first be wrapped in ToTorchWrapper or your agent should accept and return NumPy arrays.
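
If you write your own wrapper instead, a minimal sketch of the kind of conversion involved might look like the following (the class name and details are illustrative assumptions; the real ToTorchWrapper in rl_bolts.env_wrappers may differ):

import gym
import torch

class ToTorchSketch(gym.Wrapper):
    """Illustrative wrapper: hands torch.Tensors to the agent and NumPy-friendly values to the env."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return torch.as_tensor(obs, dtype=torch.float32)

    def step(self, action):
        # Convert the agent's torch.Tensor action back to a NumPy value before stepping the env.
        if isinstance(action, torch.Tensor):
            action = action.detach().cpu().numpy()
        obs, reward, done, info = self.env.step(action)
        return torch.as_tensor(obs, dtype=torch.float32), reward, done, info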

Args:

  • env (gym.Env): Environment to run in.
  • agent (nn.Module): Agent to run within the environment; generates actions, values, and logprobs at each step.
  • buffer (rl_bolts.buffers.PGBuffer-like): Buffer object with same API and function signatures as the PGBuffer.
  • num_interactions (int): How many interactions to collect in the environment.
  • horizon (int): Maximum allowed episode length.

Returns:

  • buffer (rl_bolts.buffers.PGBuffer-like): Buffer filled with interactions.
  • infos (dict): Dictionary of reward and episode length statistics.
  • env_infos (list of dicts): List of all info dicts from the environment.
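
For intuition, here is a simplified sketch of what a collection loop with this signature typically does. The agent.step, buffer.store, and buffer.finish_path calls are assumptions about the agent and buffer APIs, so the actual polgrad_interaction_loop may differ in its details:

import numpy as np

def sketch_polgrad_loop(env, agent, buffer, num_interactions=4000, horizon=1000):
    """Simplified, illustrative version of an actor-critic collection loop."""
    env_infos, ep_returns, ep_lengths = [], [], []
    obs, ep_ret, ep_len = env.reset(), 0.0, 0
    for t in range(num_interactions):
        act, val, logp = agent.step(obs)            # assumed agent API
        next_obs, rew, done, info = env.step(act)
        buffer.store(obs, act, rew, val, logp)      # assumed buffer API
        env_infos.append(info)
        obs, ep_ret, ep_len = next_obs, ep_ret + float(rew), ep_len + 1
        timeout, last_step = ep_len == horizon, t == num_interactions - 1
        if done or timeout or last_step:
            # Bootstrap with the critic's value estimate if the episode was cut off early.
            last_val = 0.0 if done else agent.step(obs)[1]
            buffer.finish_path(last_val)            # assumed buffer API
            ep_returns.append(ep_ret)
            ep_lengths.append(ep_len)
            obs, ep_ret, ep_len = env.reset(), 0.0, 0
    infos = {
        "MeanEpReturn": np.mean(ep_returns), "StdEpReturn": np.std(ep_returns),
        "MaxEpReturn": np.max(ep_returns), "MinEpReturn": np.min(ep_returns),
        "MeanEpLength": np.mean(ep_lengths), "StdEpLength": np.std(ep_lengths),
    }
    return buffer, infos, env_infos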

Here we demonstrate hypothetical usage of the interaction loop.

import gym
from rl_bolts import buffers, env_wrappers, neuralnets  # assuming these modules live in the rl_bolts package

env = gym.make("CartPole-v1")  # make the environment
env = env_wrappers.ToTorchWrapper(env)  # wrap it for conversion to/from torch.Tensors
agent = neuralnets.ActorCritic(  # make the actor-critic agent
    env.observation_space.shape[0],
    env.action_space,
)
buf = buffers.PGBuffer(env.observation_space.shape, env.action_space.shape, 4000)  # create an empty buffer sized to num_interactions
full_buf, infos, env_infos = polgrad_interaction_loop(env, agent, buf)  # run the loop, filling the buffer
for k, v in infos.items():  # print the loop statistics
    print(f"{k}: {v}")
MeanEpReturn: 25.477707006369428
StdEpReturn: 14.071059873100223
MaxEpReturn: 100.0
MinEpReturn: 9.0
MeanEpLength: 25.477707006369428
StdEpLength: 14.071059873100223
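
Note that because CartPole-v1 gives a reward of +1 for every step an episode survives, each episode's return equals its length, which is why the return and length statistics above are identical.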