This module contains the algorithms we have chosen to implement and test.

class PPO[source]

PPO(env:str, hidden_sizes:Optional[tuple]=(32, 32), gamma:Optional[float]=0.99, lam:Optional[float]=0.97, clipratio:Optional[float]=0.2, train_iters:Optional[int]=80, batch_size:Optional[int]=4000, pol_lr:Optional[float]=0.0003, val_lr:Optional[float]=0.001, maxkl:Optional[float]=0.01, seed:Optional[int]=0) :: LightningModule

Implementation of the Proximal Policy Optimization (PPO) algorithm. See the paper: https://arxiv.org/abs/1707.06347

It is a PyTorch Lightning Module. See their docs: https://pytorch-lightning.readthedocs.io/en/latest/

Args:

  • env (str): Name of the environment to run in. Handles environments with vector observations and either a gym.spaces.Box or gym.spaces.Discrete action space.
  • hidden_sizes (tuple): Hidden layer sizes for the actor-critic network.
  • gamma (float): Discount factor.
  • lam (float): Lambda factor for the GAE-Lambda advantage calculation.
  • clipratio (float): Clip ratio for the PPO-clip objective (see the sketch after this list).
  • train_iters (int): Number of update steps to take over the latest batch of data.
  • batch_size (int): Number of environment interactions to collect per update.
  • pol_lr (float): Learning rate for the policy optimizer.
  • val_lr (float): Learning rate for the value optimizer.
  • maxkl (float): Maximum allowed KL divergence between policy updates.
  • seed (int): Random seed for PyTorch and NumPy.
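
For reference, gamma and lam enter through the GAE-Lambda advantage estimate, while clipratio and maxkl govern the clipped surrogate objective and its early-stopping check. The sketch below is illustrative only and is not this class's actual code; it assumes one-dimensional tensors of rewards, value estimates, and action log-probabilities:

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    # Generalized Advantage Estimation: a discounted sum of TD residuals.
    # rewards has shape (T,); values has shape (T + 1,), including a bootstrap value.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advs = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advs[t] = running
    return advs

def ppo_clip_loss(logp, logp_old, advs, clipratio=0.2):
    # Clipped surrogate objective: take the pessimistic (minimum) of the
    # unclipped and clipped policy-ratio terms, negated for gradient descent.
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1 - clipratio, 1 + clipratio) * advs
    return -torch.min(ratio * advs, clipped).mean()

# maxkl is commonly used as an early-stopping check: if the approximate KL
# divergence (logp_old - logp).mean() between the old and new policies grows
# beyond maxkl, the loop of train_iters update steps is cut short.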

Here is an example of training PPO on the CartPole-v1 environment. Since it is a PyTorch Lightning Module, it is trained using the Lightning Trainer API.

Note that this PPO implementation has not yet been thoroughly benchmarked, so it should be considered a work in progress.

The reload_dataloaders_every_epoch flag is needed to ensure that each round of updates is computed on the latest batch of data.
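
For intuition, here is a minimal sketch of that pattern, not this class's actual implementation; the collect_batch method is a hypothetical stand-in for the rollout code that gathers batch_size environment interactions.

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class OnPolicyModule(pl.LightningModule):
    def train_dataloader(self):
        # collect_batch is a hypothetical stand-in for the agent's rollout logic.
        obs, acts, advs, rets = self.collect_batch()
        dataset = TensorDataset(obs, acts, advs, rets)
        # With reload_dataloaders_every_epoch=True, the Trainer calls this hook
        # again at the start of every epoch, so each round of updates sees data
        # gathered by the current policy rather than a stale, fixed dataset.
        return DataLoader(dataset, batch_size=len(dataset))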

To see how we implement this, view the source code for the PPO class.

import pytorch_lightning as pl

agent = PPO("CartPole-v1")
trainer = pl.Trainer(reload_dataloaders_every_epoch=True, max_epochs=25)
trainer.fit(agent)