rl_bolts is intended to be a package of the nuts and bolts of RL algorithms, along with full implementations of a few RL algorithms.

rl_bolts is starting out as a package of just the nuts and bolts of RL; algorithms (and new nuts and bolts) will be added over time as the need arises.

Install

git clone https://github.com/jfpettit/rl_bolts.git

cd rl_bolts

pip install -r requirements.txt
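
To confirm everything is in place, you can run a quick import check from the repository root (a minimal sanity check; it only imports the submodules used later in this README):

# Run from the cloned rl_bolts directory, since the requirements file installs
# dependencies rather than the package itself.
import rl_bolts.neuralnets
import rl_bolts.algorithms
print("rl_bolts imports OK")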

How to use

Import the pieces you need in your code.

The snippet below sets up an actor-critic network for the CartPole-v1 Gym environment.

import rl_bolts.neuralnets as nns
import gym
import torch
env = gym.make("CartPole-v1")
actor_critic = nns.ActorCritic(
    env.observation_space.shape[0],
    env.action_space
)

We can print out the architecture of our actor_critic net below:

actor_critic
ActorCritic(
  (policy): CategoricalPolicy(
    (net): MLP(
      (layers): ModuleList(
        (0): Linear(in_features=4, out_features=32, bias=True)
        (1): Linear(in_features=32, out_features=32, bias=True)
        (2): Linear(in_features=32, out_features=2, bias=True)
      )
    )
  )
  (value_f): MLP(
    (layers): ModuleList(
      (0): Linear(in_features=4, out_features=32, bias=True)
      (1): Linear(in_features=32, out_features=32, bias=True)
      (2): Linear(in_features=32, out_features=1, bias=True)
    )
  )
)
obs = env.reset()
action, logp, value = actor_critic.step(torch.as_tensor(obs, dtype=torch.float32))

The cell above resets the environment to start a new episode, then passes the initial observation through the actor-critic to get an action, the action's log probability, and a value estimate for that state.

print("action", action)
print("logp", logp)
print("value", value)
action tensor(0)
logp tensor(-0.6733)
value tensor(0.1155)
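
Putting these pieces together, here is a minimal rollout loop (a sketch, assuming the classic Gym step API where env.step returns (obs, reward, done, info) and a discrete action space so the sampled action tensor can be cast to an int):

# Roll out one episode with the (untrained) actor-critic.
obs = env.reset()
done, ep_return = False, 0.0
while not done:
    # Query the actor-critic for an action at the current observation.
    action, logp, value = actor_critic.step(torch.as_tensor(obs, dtype=torch.float32))
    # Step the environment with the sampled action and accumulate reward.
    obs, reward, done, info = env.step(int(action))
    ep_return += reward
print("episode return:", ep_return)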

Using a pre-built algorithm

While the primary aim of this package is to provide building blocks for RL algorithms, we also provide a few plug-and-play algorithm implementations. At present, we've implemented PPO (it still needs to be thoroughly benchmarked, so be aware of that). Here is how to use it.

from rl_bolts.algorithms import PPO # import the PPO algorithm
import pytorch_lightning as pl # PPO is a PyTorch Lightning module, so we need Lightning's Trainer
env_to_train_in = "CartPole-v1" # environment to train PPO in
agent = PPO(env_to_train_in) # initialize the agent
trainer = pl.Trainer(reload_dataloaders_every_epoch=True, max_epochs=1) # set up the trainer; in practice you'd set max_epochs higher than one
trainer.fit(agent) # run training
GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name         | Type        | Params
---------------------------------------------
0 | actor_critic | ActorCritic | 2 K   
Epoch 1: 100%|██████████| 1/1 [00:00<00:00,  1.44it/s, loss=96.740, v_num=5, PolicyLoss=6.82e-8, DeltaPolLoss=-.0247, KL=0.0125, Entropy=0.684, TimesEarlyStopped=0, AvgEarlyStopStep=0, ValueLoss=279, DeltaValLoss=-85.7]

MeanEpReturn: 22.587570621468927
StdEpReturn: 12.280362500254949
MaxEpReturn: 79.0
MinEpReturn: 9.0
MeanEpLength: 22.587570621468927
StdEpLength: 12.280362500254949
PolicyLoss: 6.818771680627833e-08
DeltaPolLoss: -0.024729875847697258
KL: 0.012456100434064865
Entropy: 0.6838463544845581
TimesEarlyStopped: 0
AvgEarlyStopStep: 0
ValueLoss: 279.19268798828125
DeltaValLoss: -85.71368408203125


Epoch 1: 100%|██████████| 1/1 [00:02<00:00,  2.49s/it, loss=96.740, v_num=5, PolicyLoss=6.82e-8, DeltaPolLoss=-.0247, KL=0.0125, Entropy=0.684, TimesEarlyStopped=0, AvgEarlyStopStep=0, ValueLoss=279, DeltaValLoss=-85.7]
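
Once training finishes, the fitted agent's network can be used to act in the environment. Below is a minimal evaluation loop (a sketch, assuming the PPO module exposes its network as agent.actor_critic, as the parameter summary above suggests, and the classic Gym step API):

import gym
import torch

# Roll out one episode with the trained policy (sketch; assumes agent.actor_critic
# provides the same .step interface shown earlier and a discrete action space).
eval_env = gym.make(env_to_train_in)
obs = eval_env.reset()
done, ep_return = False, 0.0
while not done:
    with torch.no_grad():
        action, logp, value = agent.actor_critic.step(torch.as_tensor(obs, dtype=torch.float32))
    obs, reward, done, info = eval_env.step(int(action))
    ep_return += reward
print("evaluation return:", ep_return)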