Here we provide a useful set of environment wrappers.

class ToTorchWrapper[source]

ToTorchWrapper(env:Env) :: Wrapper

Environment wrapper that converts actions from torch.Tensor to np.array and observations from np.array to torch.Tensor.

Args:

  • env (gym.Env): Environment to wrap. Should be a subclass of gym.Env and follow the OpenAI Gym API.

ToTorchWrapper.reset[source]

ToTorchWrapper.reset(*args, **kwargs)

Reset the environment.

Returns:

  • tensor_obs (torch.Tensor): Output of reset as a PyTorch Tensor.

ToTorchWrapper.step[source]

ToTorchWrapper.step(action:Tensor, *args, **kwargs)

Execute environment step.

Converts the torch.Tensor action to NumPy before stepping the environment and returns observations as a torch.Tensor.

Returns:

  • tensor_obs (torch.Tensor): Next observations as a PyTorch Tensor.
  • reward (float or int): The reward earned at the current timestep.
  • done (bool): Whether the episode is in a terminal state.
  • infos (dict): The info dict from the environment.

ToTorchWrapper.action2np[source]

ToTorchWrapper.action2np(action:Tensor)

Convert torch.Tensor action to NumPy.

Args:

  • action (torch.Tensor): The action to convert.

Returns:

  • np_act (np.array or int): The action converted to a NumPy array or int.
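
The wrapper's implementation is not shown here, but a minimal sketch of how such a conversion wrapper could look is given below. The class name TensorWrapper is hypothetical, and the real ToTorchWrapper may differ in its details.

import gym
import torch

# Hypothetical sketch of a tensor-conversion wrapper; illustration only.
class TensorWrapper(gym.Wrapper):
    def reset(self, *args, **kwargs):
        obs = self.env.reset(*args, **kwargs)
        return torch.as_tensor(obs, dtype=torch.float32)

    def step(self, action, *args, **kwargs):
        obs, reward, done, infos = self.env.step(self.action2np(action), *args, **kwargs)
        return torch.as_tensor(obs, dtype=torch.float32), reward, done, infos

    def action2np(self, action):
        # Discrete action spaces expect a plain int, continuous ones a NumPy array.
        if isinstance(self.action_space, gym.spaces.Discrete):
            return int(action.item())
        return action.detach().cpu().numpy()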

Example usage of the ToTorchWrapper is demonstrated below.

env = gym.make("CartPole-v1")
env = ToTorchWrapper(env)
obs = env.reset()
print("initial obs:", obs)
action = env.action_space.sample()
# We need to convert the action to a PyTorch Tensor because ToTorchWrapper expects actions as Tensors.
# Normally you would not need to do this; a PyTorch actor network outputs Tensors by default.
action = torch.as_tensor(action, dtype=torch.float32)
stepped = env.step(action)
print("stepped once:", stepped)

print("\nEntering interaction loop! \n")
# interaction loop
obs = env.reset()
ret = 0
for i in range(100):
    action = torch.as_tensor(env.action_space.sample(), dtype=torch.float32)
    state, reward, done, _ = env.step(action)
    ret += reward
    if done:
        print(f"Random policy got {ret} reward!")
        obs = env.reset()
        ret = 0
        if i < 99:
            print("Starting new episode.")
    if i == 99:
        print(f"\nInteraction loop ended! Got reward {ret} before episode was cut off.")
        break
initial obs: tensor([ 0.0439, -0.0047,  0.0234,  0.0489])
stepped once: (tensor([ 0.0438,  0.1901,  0.0243, -0.2363]), 1.0, False, {})

Entering interaction loop! 

Random policy got 25.0 reward!
Starting new episode.
Random policy got 16.0 reward!
Starting new episode.
Random policy got 16.0 reward!
Starting new episode.
Random policy got 12.0 reward!
Starting new episode.
Random policy got 11.0 reward!
Starting new episode.

Interaction loop ended! Got reward 20.0 before episode was cut off.

Note: The StateNormalizeWrapper still needs testing. At present, use the ToTorchWrapper if you need guaranteed behavior.

class StateNormalizeWrapper[source]

StateNormalizeWrapper(env:Env, beta:Optional[float]=0.99, eps:Optional[float]=1e-08) :: Wrapper

Environment wrapper for normalizing states.

Args:

  • env (gym.Env): Environment to wrap.
  • beta (float): Beta parameter for running mean and variance calculation.
  • eps (float): Parameter to avoid division by zero in case variance goes to zero.

StateNormalizeWrapper.reset[source]

StateNormalizeWrapper.reset(*args, **kwargs)

Reset environment and return normalized state.

Returns:

  • norm_state (np.array): Normalized state.

StateNormalizeWrapper.normalize[source]

StateNormalizeWrapper.normalize(state:array)

Update running mean and variance parameters and normalize input state.

Args:

  • state (np.array): State to normalize and to use to calculate update.

Returns:

  • norm_state (np.array): Normalized state.
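
The exact update rule used by StateNormalizeWrapper is not documented here, but a common way to use beta and eps is an exponential moving average of the state mean and variance. The sketch below is an assumption along those lines, not the wrapper's actual source.

import numpy as np

# Hypothetical EMA-based normalizer; StateNormalizeWrapper's internals may differ.
class RunningNormalizer:
    def __init__(self, beta=0.99, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.mean, self.var = 0.0, 1.0

    def normalize(self, state):
        # Update the running statistics with the new state...
        self.mean = self.beta * self.mean + (1 - self.beta) * state
        self.var = self.beta * self.var + (1 - self.beta) * (state - self.mean) ** 2
        # ...then normalize, with eps guarding against division by zero.
        return (state - self.mean) / (np.sqrt(self.var) + self.eps)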

StateNormalizeWrapper.step[source]

StateNormalizeWrapper.step(action:Union[array, int, float], *args, **kwargs)

Step environment and normalize state.

Args:

  • action (np.array or int or float): Action to use to step the environment.

Returns:

  • norm_state (np.array): Normalized state.
  • reward (int or float): Reward earned at step.
  • done (bool): Whether the episode is over.
  • infos (dict): Any infos from the environment.

Here is a demonstration of using the StateNormalizeWrapper.

env = gym.make("CartPole-v1")
env = StateNormalizeWrapper(env)
obs = env.reset()
print("initial obs:", obs)
# StateNormalizeWrapper works with plain NumPy arrays, so there is no need to convert the action to a PyTorch Tensor.
action = env.action_space.sample()
stepped = env.step(action)
print("stepped once:", stepped)

print("\nEntering interaction loop! \n")
# interaction loop
obs = env.reset()
ret = 0
for i in range(100):
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    ret += reward
    if done:
        print(f"Random policy got {ret} reward!")
        obs = env.reset()
        ret = 0
        if i < 99:
            print("Starting new episode.")
    if i == 99:
        print(f"\nInteraction loop ended! Got reward {ret} before episode was cut off.")
        break
initial obs: [ 0.01758044 -0.04254612 -0.02514053  0.01284619]
stepped once: (array([ 0.01663708,  0.15312245, -0.02475622, -0.28764562]), 1.0, False, {})

Entering interaction loop! 

Random policy got 10.0 reward!
Starting new episode.
Random policy got 11.0 reward!
Starting new episode.
Random policy got 20.0 reward!
Starting new episode.
Random policy got 22.0 reward!
Starting new episode.
Random policy got 12.0 reward!
Starting new episode.
Random policy got 22.0 reward!
Starting new episode.

Interaction loop ended! Got reward 3.0 before episode was cut off.

Note: The RewardScalerWrapper still needs testing. At present, use the ToTorchWrapper if you need guaranteed behavior.

class RewardScalerWrapper[source]

RewardScalerWrapper(env:Env, beta:Optional[float]=0.99, eps:Optional[float]=1e-08) :: Wrapper

Environment wrapper for scaling rewards over the course of training.

Calculates a running mean and variance of the observed rewards and scales each reward by the running standard deviation.

Computes: $r_t / (\sigma + \epsilon)$

RewardScalerWrapper.scale[source]

RewardScalerWrapper.scale(reward:Union[int, float])

Update the running mean and variance for rewards and scale the reward by the running standard deviation.

Args:

  • reward (int or float): reward to scale.

Returns:

  • scaled_rew (float): Reward scaled by the running standard deviation.
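
For intuition, the sketch below shows one way the running-variance bookkeeping behind the formula above could be implemented. It is an assumption for illustration, not RewardScalerWrapper's actual source.

import numpy as np

# Hypothetical variance-based reward scaler; the real wrapper may track statistics differently.
class RunningRewardScaler:
    def __init__(self, beta=0.99, eps=1e-8):
        self.beta, self.eps = beta, eps
        self.mean, self.var = 0.0, 1.0

    def scale(self, reward):
        # Exponential moving estimates of the reward mean and variance.
        self.mean = self.beta * self.mean + (1 - self.beta) * reward
        self.var = self.beta * self.var + (1 - self.beta) * (reward - self.mean) ** 2
        # Scale by the running standard deviation: r_t / (sigma + eps).
        return reward / (np.sqrt(self.var) + self.eps)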

RewardScalerWrapper.step[source]

RewardScalerWrapper.step(action, *args, **kwargs)

Step the environment and scale the reward.

Args:

  • action (np.array or int or float): Action to use to step the environment.

Returns:

  • state (np.array): Next state from environment.
  • scaled_rew (float): Reward scaled by the running standard deviation.
  • done (bool): Indicates whether the episode is over.
  • infos (dict): Any information from the environment.

Here is an example usage of the RewardScalerWrapper.

env = gym.make("CartPole-v1")
env = RewardScalerWrapper(env)
obs = env.reset()
print("initial obs:", obs)
action = env.action_space.sample()
stepped = env.step(action)
print("stepped once:", stepped)

print("\nEntering interaction loop! \n")
# interaction loop
obs = env.reset()
ret = 0
for i in range(100):
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    ret += reward
    if done:
        print(f"Random policy got {ret} reward!")
        obs = env.reset()
        ret = 0
        if i < 99:
            print("Starting new episode.")
    if i == 99:
        print(f"\nInteraction loop ended! Got reward {ret} before episode was cut off.")
        break
initial obs: [-0.03681186 -0.01856562  0.01785368 -0.03059186]
stepped once: (array([-0.03718318, -0.213939  ,  0.01724184,  0.26767019]), 0.9900985098023393, False, {})

Entering interaction loop! 

Random policy got 25.870551503555898 reward!
Starting new episode.
Random policy got 6.588056312915322 reward!
Starting new episode.
Random policy got 26.21475981461599 reward!
Starting new episode.
Random policy got 6.0767512893302875 reward!
Starting new episode.

Interaction loop ended! Got reward 2.871941385677035 before episode was cut off.

Combining Wrappers

All of these wrappers can be composed! Just be sure to apply the ToTorchWrapper last, because the other wrappers expect NumPy arrays as input, while the ToTorchWrapper converts outputs to PyTorch Tensors. Below is an example.

env = gym.make("CartPole-v1")
env = StateNormalizeWrapper(env)
print(f"After wrapping with StateNormalizeWrapper, output is still a NumPy array: {env.reset()}")
env = RewardScalerWrapper(env)
print(f"After wrapping with RewardScalerWrapper, output is still a NumPy array: {env.reset()}")
env = ToTorchWrapper(env)
print(f"But after wrapping with ToTorchWrapper, output is now a PyTorch Tensor: {env.reset()}")
After wrapping with StateNormalizeWrapper, output is still a NumPy array: [-0.0072026  -0.00074714  0.01404444  0.01655632]
After wrapping with RewardScalerWrapper, output is still a NumPy array: [-0.01601177 -0.03326409 -0.02039952  0.02392616]
But after wrapping with ToTorchWrapper, output is now a PyTorch Tensor: tensor([-0.0485,  0.0209, -0.0479, -0.0501])

Note: The BestPracticesWrapper still needs testing. At present, use the ToTorchWrapper if you need guaranteed behavior.

class BestPracticesWrapper[source]

BestPracticesWrapper(env:Env) :: Wrapper

This wrapper combines the wrappers that we believe (from experience, and from reading papers and blog posts and watching lectures) constitute best practices.

At the moment it combines the wrappers below in the order listed:

  1. StateNormalizeWrapper
  2. RewardScalerWrapper
  3. ToTorchWrapper

Args:

  • env (gym.Env): Environment to wrap.

BestPracticesWrapper.reset[source]

BestPracticesWrapper.reset()

Reset environment.

Returns:

  • obs (torch.Tensor): Starting observations from the environment.

BestPracticesWrapper.step[source]

BestPracticesWrapper.step(action, *args, **kwargs)

Step the environment forward using input action.

Args:

  • action (torch.Tensor): Action to step the environment with.

Returns:

  • obs (torch.Tensor): Next step observations.
  • reward (int or float): Reward for the last timestep.
  • done (bool): Whether the episode is over.
  • infos (dict): Dictionary of any info from the environment.
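
Conceptually, BestPracticesWrapper is just the composition from the "Combining Wrappers" section above. A minimal sketch, assuming it simply applies the three wrappers in the documented order, could look like this (illustration only):

import gym

# Hypothetical sketch: compose the three wrappers, applying ToTorchWrapper last.
class ComposedWrapper(gym.Wrapper):
    def __init__(self, env: gym.Env):
        env = StateNormalizeWrapper(env)
        env = RewardScalerWrapper(env)
        env = ToTorchWrapper(env)
        super().__init__(env)  # gym.Wrapper then forwards reset/step to the wrapped env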

Below is a usage example of the BestPracticesWrapper. It is used in the same way as the ToTorchWrapper.

env = gym.make("CartPole-v1")
env = BestPracticesWrapper(env)
obs = env.reset()
print("initial obs:", obs)
action = torch.as_tensor(env.action_space.sample(), dtype=torch.float32)
stepped = env.step(action)
print("stepped once:", stepped)

print("\nEntering interaction loop! \n")
# interaction loop
obs = env.reset()
ret = 0
for i in range(100):
    action = torch.as_tensor(env.action_space.sample(), dtype=torch.float32)
    state, reward, done, _ = env.step(action)
    ret += reward
    if done:
        print(f"Random policy got {ret} reward!")
        obs = env.reset()
        ret = 0
        if i < 99:
            print("Starting new episode.")
    if i == 99:
        print(f"\nInteraction loop ended! Got reward {ret} before episode was cut off.")
        break
initial obs: tensor([-0.0468, -0.0292, -0.0462,  0.0099])
stepped once: (tensor([-0.0471, -0.2234, -0.0458,  0.2874]), 0.9900985098023393, False, {})

Entering interaction loop! 

Random policy got 22.00490875509153 reward!
Starting new episode.
Random policy got 22.999644404672914 reward!
Starting new episode.
Random policy got 16.764618492994995 reward!
Starting new episode.
Random policy got 4.907345113364475 reward!
Starting new episode.

Interaction loop ended! Got reward 0.9455435399706331 before episode was cut off.