d3rlpy - An offline deep reinforcement learning library.¶
d3rlpy is an easy-to-use offline deep reinforcement learning library.
$ pip install d3rlpy
d3rlpy provides state-of-the-art offline deep reinforcement learning algorithms through out-of-the-box scikit-learn-style APIs. Unlike other RL libraries, the provided algorithms can achieve performance beyond their original papers via several tweaks.
Tutorials¶
Getting Started¶
This tutorial is also available on Google Colaboratory
Install¶
First of all, let’s install d3rlpy on your machine:
$ pip install d3rlpy
See more information at Installation.
Note
If a core dump error occurs in this tutorial, please try Install from source.
Note
d3rlpy supports Python 3.6+. Make sure which Python version you are using.
Note
If you use a GPU, please set up CUDA first.
Prepare Dataset¶
You can make your own dataset without any effort. In this tutorial, let’s start with the built-in datasets. If you want to make a new dataset, see MDPDataset.
d3rlpy provides suites of datasets for testing algorithms and research. See more documents at Datasets.
from d3rlpy.datasets import get_cartpole # CartPole-v0 dataset
from d3rlpy.datasets import get_pendulum # Pendulum-v0 dataset
from d3rlpy.datasets import get_pybullet # PyBullet task datasets
from d3rlpy.datasets import get_atari # Atari 2600 task datasets
from d3rlpy.datasets import get_d4rl # D4RL datasets
Here, we use the CartPole dataset to instantly check training results.
dataset, env = get_cartpole()
You can split the dataset into training and test datasets just like in supervised learning as follows.
from sklearn.model_selection import train_test_split
train_episodes, test_episodes = train_test_split(dataset, test_size=0.2)
Setup Algorithm¶
There are many algorithms available in d3rlpy. Since CartPole is a simple task, let’s start with DQN, the Q-learning algorithm proposed as the first deep reinforcement learning algorithm.
from d3rlpy.algos import DQN
# if you don't use GPU, set use_gpu=False instead.
dqn = DQN(use_gpu=True)
# initialize neural networks with the given observation shape and action size.
# this is not necessary when you directly call fit or fit_online method.
dqn.build_with_dataset(dataset)
See more algorithms and configurations at Algorithms.
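DQN is an online algorithm applied here to an offline dataset; offline-specific algorithms such as DiscreteCQL, which appears later in this documentation, can be set up in exactly the same way. A minimal sketch:
# a minimal sketch: swapping in an offline-specific algorithm
from d3rlpy.algos import DiscreteCQL
cql = DiscreteCQL(use_gpu=False)
cql.build_with_dataset(dataset)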
Setup Metrics¶
Collecting evaluation metrics is important to train algorithms properly. In d3rlpy, metrics are computed through scikit-learn-style scorer functions.
from d3rlpy.metrics.scorer import td_error_scorer
from d3rlpy.metrics.scorer import average_value_estimation_scorer
# calculate metrics with test dataset
td_error = td_error_scorer(dqn, test_episodes)
Since evaluating algorithms without access to the environment is still difficult, the algorithm can be directly evaluated with the evaluate_on_environment function if an environment is available to interact with.
from d3rlpy.metrics.scorer import evaluate_on_environment
# set environment in scorer function
evaluate_scorer = evaluate_on_environment(env)
# evaluate algorithm on the environment
rewards = evaluate_scorer(dqn)
See more metrics and configurations at Metrics.
Start Training¶
Now, you have everything you need to start data-driven training.
dqn.fit(train_episodes,
eval_episodes=test_episodes,
n_epochs=10,
scorers={
'td_error': td_error_scorer,
'value_scale': average_value_estimation_scorer,
'environment': evaluate_scorer
})
Then, you will see training progress in the console like below:
augmentation=[]
batch_size=32
bootstrap=False
dynamics=None
encoder_params={}
eps=0.00015
gamma=0.99
learning_rate=6.25e-05
n_augmentations=1
n_critics=1
n_frames=1
q_func_factory=mean
scaler=None
share_encoder=False
target_update_interval=8000.0
use_batch_norm=True
use_gpu=None
observation_shape=(4,)
action_size=2
100%|███████████████████████████████████| 2490/2490 [00:24<00:00, 100.63it/s]
epoch=0 step=2490 value_loss=0.190237
epoch=0 step=2490 td_error=1.483964
epoch=0 step=2490 value_scale=1.241220
epoch=0 step=2490 environment=157.400000
100%|███████████████████████████████████| 2490/2490 [00:24<00:00, 100.63it/s]
...
See more about logging at Logging.
Once the training is done, your algorithm is ready to make decisions.
observation = env.reset()
# return actions based on the greedy-policy
action = dqn.predict([observation])[0]
# estimate action-values
value = dqn.predict_value([observation], [action])[0]
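To see the trained policy in action, you can roll it out in the environment. A minimal sketch, assuming the classic Gym step API (four return values) used by CartPole-v0 above:
# a minimal rollout sketch (assumes the classic Gym step API with four return values)
observation = env.reset()
done = False
episode_reward = 0.0
while not done:
    action = dqn.predict([observation])[0]
    observation, reward, done, _ = env.step(action)
    episode_reward += reward
print(episode_reward)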
Save and Load¶
d3rlpy provides several ways to save trained models.
# save full parameters
dqn.save_model('dqn.pt')
# load full parameters
dqn2 = DQN()
dqn2.build_with_dataset(dataset)
dqn2.load_model('dqn.pt')
# save the greedy-policy as TorchScript
dqn.save_policy('policy.pt')
# save the greedy-policy as ONNX
dqn.save_policy('policy.onnx', as_onnx=True)
See more information at Save and Load.
Play with MDPDataset¶
d3rlpy provides MDPDataset, a dedicated dataset structure for offline RL. In this tutorial, you can learn how to play with MDPDataset. Check MDPDataset for more information.
Prepare Dataset¶
In this tutorial, let’s use a built-in dataset for CartPole-v0.
# prepare dataset
dataset, _ = d3rlpy.datasets.get_dataset("cartpole-random")
Understand Episode and Transition¶
MDPDataset hierarchically structures the dataset into Episode and Transition objects.

You can interact with this underlying data structure.
# first episode
episode = dataset.episodes[0]
# access to episode data
episode.observations
episode.actions
episode.rewards
# first transition
transition = episode.transitions[0]
# access to tuple
transition.observation
transition.action
transition.reward
transition.next_observation
# linked list structure
next_transition = transition.next_transition
assert transition is next_transition.prev_transition
Feed MDPDataset to Algorithm¶
There are multiple ways to feed datasets to algorithms for offline RL.
dqn = d3rlpy.algos.DQN()
# feed as MDPDataset
dqn.fit(dataset, n_steps=10000)
# feed as Episode
dqn.fit(dataset.episodes, n_steps=10000)
# feed as Transition
transitions = []
for episode in dataset.episodes:
transitions.extend(episode.transitions)
dqn.fit(transitions, n_steps=10000)
The advantage of this design is that you can split datasets both episode-wise and transition-wise. If you split datasets in an episode-wise manner, you can completely remove all transitions included in the test episodes, which makes validation work better.
# use scikit-learn utility
from sklearn.model_selection import train_test_split
# episode-wise split
train_episodes, test_episodes = train_test_split(dataset.episodes)
# setup metrics
metrics = {
"soft_opc": d3rlpy.metrics.scorer.soft_opc_scorer(return_threshold=180),
"initial_value": d3rlpy.metrics.scorer.initial_state_value_estimation_scorer,
}
# start training with episode-wise splits
dqn.fit(
train_episodes,
n_steps=10000,
scorers=metrics,
eval_episodes=test_episodes,
)
Mix Datasets¶
You can also mix multiple datasets to train algorithms.
replay_dataset, _ = d3rlpy.datasets.get_dataset("cartpole-replay")
# extend the replay dataset with the random dataset
replay_dataset.extend(dataset)
# you can also save it and load it later
replay_dataset.dump("mixed_dataset.h5")
mixed_dataset = d3rlpy.dataset.MDPDataset.load("mixed_dataset.h5")
Data Collection¶
d3rlpy provides APIs to support data collection from environments. This feature is specifically useful if you want to build your own original datasets for research or practice purposes.
Prepare Environment¶
d3rlpy supports environments with the OpenAI Gym interface. In this tutorial, let’s use the simple CartPole environment.
import gym
env = gym.make("CartPole-v0")
Data Collection with Random Policy¶
If you want to collect experiences with a uniformly random policy, you can use RandomPolicy and DiscreteRandomPolicy. This procedure corresponds to the random datasets in D4RL.
import d3rlpy
# setup algorithm
random_policy = d3rlpy.algos.DiscreteRandomPolicy()
# prepare experience replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# start data collection
random_policy.collect(env, buffer, n_steps=100000)
# export as MDPDataset
dataset = buffer.to_mdp_dataset()
# save MDPDataset
dataset.dump("random_policy_dataset.h5")
Data Collection with Trained Policy¶
If you want to collect experiences with a previously trained policy, you can still use the same set of APIs. This procedure corresponds to the medium datasets in D4RL.
# setup algorithm
dqn = d3rlpy.algos.DQN()
# initialize neural networks before loading parameters
dqn.build_with_env(env)
# load pretrained parameters
dqn.load_model("dqn_model.pt")
# prepare experience replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# start data collection
dqn.collect(env, buffer, n_steps=100000)
# export as MDPDataset
dataset = buffer.to_mdp_dataset()
# save MDPDataset
dataset.dump("trained_policy_dataset.h5")
Data Collection while Training Policy¶
If you want to use experiences collected during training to build a new dataset, you can simply use fit_online and save the dataset. This procedure corresponds to the replay datasets in D4RL.
# setup algorithm
dqn = d3rlpy.algos.DQN()
# prepare experience replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# prepare exploration strategy if necessary
explorer = d3rlpy.online.explorers.ConstantEpsilonGreedy(0.3)
# start data collection
dqn.fit_online(env, buffer, explorer, n_steps=100000)
# export as MDPDataset
dataset = buffer.to_mdp_dataset()
# save MDPDataset
dataset.dump("replay_dataset.h5")
Create Your Dataset¶
The data collection API is introduced in Data Collection. In this tutorial, you can learn how to build your dataset from logged data such as the user data collected in your web service.
Prepare Logged Data¶
First of all, you need to prepare your logged data.
In this tutorial, let’s use randomly generated data.
terminals represents the last step of each episode. If terminals[i] == 1.0, the i-th step is a terminal state. Otherwise, you need to set zeros for non-terminal states.
import numpy as np
# vector observation
# 1000 steps of observations with shape of (100,)
observations = np.random.random((1000, 100))
# 1000 steps of actions with shape of (4,)
actions = np.random.random((1000, 4))
# 1000 steps of rewards
rewards = np.random.random(1000)
# 1000 steps of terminal flags
terminals = np.random.randint(2, size=1000)
Build MDPDataset¶
Once your logged data is ready, you can build MDPDataset
object.
import d3rlpy
dataset = d3rlpy.dataset.MDPDataset(
observations=observations,
actions=actions,
rewards=rewards,
terminals=terminals,
)
Set Timeout Flags¶
In RL, there is the case where you want to stop an episode without a terminal
state.
For example, if you’re collecting data of a 4-legged robot walking forward,
the walking task basically never ends as long as the robot keeps walking while
the logged episode must stop somewhere.
In this case, you can use episode_terminals
to represent this timeout states.
# terminal states
terminals = np.zeros(1000)
# timeout states
episode_terminals = np.random.randint(2, size=1000)
dataset = d3rlpy.dataset.MDPDataset(
observations=observations,
actions=actions,
rewards=rewards,
terminals=terminals,
episode_terminals=episode_terminals,
)
Preprocess / Postprocess¶
In this tutorial, you can learn how to preprocess datasets and postprocess continuous action outputs. Please check Preprocessing for more information.
Preprocess Observations¶
If your dataset includes unnormalized observations, you can normalize or standardize the observations by specifying the scaler argument with a string alias. In this case, the statistics of the dataset will be computed at the beginning of offline training.
import d3rlpy
dataset, _ = d3rlpy.datasets.get_dataset("pendulum-random")
# specify by string alias
sac = d3rlpy.algos.SAC(scaler="standard")
Alternatively, you can manually instantiate preprocessing parameters.
# setup manually
import numpy as np
mean = np.mean(dataset.observations, axis=0, keepdims=True)
std = np.std(dataset.observations, axis=0, keepdims=True)
scaler = d3rlpy.preprocessing.StandardScaler(mean=mean, std=std)
# specify by object
sac = d3rlpy.algos.SAC(scaler=scaler)
Please check Preprocessing for the full list of available observation preprocessors.
Preprocess / Postprocess Actions¶
In training with a continuous action-space, the actions must be in the range [-1.0, 1.0] due to the underlying tanh activation of the policy function. In d3rlpy, you can easily normalize inputs and denormalize outputs instead of normalizing the dataset by yourself.
# specify by string alias
sac = d3rlpy.algos.SAC(action_scaler="min_max")
# setup manually
minimum_action = np.min(dataset.actions, axis=0, keepdims=True)
maximum_action = np.max(dataset.actions, axis=0, keepdims=True)
action_scaler = d3rlpy.preprocessing.MinMaxActionScaler(
minimum=minimum_action,
maximum=maximum_action,
)
# specify by object
sac = d3rlpy.algos.SAC(action_scaler=action_scaler)
Please check Preprocessing for the full list of available action preprocessors.
Preprocess Rewards¶
The effect of scaling rewards is not well studied yet in RL community, however, it’s confirmed that the reward scale affects training performance.
# specify by string alias
sac = d3rlpy.algos.SAC(reward_scaler="standard")
# setup manually
mean = np.mean(dataset.rewards, axis=0, keepdims=True)
std = np.std(dataset.rewards, axis=0, keepdims=True)
reward_scaler = d3rlpy.preprocessing.StandardRewardScaler(mean=mean, std=std)
# specify by object
sac = d3rlpy.algos.SAC(reward_scaler=reward_scaler)
Please check Preprocessing for the full list of available reward preprocessors.
Customize Neural Network¶
In this tutorial, you can learn how to integrate your own neural network models to d3rlpy. Please check Network Architectures for more information.
Prepare PyTorch Model¶
If you’re familiar with PyTorch, this step should be easy for you. Please note that your model must have a get_feature_size method that tells d3rlpy the feature size of the final layer.
import torch
import torch.nn as nn
import d3rlpy
class CustomEncoder(nn.Module):
    def __init__(self, observation_shape, feature_size):
        super().__init__()
        self.feature_size = feature_size
        self.fc1 = nn.Linear(observation_shape[0], feature_size)
        self.fc2 = nn.Linear(feature_size, feature_size)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        return h

    # THIS IS IMPORTANT!
    def get_feature_size(self):
        return self.feature_size
Setup EncoderFactory¶
Once you set up your PyTorch model, you need to set up an EncoderFactory. In your EncoderFactory class, you need to define the create and get_params methods as well as the TYPE attribute. The TYPE attribute and get_params method are used to serialize your customized neural network configuration.
class CustomEncoderFactory(d3rlpy.models.encoders.EncoderFactory):
    TYPE = "custom"  # this is necessary

    def __init__(self, feature_size):
        self.feature_size = feature_size

    def create(self, observation_shape):
        return CustomEncoder(observation_shape, self.feature_size)

    def get_params(self, deep=False):
        return {"feature_size": self.feature_size}
Now, you can use your model with d3rlpy.
# integrate your model into d3rlpy algorithm
dqn = d3rlpy.algos.DQN(encoder_factory=CustomEncoderFactory(64))
Support Q-function for Actor-Critic¶
In the above example, your original model is designed as a network that takes an observation as input. However, if you customize a Q-function of an actor-critic algorithm (e.g. SAC), you need to prepare an action-conditioned model.
class CustomEncoderWithAction(nn.Module):
    def __init__(self, observation_shape, action_size, feature_size):
        super().__init__()
        self.feature_size = feature_size
        self.fc1 = nn.Linear(observation_shape[0] + action_size, feature_size)
        self.fc2 = nn.Linear(feature_size, feature_size)

    def forward(self, x, action):
        h = torch.cat([x, action], dim=1)
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(h))
        return h

    def get_feature_size(self):
        return self.feature_size
Finally, you can update your CustomEncoderFactory
as follows.
class CustomEncoderFactory(d3rlpy.models.encoders.EncoderFactory):
    TYPE = "custom"

    def __init__(self, feature_size):
        self.feature_size = feature_size

    def create(self, observation_shape):
        return CustomEncoder(observation_shape, self.feature_size)

    def create_with_action(self, observation_shape, action_size, discrete_action):
        return CustomEncoderWithAction(observation_shape, action_size, self.feature_size)

    def get_params(self, deep=False):
        return {"feature_size": self.feature_size}
Now, you can customize actor-critic algorithms.
encoder_factory = CustomEncoderFactory(64)
sac = d3rlpy.algos.SAC(
actor_encoder_factory=encoder_factory,
critic_encoder_factory=encoder_factory,
)
Online RL¶
Prepare Environment¶
d3rlpy supports environments with the OpenAI Gym interface. In this tutorial, let’s use the simple CartPole environment.
import gym
# for training
env = gym.make("CartPole-v0")
# for evaluation
eval_env = gym.make("CartPole-v0")
Setup Algorithm¶
Just like offline RL training, you can setup an algorithm object.
import d3rlpy
# if you don't use GPU, set use_gpu=False instead.
dqn = d3rlpy.algos.DQN(
batch_size=32,
learning_rate=2.5e-4,
target_update_interval=100,
use_gpu=True,
)
# initialize neural networks with the given environment object.
# this is not necessary when you directly call fit or fit_online method.
dqn.build_with_env(env)
Setup Online RL Utilities¶
Unlike offline RL training, you’ll need to set up an experience replay buffer and an exploration strategy.
# experience replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# exploration strategy
# in this tutorial, epsilon-greedy policy with static epsilon=0.3
explorer = d3rlpy.online.explorers.ConstantEpsilonGreedy(0.3)
Start Training¶
Now, you have everything you need to start online RL training. Let’s put them together!
dqn.fit_online(
env,
buffer,
explorer,
n_steps=100000, # train for 100K steps
eval_env=eval_env,
n_steps_per_epoch=1000, # evaluation is performed every 1K steps
update_start_step=1000, # parameter update starts after 1K steps
)
Train with Stochastic Policy¶
If the algorithm uses a stochastic policy (e.g. SAC), you can train it without setting an exploration strategy.
sac = d3rlpy.algos.DiscreteSAC()
sac.fit_online(
env,
buffer,
n_steps=100000,
eval_env=eval_env,
n_steps_per_epoch=1000,
update_start_step=1000,
)
Finetuning¶
d3rlpy supports smooth transition from offline training to online training.
Prepare Dataset and Environment¶
In this tutorial, let’s use a built-in dataset for the CartPole-v0 environment.
import d3rlpy
# setup random CartPole-v0 dataset and environment
dataset, env = d3rlpy.datasets.get_dataset("cartpole-random")
Pretrain with Dataset¶
# setup algorithm
dqn = d3rlpy.algos.DQN()
# start offline training
dqn.fit(dataset, n_steps=100000)
Finetune with Environment¶
# setup experience replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# setup exploration strategy if necessary
explorer = d3rlpy.online.explorers.ConstantEpsilonGreedy(0.1)
# start finetuning
dqn.fit_online(env, buffer, explorer, n_steps=100000)
Finetune with Saved Policy¶
If you want to finetune the saved policy, that’s also easy to do with d3rlpy.
# setup algorithm
dqn = d3rlpy.algos.DQN()
# initialize neural networks before loading parameters
dqn.build_with_env(env)
# load pretrained policy
dqn.load_model("dqn_model.pt")
# start finetuning
dqn.fit_online(env, buffer, explorer, n_steps=100000)
Finetune with Different Algorithm¶
If you want to finetune a policy trained offline with a different online RL algorithm, you can do it out of the box.
# setup offline RL algorithm
cql = d3rlpy.algos.DiscreteCQL()
# train offline
cql.fit(dataset, n_steps=100000)
# transfer to DQN
dqn = d3rlpy.algos.DQN()
dqn.copy_q_function_from(cql)
# start finetuning
dqn.fit_online(env, buffer, explorer, n_steps=100000)
In actor-critic cases, you should also transfer the policy function.
# offline RL
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)
# transfer to SAC
sac = d3rlpy.algos.SAC()
sac.copy_q_function_from(cql)
sac.copy_policy_from(cql)
# online RL
sac.fit_online(env, buffer, n_steps=100000)
Offline Policy Selection¶
d3rlpy supports offline policy selection by training Fitted Q Evaluation (FQE), which is an offline on-policy RL algorithm. The use of FQE for offline policy selection was proposed by Paine et al. The concept is that FQE trains a Q-function with the trained policy in an on-policy manner so that the learned Q-function reflects the expected return of the trained policy. By using the Q-value estimates of FQE, the candidate trained policies can be ranked with only an offline dataset. Check Off-Policy Evaluation for more information.
Note
Offline policy selection with FQE has been confirmed to usually work well with discrete action-space policies. However, it seems to require some hyperparameter tuning for ranking continuous action-space policies. More techniques will be supported along with the advancement of this research domain.
Prepare trained policies¶
In this tutorial, let’s train DQN with the built-in CartPole-v0 dataset.
import d3rlpy
# setup replay CartPole-v0 dataset and environment
dataset, env = d3rlpy.datasets.get_dataset("cartpole-replay")
# setup algorithm
dqn = d3rlpy.algos.DQN()
# start offline training
dqn.fit(
dataset,
eval_episodes=dataset.episodes,
n_steps=100000,
n_steps_per_epoch=10000,
scorers={
"environment": d3rlpy.metrics.evaluate_on_environment(env),
},
)
Here is an example result of online evaluation.

Train FQE with the trained policies¶
Next, we train the FQE algorithm with the trained policies. Please note that we use initial_state_value_estimation_scorer and soft_opc_scorer proposed in Paine et al.
initial_state_value_estimation_scorer computes the mean action-value estimate at the initial states. Thus, if this value for a certain policy is bigger than for others, the policy is expected to obtain a higher episode return.
On the other hand, soft_opc_scorer computes the mean difference between the action-value estimates for the success episodes and the action-value estimates for all episodes. If this value for a certain policy is bigger than for others, the learned Q-function can clearly tell the difference between the success episodes and the others.
import d3rlpy
# setup the same dataset used in policy training
dataset, _ = d3rlpy.datasets.get_dataset("cartpole-replay")
# load pretrained policy
dqn = d3rlpy.algos.DQN()
dqn.build_with_dataset(dataset)
dqn.load_model("d3rlpy_logs/DQN_20220624191141/model_100000.pt")
# setup FQE algorithm
fqe = d3rlpy.ope.DiscreteFQE()
# start FQE training
fqe.fit(
dataset,
eval_episodes=dataset.episodes,
n_steps=10000,
n_steps_per_epoch=1000,
scorers={
"init_value": d3rlpy.metrics.initial_state_value_estimation_scorer,
"soft_opc": d3rlpy.metrics.soft_opc_scorer(180), # set 180 for success return threshold here
},
)
In this example, the policies from epoch 10, epoch 5 and epoch 1 (evaluation episode returns of 107.5, 200.0 and 17.5 respectively) are compared.
The first figure represents the init_value metrics during FQE training. As you can see, the scale of init_value is correlated with the ranking of evaluation episode returns.

The second figure represents the soft_opc metrics during FQE training. These curves also correlate with the ranking of evaluation episode returns.

Please note that there is usually no convergence in offline RL training due to the non-fixed bootstrapped target.
Use Distributional Q-Function¶
One of the unique features of d3rlpy is the ability to use distributional Q-functions with arbitrary d3rlpy algorithms. Distributional Q-functions are powerful and potentially capable of improving the performance of any algorithm. In this tutorial, you can learn how to use them. Check Q Functions for more information.
Specify by String Alias¶
The supported Q-functions can be specified by string alias. In this case, the default hyper-parameters will be used for the Q-function.
import d3rlpy
# default standard Q-function
sac = d3rlpy.algos.SAC(q_func_factory="mean")
# Quantile Regression Q-function
sac = d3rlpy.algos.SAC(q_func_factory="qr")
# Implicit Quantile Network Q-function
sac = d3rlpy.algos.SAC(q_func_factory="iqn")
Specify by instantiating QFunctionFactory¶
If you want to specify hyper-parameters, you need to instantiate a QFunctionFactory object.
# default standard Q-function
mean_q_function = d3rlpy.models.q_functions.MeanQFunctionFactory()
sac = d3rlpy.algos.SAC(q_func_factory=mean_q_function)
# Quantile Regression Q-function
qr_q_function = d3rlpy.models.q_functions.QRQFunctionFactory(n_quantiles=200)
sac = d3rlpy.algos.SAC(q_func_factory=qr_q_function)
# Implicit Quantile Network Q-function
iqn_q_function = d3rlpy.models.q_functions.IQNQFunctionFactory(
n_quantiles=32,
n_greedy_quantiles=64,
embed_size=64,
)
sac = d3rlpy.algos.SAC(q_func_factory=iqn_q_function)
Software Design¶
This page explains the software design of d3rlpy.
MDPDataset¶

MDPDataset is a dedicated dataset structure for offline RL. MDPDataset automatically structures the dataset into Episode and Transition objects. Episode represents a single episode that includes multiple Transition objects collected in that episode. Transition represents a single tuple of experience that consists of observation, action, reward and next_observation.
The advantage of this design is that you can split train and test datasets in an episode-wise manner. This feature is especially useful for offline RL training since holding out continuous sequences of data makes more sense than the non-sequential splits used in supervised training such as ImageNet classification.
From an engineering perspective, the underlying transition data is implemented in Cython, a Python-like language compiled to C, to reduce the computational cost of memory copies. This Cythonized implementation especially speeds up computing cumulative returns for multi-step learning and frame-stacking for pixel observations.
Please check Play with MDPDataset for the tutorial and MDPDataset for the API reference.
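The hierarchy described above can be inspected directly in code. A minimal sketch, reusing the built-in CartPole dataset from the tutorials:
# a minimal sketch of the Episode/Transition hierarchy and an episode-wise split
import d3rlpy
from sklearn.model_selection import train_test_split

dataset, _ = d3rlpy.datasets.get_cartpole()
episode = dataset.episodes[0]          # Episode object
transition = episode.transitions[0]    # Transition object
print(transition.observation, transition.action, transition.reward)
# hold out whole episodes rather than individual transitions
train_episodes, test_episodes = train_test_split(dataset.episodes, test_size=0.2)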
Algorithm¶

The implemented algorithms are designed as above. The algorithm objects have a hierarchical structure where Algorithm provides the high-level API (e.g. fit and fit_online) for users and AlgorithmImpl provides the low-level API (e.g. update_actor and update_critic) used by the high-level API.
The advantage of this design is to maximize the reusability of algorithm logic. For example, the delayed policy update proposed in TD3 reduces the update frequency of the policy function. This mechanism can be implemented by changing the frequency of update_actor method calls in the Algorithm layer without changing the underlying logic.
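To make this concrete, here is an illustrative sketch (not the actual d3rlpy source; the class and method names are hypothetical) of how a high-level layer could implement delayed policy updates on top of a low-level implementation:
# illustrative sketch only, not the actual d3rlpy internals
class DelayedActorUpdateAlgorithm:
    def __init__(self, impl, update_actor_interval=2):
        self._impl = impl  # low-level object exposing update_critic/update_actor
        self._update_actor_interval = update_actor_interval

    def update(self, batch, total_step):
        # the critic is updated every step
        metrics = {"critic_loss": self._impl.update_critic(batch)}
        # the actor update is delayed, as in TD3, purely by call frequency
        if total_step % self._update_actor_interval == 0:
            metrics["actor_loss"] = self._impl.update_actor(batch)
        return metrics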
The Algorithm class takes multiple components that configure training. See the API reference for each of the following components:
Algorithm
EncoderFactory
QFunctionFactory
OptimizerFactory
Scaler
ActionScaler
RewardScaler
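As a rough sketch of how these components fit together, the following combines configuration arguments that appear elsewhere in this documentation (the attribute paths assume the v1.x module layout used throughout these pages):
# a minimal sketch combining the configurable components listed above
import d3rlpy

dqn = d3rlpy.algos.DQN(
    encoder_factory=d3rlpy.models.encoders.VectorEncoderFactory(hidden_units=[256, 256]),
    q_func_factory=d3rlpy.models.q_functions.QRQFunctionFactory(n_quantiles=32),
    optim_factory=d3rlpy.models.optimizers.AdamFactory(weight_decay=1e-4),
    scaler="standard",  # observation Scaler specified by string alias
)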
API Reference¶
Algorithms¶
d3rlpy provides state-of-the-art offline deep reinforcement learning algorithms as well as online algorithms for the base implementations.
Continuous control algorithms¶
Behavior Cloning algorithm.
Deep Deterministic Policy Gradients algorithm.
Twin Delayed Deep Deterministic Policy Gradients algorithm.
Soft Actor-Critic algorithm.
Batch-Constrained Q-learning algorithm.
Bootstrapping Error Accumulation Reduction algorithm.
Critic Regularized Regression algorithm.
Conservative Q-Learning algorithm.
Advantage Weighted Actor-Critic algorithm.
Policy in Latent Action Space algorithm.
Policy in Latent Action Space algorithm with perturbation layer.
TD3+BC algorithm.
Implicit Q-Learning algorithm.
Model-based Offline Policy Optimization.
Conservative Offline Model-Based Optimization.
Random Policy for continuous control algorithm.
Discrete control algorithms¶
Behavior Cloning algorithm for discrete control.
Neural Fitted Q Iteration algorithm.
Deep Q-Network algorithm.
Double Deep Q-Network algorithm.
Soft Actor-Critic algorithm for discrete action-space.
Discrete version of Batch-Constrained Q-learning algorithm.
Discrete version of Conservative Q-Learning algorithm.
Random Policy for discrete control algorithm.
Q Functions¶
d3rlpy provides various Q functions, including state-of-the-art ones, which are used internally in algorithm objects. You can switch Q functions by passing the q_func_factory argument at algorithm initialization.
from d3rlpy.algos import CQL
cql = CQL(q_func_factory='qr') # use Quantile Regression Q function
You can also change hyperparameters.
from d3rlpy.models.q_functions import QRQFunctionFactory
q_func = QRQFunctionFactory(n_quantiles=32)
cql = CQL(q_func_factory=q_func)
The default Q function is the mean approximator, which estimates expected scalar action-values. However, recent advances in deep reinforcement learning have introduced a new type of action-value approximator called distributional Q functions. Unlike the mean approximator, distributional Q functions estimate the distribution of action-values. These distributional approaches have consistently shown much stronger performance than the mean approximator.
Here is a list of the available Q functions in ascending order of performance. Currently, as a trade-off between performance and computational complexity, higher performance comes with more expensive computational costs.
Standard Q function factory class.
Quantile Regression Q function factory class.
Implicit Quantile Network Q function factory class.
Fully parameterized Quantile Function Q function factory.
MDPDataset¶
d3rlpy provides a useful dataset structure for data-driven deep reinforcement learning. In supervised learning, the training script iterates over input data \(X\) and label data \(Y\). However, in reinforcement learning, mini-batches consist of sets of \((s_t, a_t, r_t, s_{t+1})\) and episode terminal flags. Converting a set of observations, actions, rewards and terminal flags into these tuples is tedious and requires some coding.
Therefore, d3rlpy provides the MDPDataset class, which enables you to handle reinforcement learning datasets without any effort.
import numpy as np
from d3rlpy.dataset import MDPDataset
# 1000 steps of observations with shape of (100,)
observations = np.random.random((1000, 100))
# 1000 steps of actions with shape of (4,)
actions = np.random.random((1000, 4))
# 1000 steps of rewards
rewards = np.random.random(1000)
# 1000 steps of terminal flags
terminals = np.random.randint(2, size=1000)
dataset = MDPDataset(observations, actions, rewards, terminals)
# automatically split into d3rlpy.dataset.Episode objects
dataset.episodes
# each episode is also split into d3rlpy.dataset.Transition objects
episode = dataset.episodes[0]
episode[0].observation
episode[0].action
episode[0].reward
episode[0].next_observation
episode[0].terminal
# d3rlpy.dataset.Transition object has pointers to previous and next
# transitions like linked list.
transition = episode[0]
while transition.next_transition:
transition = transition.next_transition
# save as HDF5
dataset.dump('dataset.h5')
# load from HDF5
new_dataset = MDPDataset.load('dataset.h5')
Please note that the observations, actions, rewards and terminals must be aligned at the same timesteps.
observations = [s1, s2, s3, ...]
actions = [a1, a2, a3, ...]
rewards = [r1, r2, r3, ...] # r1 = r(s1, a1)
terminals = [t1, t2, t3, ...] # t1 = t(s1, a1)
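For instance, a terminal flag of 1.0 marks the last step of an episode, so the following minimal illustration (assuming the splitting behavior described above) produces two episodes:
# a minimal illustration of how terminal flags split the data into episodes
import numpy as np
from d3rlpy.dataset import MDPDataset

observations = np.random.random((5, 3))
actions = np.random.random((5, 1))
rewards = np.random.random(5)
terminals = np.array([0.0, 0.0, 1.0, 0.0, 1.0])  # two episodes: steps 0-2 and 3-4

dataset = MDPDataset(observations, actions, rewards, terminals)
assert len(dataset.episodes) == 2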
Markov-Decision Process Dataset class.
Episode class.
Transition class.
Mini-batch of Transition objects.
Datasets¶
d3rlpy provides datasets for experimenting data-driven deep reinforcement learning algorithms.
Returns cartpole dataset and environment.
Returns pendulum dataset and environment.
Returns atari dataset and environment.
Returns atari dataset as a list of Transition objects and environment.
Returns d4rl dataset and environment.
Returns dataset and environment by guessing from name.
Preprocessing¶
Observation¶
d3rlpy provides several preprocessors tightly incorporated with algorithms. Each preprocessor is implemented as a PyTorch operation, which will be included in the model exported by the save_policy method.
from d3rlpy.algos import CQL
from d3rlpy.dataset import MDPDataset
dataset = MDPDataset(...)
# choose from ['pixel', 'min_max', 'standard'] or None
cql = CQL(scaler='standard')
# scaler is fitted from the given episodes
cql.fit(dataset.episodes)
# preprocessing is included in TorchScript
cql.save_policy('policy.pt')
# you don't need to take care of preprocessing at production
import torch
policy = torch.jit.load('policy.pt')
action = policy(unpreprocessed_x)
You can also initialize scalers by yourself.
from d3rlpy.preprocessing import StandardScaler
scaler = StandardScaler(mean=..., std=...)
cql = CQL(scaler=scaler)
Pixel normalization preprocessing.
Min-Max normalization preprocessing.
Standardization preprocessing.
Action¶
d3rlpy also provides a feature that preprocesses continuous actions. With this preprocessing, you don’t need to normalize actions in advance or implement normalization on the environment side.
from d3rlpy.algos import CQL
from d3rlpy.dataset import MDPDataset
dataset = MDPDataset(...)
# 'min_max' or None
cql = CQL(action_scaler='min_max')
# action scaler is fitted from the given episodes
cql.fit(dataset.episodes)
# postprocessing is included in TorchScript
cql.save_policy('policy.pt')
# you don't need to take care of postprocessing at production
policy = torch.jit.load('policy.pt')
action = policy(x)
You can also initialize scalers by yourself.
from d3rlpy.preprocessing import MinMaxActionScaler
action_scaler = MinMaxActionScaler(minimum=..., maximum=...)
cql = CQL(action_scaler=action_scaler)
Min-Max normalization action preprocessing.
Reward¶
d3rlpy also provides a feature that preprocesses rewards. With this preprocessing, you don’t need to normalize rewards in advance. Note that this preprocessor must be fitted with the dataset; afterwards, you can use it with online training.
from d3rlpy.algos import CQL
from d3rlpy.dataset import MDPDataset
dataset = MDPDataset(...)
# 'min_max', 'standard' or None
cql = CQL(reward_scaler='standard')
# reward scaler is fitted from the given episodes
cql.fit(dataset.episodes)
# reward scaler is also available at finetuning.
cql.fit_online(env)
You can also initialize scalers by yourself.
from d3rlpy.preprocessing import MinMaxRewardScaler
reward_scaler = MinMaxRewardScaler(minimum=..., maximum=...)
cql = CQL(reward_scaler=reward_scaler)
# ClipRewardScaler and MultiplyRewardScaler must be initialized manually
from d3rlpy.preprocessing import ClipRewardScaler
reward_scaler = ClipRewardScaler(-1.0, 1.0)
cql = CQL(reward_scaler=reward_scaler)
Min-Max reward normalization preprocessing.
Reward standardization preprocessing.
Reward clipping preprocessing.
Multiplication reward preprocessing.
Reward normalization preprocessing based on return scale.
Optimizers¶
d3rlpy provides OptimizerFactory, which gives you flexible control over optimizers. OptimizerFactory takes a PyTorch optimizer class and the arguments used to initialize it (see the PyTorch documentation for details).
from torch.optim import Adam
from d3rlpy.algos import DQN
from d3rlpy.models.optimizers import OptimizerFactory
# modify weight decay
optim_factory = OptimizerFactory(Adam, weight_decay=1e-4)
# set OptimizerFactory
dqn = DQN(optim_factory=optim_factory)
There are also convenient aliases.
from d3rlpy.models.optimizers import AdamFactory
# alias for Adam optimizer
optim_factory = AdamFactory(weight_decay=1e-4)
dqn = DQN(optim_factory=optim_factory)
A factory class that creates an optimizer object in a lazy way.
An alias for SGD optimizer.
An alias for Adam optimizer.
An alias for RMSprop optimizer.
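The aliases listed above can be swapped in the same way. A short sketch, assuming the RMSpropFactory alias forwards standard RMSprop keyword arguments such as alpha and eps:
# a sketch using the RMSprop alias (assumes standard RMSprop keyword arguments)
from d3rlpy.algos import DQN
from d3rlpy.models.optimizers import RMSpropFactory

optim_factory = RMSpropFactory(alpha=0.95, eps=1e-2)
dqn = DQN(optim_factory=optim_factory)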
Network Architectures¶
In d3rlpy, the neural network architecture is automatically selected based on the observation shape. If the observation is an image, the algorithm uses the Nature DQN-based encoder for each function. Otherwise, a standard MLP architecture that consists of two linear layers with 256 hidden units is used.
Furthermore, d3rlpy provides EncoderFactory, which gives you flexible control over these neural network architectures.
from d3rlpy.algos import DQN
from d3rlpy.models.encoders import VectorEncoderFactory
# encoder factory
encoder_factory = VectorEncoderFactory(hidden_units=[300, 400], activation='tanh')
# set EncoderFactory
dqn = DQN(encoder_factory=encoder_factory)
You can also build your own encoder factory.
import torch
import torch.nn as nn
from d3rlpy.models.encoders import EncoderFactory
# your own neural network
class CustomEncoder(nn.Module):
    def __init__(self, observation_shape, feature_size):
        super().__init__()
        self.feature_size = feature_size
        self.fc1 = nn.Linear(observation_shape[0], 64)
        self.fc2 = nn.Linear(64, feature_size)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        return h

    # THIS IS IMPORTANT!
    def get_feature_size(self):
        return self.feature_size
# your own encoder factory
class CustomEncoderFactory(EncoderFactory):
    TYPE = 'custom'  # this is necessary

    def __init__(self, feature_size):
        self.feature_size = feature_size

    def create(self, observation_shape):
        return CustomEncoder(observation_shape, self.feature_size)

    def get_params(self, deep=False):
        return {'feature_size': self.feature_size}
dqn = DQN(encoder_factory=CustomEncoderFactory(feature_size=64))
You can also define action-conditioned networks such as Q-functions for continuous control. create or create_with_action will be called depending on the function.
class CustomEncoderWithAction(nn.Module):
    def __init__(self, observation_shape, action_size, feature_size):
        super().__init__()
        self.feature_size = feature_size
        self.fc1 = nn.Linear(observation_shape[0] + action_size, 64)
        self.fc2 = nn.Linear(64, feature_size)

    def forward(self, x, action):  # action is also given
        h = torch.cat([x, action], dim=1)
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(h))
        return h

    def get_feature_size(self):
        return self.feature_size
class CustomEncoderFactory(EncoderFactory):
    TYPE = 'custom'  # this is necessary

    def __init__(self, feature_size):
        self.feature_size = feature_size

    def create(self, observation_shape):
        return CustomEncoder(observation_shape, self.feature_size)

    def create_with_action(self, observation_shape, action_size, discrete_action):
        return CustomEncoderWithAction(observation_shape, action_size, self.feature_size)

    def get_params(self, deep=False):
        return {'feature_size': self.feature_size}
from d3rlpy.algos import SAC
factory = CustomEncoderFactory(feature_size=64)
sac = SAC(actor_encoder_factory=factory, critic_encoder_factory=factory)
If you want the from_json method to load the algorithm configuration, including your encoder configuration, you need to register your encoder factory.
from d3rlpy.models.encoders import register_encoder_factory
# register your own encoder factory
register_encoder_factory(CustomEncoderFactory)
# load algorithm from json
dqn = DQN.from_json('<path-to-json>/params.json')
Once you register your encoder factory, you can specify it via its TYPE value.
dqn = DQN(encoder_factory='custom')
Default encoder factory class.
Pixel encoder factory class.
Vector encoder factory class.
DenseNet encoder factory class.
Metrics¶
d3rlpy provides scoring functions without compromising scikit-learn compatibility. You can evaluate many metrics with test episodes during training.
from d3rlpy.datasets import get_cartpole
from d3rlpy.algos import DQN
from d3rlpy.metrics.scorer import td_error_scorer
from d3rlpy.metrics.scorer import average_value_estimation_scorer
from d3rlpy.metrics.scorer import evaluate_on_environment
from sklearn.model_selection import train_test_split
dataset, env = get_cartpole()
train_episodes, test_episodes = train_test_split(dataset)
dqn = DQN()
dqn.fit(train_episodes,
eval_episodes=test_episodes,
scorers={
'td_error': td_error_scorer,
'value_scale': average_value_estimation_scorer,
'environment': evaluate_on_environment(env)
})
You can also use them with scikit-learn utilities.
from sklearn.model_selection import cross_validate
scores = cross_validate(dqn,
dataset,
scoring={
'td_error': td_error_scorer,
'environment': evaluate_on_environment(env)
})
Algorithms¶
Returns average TD error.
Returns average of discounted sum of advantage.
Returns average value estimation.
Returns standard deviation of value estimation.
Returns mean estimated action-values at the initial states.
Returns Soft Off-Policy Classification metrics.
Returns squared difference of actions between algorithm and dataset.
Returns percentage of identical actions between algorithm and dataset.
Returns scorer function of evaluation on environment.
Returns scorer function of action difference between algorithms.
Returns scorer function of action matches between algorithms.
Dynamics¶
Returns MSE of observation prediction.
Returns MSE of reward prediction.
Returns prediction variance of ensemble dynamics.
Off-Policy Evaluation¶
Off-policy evaluation is a method to estimate the performance of a trained policy using only offline datasets.
from d3rlpy.algos import CQL
from d3rlpy.datasets import get_pybullet
# prepare the trained algorithm
cql = CQL.from_json('<path-to-json>/params.json')
cql.load_model('<path-to-model>/model.pt')
# dataset to evaluate with
dataset, env = get_pybullet('hopper-bullet-mixed-v0')
from d3rlpy.ope import FQE
# off-policy evaluation algorithm
fqe = FQE(algo=cql)
# metrics to evaluate with
from d3rlpy.metrics.scorer import initial_state_value_estimation_scorer
from d3rlpy.metrics.scorer import soft_opc_scorer
# train estimators to evaluate the trained policy
fqe.fit(dataset.episodes,
eval_episodes=dataset.episodes,
scorers={
'init_value': initial_state_value_estimation_scorer,
'soft_opc': soft_opc_scorer(return_threshold=600)
})
The evaluation during fitting evaluates the trained policy.
For continuous control algorithms¶
Fitted Q Evaluation.
For discrete control algorithms¶
Fitted Q Evaluation for discrete action-space.
Save and Load¶
save_model and load_model¶
from d3rlpy.datasets import get_cartpole
from d3rlpy.algos import DQN
dataset, env = get_cartpole()
dqn = DQN()
dqn.fit(dataset.episodes, n_epochs=1)
# save entire model parameters.
dqn.save_model('model.pt')
The save_model method saves all parameters including optimizer states, which is useful when checking all the outputs or re-training from snapshots.
Once you save your model, you can load it via the load_model method. Before loading the model, the algorithm object must be initialized as follows.
dqn = DQN()
# initialize with dataset
dqn.build_with_dataset(dataset)
# initialize with environment
# dqn.build_with_env(env)
# load entire model parameters.
dqn.load_model('model.pt')
from_json¶
It is tedious to set the same hyperparameters to initialize algorithms when loading model parameters. In d3rlpy, params.json is saved at the beginning of the fit method, and it includes all hyperparameters within the algorithm object. You can recreate algorithm objects from params.json via the from_json method.
from d3rlpy.algos import DQN
dqn = DQN.from_json('d3rlpy_logs/<path-to-json>/params.json')
# ready to load
dqn.load_model('model.pt')
save_policy¶
The save_policy method saves only the greedy-policy computation graph as TorchScript or ONNX. When save_policy is called, the greedy-policy graph is constructed and traced via the torch.jit.trace function.
from d3rlpy.datasets import get_cartpole
from d3rlpy.algos import DQN
dataset, env = get_cartpole()
dqn = DQN()
dqn.fit(dataset.episodes, n_epochs=1)
# save greedy-policy as TorchScript
dqn.save_policy('policy.pt')
# save greedy-policy as ONNX
dqn.save_policy('policy.onnx')
TorchScript¶
TorchScript is an optimizable graph representation provided by PyTorch. The saved policy can be loaded without any dependencies except PyTorch.
import torch
# load greedy-policy only with PyTorch
policy = torch.jit.load('policy.pt')
# returns greedy actions
actions = policy(torch.rand(32, 6))
This is especially useful when deploying trained models to production. The computation can be faster and you don’t need to install d3rlpy. Moreover, a TorchScript model can be easily loaded even with C++, which will empower your robotics and embedded system projects.
#include <torch/script.h>

int main(int argc, char* argv[]) {
  torch::jit::script::Module module;
  try {
    module = torch::jit::load("policy.pt");
  } catch (const c10::Error& e) {
    return -1;
  }
  return 0;
}
You can get more information about TorchScript here.
ONNX¶
ONNX is an open format built to represent machine learning models. This is also useful when deploying the trained model to production with various programming languages including Python, C++, JavaScript and more.
The following example is written with onnxruntime.
import numpy as np
import onnxruntime as ort
# load ONNX policy via onnxruntime
ort_session = ort.InferenceSession('policy.onnx')
# observation
observation = np.random.rand(1, 6).astype(np.float32)
# returns greedy action
action = ort_session.run(None, {'input_0': observation})[0]
You can get more information about ONNX here.
Logging¶
d3rlpy algorithms automatically save model parameters and metrics under the d3rlpy_logs directory.
from d3rlpy.datasets import get_cartpole
from d3rlpy.algos import DQN
dataset, env = get_cartpole()
dqn = DQN()
# metrics and parameters are saved in `d3rlpy_logs/DQN_YYYYMMDDHHmmss`
dqn.fit(dataset.episodes)
You can specify the directory.
# the directory will be `custom_logs/custom_YYYYMMDDHHmmss`
dqn.fit(dataset.episodes, logdir='custom_logs', experiment_name='custom')
If you want to disable all logging, you can pass save_metrics=False.
dqn.fit(dataset.episodes, save_metrics=False)
TensorBoard¶
The same information can also be automatically saved for TensorBoard under the specified directory so that you can interactively visualize training metrics.
$ pip install tensorboard
$ tensorboard --logdir runs
These TensorBoard logs can be enabled by passing tensorboard_dir=/path/to/log_dir.
# saving tensorboard data is disabled by default
dqn.fit(dataset.episodes, tensorboard_dir='runs')
Online Training¶
Standard Training¶
d3rlpy provides not only offline training, but also online training utilities. Despite being designed for offline training algorithms, d3rlpy is flexible enough to be trained in an online manner with a few more utilities.
import gym
from d3rlpy.algos import DQN
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import LinearDecayEpsilonGreedy
# setup environment
env = gym.make('CartPole-v0')
eval_env = gym.make('CartPole-v0')
# setup algorithm
dqn = DQN(batch_size=32,
learning_rate=2.5e-4,
target_update_interval=100,
use_gpu=True)
# setup replay buffer
buffer = ReplayBuffer(maxlen=1000000, env=env)
# setup explorers
explorer = LinearDecayEpsilonGreedy(start_epsilon=1.0,
end_epsilon=0.1,
duration=10000)
# start training
dqn.fit_online(env,
buffer,
explorer=explorer, # you don't need this with probabilistic policy algorithms
eval_env=eval_env,
n_steps=30000, # the number of total steps to train.
n_steps_per_epoch=1000,
update_interval=10) # update parameters every 10 steps.
Replay Buffer¶
Standard Replay Buffer.
Explorers¶
\(\epsilon\)-greedy explorer with constant \(\epsilon\).
\(\epsilon\)-greedy explorer with linear decay schedule.
Normal noise explorer.
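For continuous control algorithms such as DDPG, the normal noise explorer listed above can replace epsilon-greedy exploration. A minimal sketch; the mean/std constructor arguments are an assumption:
# a sketch of the normal noise explorer for continuous control
# (the mean/std constructor arguments are assumed)
import gym
from d3rlpy.algos import DDPG
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import NormalNoise

env = gym.make('Pendulum-v0')
ddpg = DDPG()
buffer = ReplayBuffer(maxlen=100000, env=env)
explorer = NormalNoise(mean=0.0, std=0.1)
ddpg.fit_online(env, buffer, explorer, n_steps=100000)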
(experimental) Model-based Algorithms¶
d3rlpy provides model-based reinforcement learning algorithms.
from d3rlpy.datasets import get_pendulum
from d3rlpy.dynamics import ProbabilisticEnsembleDynamics
from d3rlpy.metrics.scorer import dynamics_observation_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_reward_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_prediction_variance_scorer
from sklearn.model_selection import train_test_split
dataset, _ = get_pendulum()
train_episodes, test_episodes = train_test_split(dataset)
dynamics = ProbabilisticEnsembleDynamics(learning_rate=1e-4, use_gpu=True)
# same as algorithms
dynamics.fit(train_episodes,
eval_episodes=test_episodes,
n_epochs=100,
scorers={
'observation_error': dynamics_observation_prediction_error_scorer,
'reward_error': dynamics_reward_prediction_error_scorer,
'variance': dynamics_prediction_variance_scorer,
})
Pick the best model and pass it to the model-based RL algorithm.
from d3rlpy.algos import MOPO
# load trained dynamics model
dynamics = ProbabilisticEnsembleDynamics.from_json('<path-to-params.json>/params.json')
dynamics.load_model('<path-to-model>/model_xx.pt')
# give the dynamics model to MOPO as the generator argument
mopo = MOPO(dynamics=dynamics)
Dynamics Model¶
Probabilistic ensemble dynamics.
Stable-Baselines3 Wrapper¶
d3rlpy provides a minimal wrapper to use Stable-Baselines3 (SB3) features, like utility helpers or SB3 algorithms to create datasets.
Note
This wrapper is far from complete and only provides a minimal integration with SB3.
Convert SB3 replay buffer to d3rlpy dataset¶
A replay buffer from Stable-Baselines3 can be easily converted to a d3rlpy.dataset.MDPDataset using the to_mdp_dataset() utility function.
import stable_baselines3 as sb3
from d3rlpy.algos import CQL
from d3rlpy.wrappers.sb3 import to_mdp_dataset
# Train an off-policy agent with SB3
model = sb3.SAC("MlpPolicy", "Pendulum-v0", learning_rate=1e-3, verbose=1)
model.learn(6000)
# Convert to d3rlpy MDPDataset
dataset = to_mdp_dataset(model.replay_buffer)
# The dataset can then be used to train a d3rlpy model
offline_model = CQL()
offline_model.fit(dataset.episodes, n_epochs=100)
Convert d3rlpy to use SB3 helpers¶
An agent from d3rlpy can be converted to use the SB3 interface (notably following the interface of SB3's predict()). This allows you to use SB3 helpers like evaluate_policy.
import gym
from stable_baselines3.common.evaluation import evaluate_policy
from d3rlpy.algos import AWAC
from d3rlpy.wrappers.sb3 import SB3Wrapper
env = gym.make("Pendulum-v0")
# Define an offline RL model
offline_model = AWAC()
# Train it using for instance a dataset created by a SB3 agent (see above)
offline_model.fit(dataset.episodes, n_epochs=10)
# Use SB3 wrapper (convert `predict()` method to follow SB3 API)
# to have access to SB3 helpers
# d3rlpy model is accessible via `wrapped_model.algo`
wrapped_model = SB3Wrapper(offline_model)
observation = env.reset()
# We can now use SB3's predict style
# it returns the action and the hidden states (for RNN policies)
action, _ = wrapped_model.predict([observation], deterministic=True)
# The following is equivalent to offline_model.sample_action(obs)
action, _ = wrapped_model.predict([observation], deterministic=False)
# Evaluate the trained model using SB3 helper
mean_reward, std_reward = evaluate_policy(wrapped_model, env)
print(f"mean_reward={mean_reward} +/- {std_reward}")
# Call methods from the wrapped d3rlpy model
wrapped_model.sample_action([observation])
wrapped_model.fit(dataset.episodes, n_epochs=10)
# Set attributes
wrapped_model.n_epochs = 2
# wrapped_model.n_epochs points to d3rlpy wrapped_model.algo.n_epochs
assert wrapped_model.algo.n_epochs == 2
Command Line Interface¶
d3rlpy provides a convenient CLI tool.
plot¶
Plot the saved metrics by specifying paths:
$ d3rlpy plot <path> [<path>...]
Available options:
moving average window.
use iterations on x-axis.
show maximum value.
label in legend.
limit on x-axis (tuple).
limit on y-axis (tuple).
title of the plot.
flag to save the plot as an image.
example:
$ d3rlpy plot d3rlpy_logs/CQL_20201224224314/environment.csv

plot-all¶
Plot the all metrics saved in the directory:
$ d3rlpy plot-all <path>
example:
$ d3rlpy plot-all d3rlpy_logs/CQL_20201224224314

export¶
Export the saved model to an inference format (onnx or torchscript):
$ d3rlpy export <path>
Available options:
model format (torchscript, onnx).
explicitly specify params.json.
output path.
example:
$ d3rlpy export d3rlpy_logs/CQL_20201224224314/model_100.pt
record¶
Record evaluation episodes as videos with the saved model:
$ d3rlpy record <path> --env-id <environment id>
Available options:
Gym environment id.
arbitrary Python code to define environment to evaluate.
output directory.
explicitly specify params.json.
the number of episodes to record.
video frame rate.
interval at which images are recorded.
\(\epsilon\)-greedy evaluation.
example:
# record simple environment
$ d3rlpy record d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# record wrapped environment
$ d3rlpy record d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; from d3rlpy.envs import Atari; env = Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
play¶
Run evaluation episodes with rendering:
$ d3rlpy play <path> --env-id <environment id>
Available options:
Gym environment id.
arbitrary Python code to define environment to evaluate.
explicitly specify params.json.
the number of episodes to run.
example:
# play simple environment
$ d3rlpy play d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# play wrapped environment
$ d3rlpy play d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; from d3rlpy.envs import Atari; env = Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
Installation¶
Recommended Platforms¶
d3rlpy supports Linux, macOS and also Windows.
Install d3rlpy¶
Install via PyPI¶
pip is the recommended way to install d3rlpy:
$ pip install d3rlpy
Install via Anaconda¶
d3rlpy is also available on conda-forge:
$ conda install -c conda-forge d3rlpy
Install via Docker¶
d3rlpy is also available on Docker Hub:
$ docker run -it --gpus all --name d3rlpy takuseno/d3rlpy:latest bash
Install from source¶
You can also install via GitHub repository:
$ git clone https://github.com/takuseno/d3rlpy
$ cd d3rlpy
$ pip install Cython numpy # if you have not installed them.
$ pip install -e .
Tips¶
Reproducibility¶
Reproducibility is one of the most important things when doing research. Here is a simple example with d3rlpy.
import d3rlpy
import gym
# set random seeds in random module, numpy module and PyTorch module.
d3rlpy.seed(313)
# set environment seed
env = gym.make('Hopper-v2')
env.seed(313)
Learning from image observation¶
d3rlpy supports both vector observations and image observations. There are several things you need to take care of if you want to train RL agents from image observations.
import numpy as np
from d3rlpy.dataset import MDPDataset

# observations MUST be uint8 arrays with channel-first images
observations = np.random.randint(256, size=(100000, 1, 84, 84), dtype=np.uint8)
actions = np.random.randint(4, size=100000)
rewards = np.random.random(100000)
terminals = np.random.randint(2, size=100000)
dataset = MDPDataset(observations, actions, rewards, terminals)
from d3rlpy.algos import DQN
dqn = DQN(scaler='pixel', # you MUST set pixel scaler
n_frames=4) # you CAN set the number of frames to stack
Improve performance beyond the original paper¶
d3rlpy provides many options that you can use to improve performance potentially beyond the original paper. All the options are powerful, but the best combinations and hyperparameters are always dependent on the tasks.
from d3rlpy.models.encoders import DefaultEncoderFactory
from d3rlpy.models.q_functions import QRQFunctionFactory
from d3rlpy.algos import DQN, SAC
# use batch normalization
# this seems to improve performance with discrete action-spaces
encoder = DefaultEncoderFactory(use_batch_norm=True)
dqn = DQN(encoder_factory=encoder,
n_critics=5, # Q function ensemble size
n_steps=5, # N-step TD backup
q_func_factory='qr') # use distributional Q function
# use dropout
# this will dramatically improve performance
encoder = DefaultEncoderFactory(dropout_rate=0.2)
sac = SAC(actor_encoder_factory=encoder)
Paper Reproductions¶
For the experiment code, please take a look at reproductions directory.
All the experimental results are available in d3rlpy-benchmarks repository.
License¶
MIT License
Copyright (c) 2021 Takuma Seno
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.