Algorithms

d3rlpy provides state-of-the-art offline deep reinforcement learning algorithms as well as online algorithms built on the same base implementations.

Each algorithm provides its config class, and you can instantiate the algorithm by specifying a device to use.

import d3rlpy

# instantiate algorithm with CPU
sac = d3rlpy.algos.SACConfig().create(device="cpu:0")
# instantiate algorithm with GPU
sac = d3rlpy.algos.SACConfig().create(device="cuda:0")
# instantiate algorithm with the 2nd GPU
sac = d3rlpy.algos.SACConfig().create(device="cuda:1")

You can also check advanced use cases in the examples directory.
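
Hyperparameters are configured through the config class before calling create. A minimal sketch, using only parameters documented in this reference (batch_size and gamma of SACConfig):

# pass hyperparameters to the config, then create the algorithm
sac = d3rlpy.algos.SACConfig(batch_size=256, gamma=0.99).create(device="cpu:0")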

Base

LearnableBase

The base class of all algorithms.

class d3rlpy.base.LearnableBase(config, device, impl=None)[source]

Bases: Generic[d3rlpy.base.TImpl_co, d3rlpy.base.TConfig_co]

property action_scaler: Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]

Preprocessing action scaler.

Returns

preprocessing action scaler.

Return type

Optional[ActionScaler]

property action_size: Optional[int]

Action size.

Returns

action size.

Return type

Optional[int]

property batch_size: int

Batch size to train.

Returns

batch size.

Return type

int

build_with_dataset(dataset)[source]

Instantiate implementation object with ReplayBuffer object.

Parameters

dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – dataset.

Return type

None
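
A minimal sketch of manually building the implementation from a dataset; d3rlpy.datasets.get_cartpole is assumed to be available in your d3rlpy version:

import d3rlpy

# the dataset provides the observation shape and action size
dataset, env = d3rlpy.datasets.get_cartpole()

dqn = d3rlpy.algos.DQNConfig().create(device="cpu:0")
dqn.build_with_dataset(dataset)  # implementation object is now instantiated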

build_with_env(env)[source]

Instantiate implementation object with OpenAI Gym object.

Parameters

env (Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]) – gym-like environment.

Return type

None
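
A minimal sketch, assuming a Gym environment such as Pendulum-v1 is installed:

import gym
import d3rlpy

env = gym.make("Pendulum-v1")

sac = d3rlpy.algos.SACConfig().create(device="cpu:0")
sac.build_with_env(env)  # observation shape and action size are inferred from env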

property config: d3rlpy.base.TConfig_co

Config.

Returns

config.

Return type

LearnableConfig

create_impl(observation_shape, action_size)[source]

Instantiate implementation objects with the dataset shapes.

This method is used internally when the fit method is called.

Parameters
  • observation_shape (Union[Sequence[int], Sequence[Sequence[int]]]) – observation shape.

  • action_size (int) – dimension of action-space.

Return type

None

classmethod from_json(fname, device=False)[source]

Construct algorithm from params.json file.

from d3rlpy.algos import CQL

cql = CQL.from_json("<path-to-json>", device='cuda:0')
Parameters
  • fname (str) – path to params.json

  • device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

typing_extensions.Self

property gamma: float

Discount factor.

Returns

discount factor.

Return type

float

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

property grad_step: int

Total gradient step counter.

This value will keep counting after fit and fit_online methods finish.

Returns

total gradient step counter.

property impl: Optional[d3rlpy.base.TImpl_co]

Implementation object.

Returns

implementation object.

Return type

Optional[ImplBase]

load_model(fname)[source]

Load neural network parameters.

algo.load_model('model.pt')
Parameters

fname (str) – source file path.

Return type

None

property observation_scaler: Optional[d3rlpy.preprocessing.observation_scalers.ObservationScaler]

Preprocessing observation scaler.

Returns

preprocessing observation scaler.

Return type

Optional[ObservationScaler]

property observation_shape: Optional[Union[Sequence[int], Sequence[Sequence[int]]]]

Observation shape.

Returns

observation shape.

Return type

Optional[Sequence[int]]

property reward_scaler: Optional[d3rlpy.preprocessing.reward_scalers.RewardScaler]

Preprocessing reward scaler.

Returns

preprocessing reward scaler.

Return type

Optional[RewardScaler]

save(fname)[source]

Saves paired data of neural network parameters and serialized config.

algo.save('model.d3')

# reconstruct everything
algo2 = d3rlpy.load_learnable("model.d3", device="cuda:0")
Parameters

fname (str) – destination file path.

Return type

None

save_model(fname)[source]

Saves neural network parameters.

algo.save_model('model.pt')
Parameters

fname (str) – destination file path.

Return type

None

set_grad_step(grad_step)[source]

Set total gradient step counter.

This method can be used to restart the middle of training with an arbitrary gradient step counter, which has effects on periodic functions such as the target update.

Parameters

grad_step (int) – total gradient step counter.

Return type

None
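
A minimal sketch of resuming training so that periodic schedules such as target updates continue from the previous counter; the step value 500000 is illustrative and algo is assumed to be built already (e.g. via build_with_dataset):

algo.load_model("model.pt")
algo.set_grad_step(500000)
assert algo.grad_step == 500000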

Q-learning

QLearningAlgoBase

The base class of Q-learning algorithms.

class d3rlpy.algos.QLearningAlgoBase(config, device, impl=None)[source]

Bases: Generic[d3rlpy.algos.qlearning.base.TQLearningImpl, d3rlpy.algos.qlearning.base.TQLearningConfig], d3rlpy.base.LearnableBase[d3rlpy.algos.qlearning.base.TQLearningImpl, d3rlpy.algos.qlearning.base.TQLearningConfig]

collect(env, buffer=None, explorer=None, deterministic=False, n_steps=1000000, show_progress=True)[source]

Collects data via interaction with environment.

If buffer is not given, a ReplayBuffer will be created internally.

Parameters
  • env (Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]) – Gym-like environment.

  • buffer (Optional[d3rlpy.dataset.replay_buffer.ReplayBuffer]) – Replay buffer.

  • explorer (Optional[d3rlpy.algos.qlearning.explorers.Explorer]) – Action explorer.

  • deterministic (bool) – Flag to collect data with the greedy policy.

  • n_steps (int) – Number of total steps to collect.

  • show_progress (bool) – Flag to show progress bar for iterations.

Returns

Replay buffer with the collected data.

Return type

d3rlpy.dataset.replay_buffer.ReplayBuffer
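
A minimal sketch of data collection with the current policy, assuming a continuous-control Gym environment:

import gym
import d3rlpy

env = gym.make("Pendulum-v1")
sac = d3rlpy.algos.SACConfig().create(device="cpu:0")

# collect 10000 steps; a ReplayBuffer is created internally because buffer=None
buffer = sac.collect(env, n_steps=10000)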

copy_policy_from(algo)[source]

Copies policy parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQLConfig().create()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SACConfig().create()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

copy_policy_optim_from(algo)[source]

Copies policy optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQLConfig().create()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SACConfig().create()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_optim_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

copy_q_function_from(algo)[source]

Copies Q-function parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQLConfig().create()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SACConfig().create()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

copy_q_function_optim_from(algo)[source]

Copies Q-function optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQLConfig().create()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SACConfig().create()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_optim_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None)[source]

Trains with given dataset.

algo.fit(dataset, n_steps=1000000)
Parameters
  • dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – ReplayBuffer object.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, d3rlpy.metrics.evaluators.EvaluatorProtocol]]) – Dictionary of evaluators.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step) and is called at every step.

  • epoch_callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step) and is called at the end of every epoch.

Returns

List of result tuples (epoch, metrics) per epoch.

Return type

List[Tuple[int, Dict[str, float]]]
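
A minimal sketch of offline training with an environment-based evaluator; dataset and env are assumed to exist, and d3rlpy.metrics.EnvironmentEvaluator is assumed to be available in your d3rlpy version:

results = algo.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)

# results is a list of (epoch, metrics) tuples
for epoch, metrics in results:
    print(epoch, metrics["environment"])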

fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, random_steps=0, eval_env=None, eval_epsilon=0.0, save_interval=1, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, callback=None)[source]

Start training loop of online deep reinforcement learning.

Parameters
  • env (Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]) – Gym-like environment.

  • buffer (Optional[d3rlpy.dataset.replay_buffer.ReplayBuffer]) – Replay buffer.

  • explorer (Optional[d3rlpy.algos.qlearning.explorers.Explorer]) – Action explorer.

  • n_steps (int) – Number of total steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch.

  • update_interval (int) – Number of steps per update.

  • update_start_step (int) – Steps before starting updates.

  • random_steps (int) – Steps for the initial random exploration.

  • eval_env (Optional[Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]]) – Gym-like environment. If None, evaluation is skipped.

  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.

  • save_interval (int) – Number of epochs before saving models.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.

  • with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step) and is called at the end of each epoch.

Return type

None
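
A minimal sketch of online training with an explicit replay buffer and explorer; create_fifo_replay_buffer and ConstantEpsilonGreedy are assumed to be available in your d3rlpy version, and env is assumed to exist:

buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)
explorer = d3rlpy.algos.ConstantEpsilonGreedy(epsilon=0.1)

algo.fit_online(
    env,
    buffer=buffer,
    explorer=explorer,
    n_steps=100000,
    n_steps_per_epoch=1000,
)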

fitter(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None)[source]

Iterate over epochs to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.

for epoch, metrics in algo.fitter(dataset, n_steps=1000000):
    my_plot(metrics)
    algo.save_model(my_path)
Parameters
  • dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – Offline dataset to train.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, d3rlpy.metrics.evaluators.EvaluatorProtocol]]) – Dictionary of evaluators.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step) and is called at every step.

  • epoch_callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step) and is called at the end of every epoch.

Returns

Iterator yielding current epoch and metrics dict.

Return type

Generator[Tuple[int, Dict[str, float]], None, None]

predict(x)[source]

Returns greedy actions.

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters

x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations

Returns

Greedy actions

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

predict_value(x, action)[source]

Returns predicted action-values.

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
Parameters
  • x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations.

  • action (numpy.ndarray[Any, numpy.dtype[Any]]) – Actions.

Returns

Predicted action-values

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

reset_optimizer_states()[source]

Resets optimizer states.

This is especially useful when fine-tuning policies with freshly initialized optimizer states.

Return type

None

sample_action(x)[source]

Returns sampled actions.

The sampled actions are identical to the output of predict method if the policy is deterministic.

Parameters

x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations.

Returns

Sampled actions.

Return type

numpy.ndarray[Any, numpy.dtype[Any]]
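
A minimal sketch contrasting sample_action with predict; the observation shape (10,) is illustrative:

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

greedy_actions = algo.predict(x)
sampled_actions = algo.sample_action(x)  # stochastic unless the policy is deterministic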

save_policy(fname)[source]

Save the greedy-policy computational graph as TorchScript or ONNX.

The format is automatically detected from the file name.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx')

The artifacts saved with this method will work without d3rlpy. This method is especially useful to deploy the learned policy to production environments or embedded systems.

See also

Parameters

fname (str) – Destination file path.

Return type

None
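
A minimal sketch of using the exported TorchScript policy without d3rlpy; the observation shape (10,) is illustrative:

import torch

# load the TorchScript artifact produced by save_policy
policy = torch.jit.load("policy.pt")
with torch.no_grad():
    action = policy(torch.rand(1, 10))  # a single observation with shape of (10,)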

update(batch)[source]

Update parameters with a mini-batch of data.

Parameters

batch (d3rlpy.dataset.mini_batch.TransitionMiniBatch) – Mini-batch data.

Returns

Dictionary of metrics.

Return type

Dict[str, float]
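
A minimal sketch of a manual training step; sample_transition_batch is assumed to be available on ReplayBuffer in your d3rlpy version, and dataset is assumed to exist:

# sample a mini-batch and run a single gradient update
batch = dataset.sample_transition_batch(algo.batch_size)
metrics = algo.update(batch)
print(metrics)  # dictionary of loss values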

BC

class d3rlpy.algos.BCConfig(batch_size=100, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=0.001, policy_type='deterministic', optim_factory=<factory>, encoder_factory=<factory>)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Behavior Cloning algorithm.

Behavior Cloning (BC) imitates actions in the dataset via a supervised learning approach. Since BC only imitates action distributions, the performance will be close to the mean of the dataset even though BC often works better than online RL algorithms.

\[L(\theta) = \mathbb{E}_{a_t, s_t \sim D} [(a_t - \pi_\theta(s_t))^2]\]
Parameters
  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • batch_size (int) – Mini-batch size.

  • policy_type (str) – the policy type. Available options are ['deterministic', 'stochastic'].

  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • gamma (float) –

  • reward_scaler (Optional[d3rlpy.preprocessing.reward_scalers.RewardScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.bc.BC

class d3rlpy.algos.BC(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.bc_impl.BCBaseImpl, d3rlpy.algos.qlearning.bc.BCConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
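
A minimal sketch of training BC on an offline dataset, using only the parameters listed above; dataset is assumed to exist:

bc = d3rlpy.algos.BCConfig(
    learning_rate=1e-3,
    policy_type="deterministic",
).create(device="cpu:0")

bc.fit(dataset, n_steps=10000)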

DiscreteBC

class d3rlpy.algos.DiscreteBCConfig(batch_size=100, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=0.001, optim_factory=<factory>, encoder_factory=<factory>, beta=0.5)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Behavior Cloning algorithm for discrete control.

Behavior Cloning (BC) imitates actions in the dataset via a supervised learning approach. Since BC only imitates action distributions, the performance will be close to the mean of the dataset even though BC often works better than online RL algorithms.

\[L(\theta) = \mathbb{E}_{a_t, s_t \sim D} [-\sum_a p(a|s_t) \log \pi_\theta(a|s_t)]\]

where \(p(a|s_t)\) is implemented as a one-hot vector.

Parameters
  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • batch_size (int) – Mini-batch size.

  • beta (float) – Regularization factor.

  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • gamma (float) –

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

  • reward_scaler (Optional[d3rlpy.preprocessing.reward_scalers.RewardScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.bc.DiscreteBC

class d3rlpy.algos.DiscreteBC(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.bc_impl.BCBaseImpl, d3rlpy.algos.qlearning.bc.DiscreteBCConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

NFQ

class d3rlpy.algos.NFQConfig(batch_size=32, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=6.25e-05, optim_factory=<factory>, encoder_factory=<factory>, q_func_factory=<factory>, n_critics=1)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Neural Fitted Q Iteration algorithm.

This NFQ implementation in d3rlpy is practically the same as DQN, but it excludes the target network mechanism.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma \max_a Q_\theta(s_{t+1}, a) - Q_\theta(s_t, a_t))^2]\]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • n_critics (int) – Number of Q functions for ensemble.

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.nfq.NFQ

class d3rlpy.algos.NFQ(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.dqn_impl.DQNImpl, d3rlpy.algos.qlearning.nfq.NFQConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

DQN

class d3rlpy.algos.DQNConfig(batch_size=32, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=6.25e-05, optim_factory=<factory>, encoder_factory=<factory>, q_func_factory=<factory>, n_critics=1, target_update_interval=8000)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Deep Q-Network algorithm.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma \max_a Q_{\theta'}(s_{t+1}, a) - Q_\theta(s_t, a_t))^2]\]

where \(\theta'\) is the target network parameter. The target network parameter is synchronized every target_update_interval iterations.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • n_critics (int) – Number of Q functions for ensemble.

  • target_update_interval (int) – Interval to update the target network.

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.dqn.DQN

class d3rlpy.algos.DQN(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.dqn_impl.DQNImpl, d3rlpy.algos.qlearning.dqn.DQNConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
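
A minimal sketch of training DQN on a discrete-control dataset, using only the parameters listed above; dataset is assumed to exist:

dqn = d3rlpy.algos.DQNConfig(
    learning_rate=6.25e-5,
    batch_size=32,
    target_update_interval=8000,
).create(device="cuda:0")

dqn.fit(dataset, n_steps=100000)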

DoubleDQN

class d3rlpy.algos.DoubleDQNConfig(batch_size=32, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=6.25e-05, optim_factory=<factory>, encoder_factory=<factory>, q_func_factory=<factory>, n_critics=1, target_update_interval=8000)[source]

Bases: d3rlpy.algos.qlearning.dqn.DQNConfig

Config of Double Deep Q-Network algorithm.

The difference from DQN is that the action is taken from the current Q function instead of the target Q function. This modification significantly decreases overestimation bias of TD learning.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma Q_{\theta'}(s_{t+1}, \text{argmax}_a Q_\theta(s_{t+1}, a)) - Q_\theta(s_t, a_t))^2]\]

where \(\theta'\) is the target network parameter. The target network parameter is synchronized every target_update_interval iterations.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • n_critics (int) – Number of Q functions.

  • target_update_interval (int) – Interval to synchronize the target network.

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.dqn.DoubleDQN

class d3rlpy.algos.DoubleDQN(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.dqn_impl.DQNImpl, d3rlpy.algos.qlearning.dqn.DQNConfig]

DDPG

class d3rlpy.algos.DDPGConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=1)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Deep Deterministic Policy Gradients algorithm.

DDPG is an actor-critic algorithm that trains a Q function parametrized with \(\theta\) and a policy function parametrized with \(\phi\).

\[L(\theta) = \mathbb{E}_{s_t,\, a_t,\, r_{t+1},\, s_{t+1} \sim D} \Big[(r_{t+1} + \gamma Q_{\theta'}\big(s_{t+1}, \pi_{\phi'}(s_{t+1})) - Q_\theta(s_t, a_t)\big)^2\Big]\]
\[J(\phi) = \mathbb{E}_{s_t \sim D} \Big[Q_\theta\big(s_t, \pi_\phi(s_t)\big)\Big]\]

where \(\theta'\) and \(\phi'\) are the target network parameters. These target network parameters are updated every iteration.

\[ \begin{align}\begin{aligned}\theta' \gets \tau \theta + (1 - \tau) \theta'\\\phi' \gets \tau \phi + (1 - \tau) \phi'\end{aligned}\end{align} \]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q function.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.ddpg.DDPG

class d3rlpy.algos.DDPG(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.ddpg_impl.DDPGImpl, d3rlpy.algos.qlearning.ddpg.DDPGConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

TD3

class d3rlpy.algos.TD3Config(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, target_smoothing_sigma=0.2, target_smoothing_clip=0.5, update_actor_interval=2)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Twin Delayed Deep Deterministic Policy Gradients algorithm.

TD3 is an improved DDPG-based algorithm. Major differences from DDPG are as follows.

  • TD3 has twin Q functions to reduce overestimation bias at TD learning. The number of Q functions can be designated by n_critics.

  • TD3 adds noise to target value estimation to avoid overfitting with the deterministic policy.

  • TD3 updates the policy function after several Q function updates in order to reduce variance of action-value estimation. The interval of the policy function update can be designated by update_actor_interval.

\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma \min_j Q_{\theta_j'}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) - Q_{\theta_i}(s_t, a_t))^2]\]
\[J(\phi) = \mathbb{E}_{s_t \sim D} [\min_i Q_{\theta_i}(s_t, \pi_\phi(s_t))]\]

where \(\epsilon \sim \text{clip}(N(0, \sigma), -c, c)\).

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for a policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • target_smoothing_sigma (float) – Standard deviation for target noise.

  • target_smoothing_clip (float) – Clipping range for target noise.

  • update_actor_interval (int) – Interval to update policy function described as delayed policy update in the paper.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.td3.TD3

class d3rlpy.algos.TD3(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.td3_impl.TD3Impl, d3rlpy.algos.qlearning.td3.TD3Config]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

SAC

class d3rlpy.algos.SACConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, temp_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, temp_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, initial_temperature=1.0)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Soft Actor-Critic algorithm.

SAC is a DDPG-based maximum entropy RL algorithm, which produces state-of-the-art performance in online RL settings. SAC leverages the twin Q functions proposed in TD3. Additionally, the delayed policy update from TD3 is also implemented here, although it is not described in the original paper.

\[L(\theta_i) = \mathbb{E}_{s_t,\, a_t,\, r_{t+1},\, s_{t+1} \sim D,\, a_{t+1} \sim \pi_\phi(\cdot|s_{t+1})} \Big[ \big(y - Q_{\theta_i}(s_t, a_t)\big)^2\Big]\]
\[y = r_{t+1} + \gamma \Big(\min_j Q_{\theta_j}(s_{t+1}, a_{t+1}) - \alpha \log \big(\pi_\phi(a_{t+1}|s_{t+1})\big)\Big)\]
\[J(\phi) = \mathbb{E}_{s_t \sim D,\, a_t \sim \pi_\phi(\cdot|s_t)} \Big[\alpha \log (\pi_\phi (a_t|s_t)) - \min_i Q_{\theta_i}\big(s_t, \pi_\phi(a_t|s_t)\big)\Big]\]

The temperature parameter \(\alpha\) is also automatically adjustable.

\[J(\alpha) = \mathbb{E}_{s_t \sim D,\, a_t \sim \pi_\phi(\cdot|s_t)} \bigg[-\alpha \Big(\log \big(\pi_\phi(a_t|s_t)\big) + H\Big)\bigg]\]

where \(H\) is a target entropy, which is defined as \(\dim a\).

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • temp_learning_rate (float) – Learning rate for temperature parameter.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • temp_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the temperature.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • initial_temperature (float) – Initial temperature value.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.sac.SAC

class d3rlpy.algos.SAC(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.sac_impl.SACImpl, d3rlpy.algos.qlearning.sac.SACConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
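
A minimal sketch of online SAC training, assuming a continuous-control Gym environment and using only the parameters listed above:

import gym
import d3rlpy

env = gym.make("Pendulum-v1")

sac = d3rlpy.algos.SACConfig(
    actor_learning_rate=3e-4,
    critic_learning_rate=3e-4,
    temp_learning_rate=3e-4,
).create(device="cuda:0")

sac.fit_online(env, n_steps=100000)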

DiscreteSAC

class d3rlpy.algos.DiscreteSACConfig(batch_size=64, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, temp_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, temp_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, n_critics=2, initial_temperature=1.0, target_update_interval=8000)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Soft Actor-Critic algorithm for discrete action-space.

This discrete version of SAC is built upon the continuous version of SAC with additional modifications.

The target state-value is calculated as the expectation over all action-values.

\[V(s_t) = \pi_\phi (s_t)^T [Q_\theta(s_t) - \alpha \log (\pi_\phi (s_t))]\]

Similarly, the objective function for the temperature parameter is as follows.

\[J(\alpha) = \pi_\phi (s_t)^T [-\alpha (\log(\pi_\phi (s_t)) + H)]\]

Finally, the objective function for the policy function is as follows.

\[J(\phi) = \mathbb{E}_{s_t \sim D} [\pi_\phi(s_t)^T [\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)]]\]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • temp_learning_rate (float) – Learning rate for temperature parameter.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • temp_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the temperature.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • n_critics (int) – Number of Q functions for ensemble.

  • initial_temperature (float) – Initial temperature value.

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

  • target_update_interval (int) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.sac.DiscreteSAC

class d3rlpy.algos.DiscreteSAC(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.sac_impl.DiscreteSACImpl, d3rlpy.algos.qlearning.sac.DiscreteSACConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

BCQ

class d3rlpy.algos.BCQConfig(batch_size=100, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.001, critic_learning_rate=0.001, imitator_learning_rate=0.001, actor_optim_factory=<factory>, critic_optim_factory=<factory>, imitator_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, imitator_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, update_actor_interval=1, lam=0.75, n_action_samples=100, action_flexibility=0.05, rl_start_step=0, beta=0.5)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Batch-Constrained Q-learning algorithm.

BCQ is the very first practical data-driven deep reinforcement learning algorithm. The major difference from DDPG is that the policy function is represented as a combination of a conditional VAE and a perturbation function in order to remedy the extrapolation error that emerges from target value estimation.

The encoder and the decoder of the conditional VAE are represented as \(E_\omega\) and \(D_\omega\) respectively.

\[L(\omega) = \mathbb{E}_{s_t, a_t \sim D} [(a - \tilde{a})^2 + D_{KL}(N(\mu, \sigma) \| N(0, 1))]\]

where \(\mu, \sigma = E_\omega(s_t, a_t)\), \(\tilde{a} = D_\omega(s_t, z)\) and \(z \sim N(\mu, \sigma)\).

The policy function is represented as a residual function with the VAE and the perturbation function represented as \(\xi_\phi (s, a)\).

\[\pi(s, a) = a + \Phi \xi_\phi (s, a)\]

where \(a = D_\omega (s, z)\), \(z \sim N(0, 0.5)\) and \(\Phi\) is a perturbation scale designated by action_flexibility. Although the policy is learned to stay close to the data distribution, the perturbation function can lead to more highly rewarded states.

BCQ also leverages twin Q functions and computes weighted average over maximum values and minimum values.

\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(y - Q_{\theta_i}(s_t, a_t))^2]\]
\[y = r_{t+1} + \gamma \max_{a_i} [ \lambda \min_j Q_{\theta_j'}(s_{t+1}, a_i) + (1 - \lambda) \max_j Q_{\theta_j'}(s_{t+1}, a_i)]\]

where \(\{a_i \sim D(s_{t+1}, z), z \sim N(0, 0.5)\}_{i=1}^n\). The number of sampled actions is designated with n_action_samples.

Finally, the perturbation function is trained just like DDPG’s policy function.

\[J(\phi) = \mathbb{E}_{s_t \sim D, a_t \sim D_\omega(s_t, z), z \sim N(0, 0.5)} [Q_{\theta_1} (s_t, \pi(s_t, a_t))]\]

At inference time, n_action_samples action candidates are sampled, and the action with the highest value estimate is taken.

\[\pi'(s) = \text{argmax}_{\pi(s, a_i)} Q_{\theta_1} (s, \pi(s, a_i))\]

Note

The greedy action is not deterministic because the action candidates are always randomly sampled. This might affect the save_policy method and the performance in production.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • imitator_learning_rate (float) – Learning rate for Conditional VAE.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • imitator_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the conditional VAE.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • imitator_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the conditional VAE.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • update_actor_interval (int) – Interval to update policy function.

  • lam (float) – Weight factor for critic ensemble.

  • n_action_samples (int) – Number of action samples to estimate action-values.

  • action_flexibility (float) – Output scale of perturbation function represented as \(\Phi\).

  • rl_start_step (int) – Steps before starting to update the policy function and Q functions. If this is large, RL training would be more stable.

  • beta (float) – KL regularization term for the conditional VAE.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.bcq.BCQ

class d3rlpy.algos.BCQ(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.bcq_impl.BCQImpl, d3rlpy.algos.qlearning.bcq.BCQConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

DiscreteBCQ

class d3rlpy.algos.DiscreteBCQConfig(batch_size=32, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=6.25e-05, optim_factory=<factory>, encoder_factory=<factory>, q_func_factory=<factory>, n_critics=1, action_flexibility=0.3, beta=0.5, target_update_interval=8000, share_encoder=True)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Discrete version of Batch-Constrained Q-learning algorithm.

The discrete version borrows the theory from the continuous version, but the algorithm is much simpler. The imitation function \(G_\omega(a|s)\) is trained with supervised learning just like Behavior Cloning.

\[L(\omega) = \mathbb{E}_{a_t, s_t \sim D} [-\sum_a p(a|s_t) \log G_\omega(a|s_t)]\]

With this imitation function, the greedy policy is defined as follows.

\[\pi(s_t) = \text{argmax}_{a|G_\omega(a|s_t) / \max_{\tilde{a}} G_\omega(\tilde{a}|s_t) > \tau} Q_\theta (s_t, a)\]

which eliminates actions whose relative probability with respect to the most likely action is below \(\tau\).

Finally, the loss function is computed in Double DQN style with the above constrained policy.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma Q_{\theta'}(s_{t+1}, \pi(s_{t+1})) - Q_\theta(s_t, a_t))^2]\]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – Encoder factory.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory or str) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • n_critics (int) – Number of Q functions for ensemble.

  • action_flexibility (float) – Probability threshold represented as \(\tau\).

  • beta (float) – Regularization term for the imitation function.

  • target_update_interval (int) – Interval to update the target network.

  • share_encoder (bool) – Flag to share encoder between Q-function and imitation models.

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.bcq.DiscreteBCQ

class d3rlpy.algos.DiscreteBCQ(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.bcq_impl.DiscreteBCQImpl, d3rlpy.algos.qlearning.bcq.DiscreteBCQConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
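
A minimal sketch of offline training with DiscreteBCQ, where action_flexibility corresponds to the threshold \(\tau\) above; dataset is assumed to exist:

bcq = d3rlpy.algos.DiscreteBCQConfig(
    action_flexibility=0.3,
    beta=0.5,
).create(device="cuda:0")

bcq.fit(dataset, n_steps=100000)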

BEAR

class d3rlpy.algos.BEARConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0001, critic_learning_rate=0.0003, imitator_learning_rate=0.0003, temp_learning_rate=0.0001, alpha_learning_rate=0.001, actor_optim_factory=<factory>, critic_optim_factory=<factory>, imitator_optim_factory=<factory>, temp_optim_factory=<factory>, alpha_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, imitator_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, initial_temperature=1.0, initial_alpha=1.0, alpha_threshold=0.05, lam=0.75, n_action_samples=100, n_target_samples=10, n_mmd_action_samples=4, mmd_kernel='laplacian', mmd_sigma=20.0, vae_kl_weight=0.5, warmup_steps=40000)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Bootstrapping Error Accumulation Reduction algorithm.

BEAR is a SAC-based data-driven deep reinforcement learning algorithm.

BEAR constrains the support of the policy function within the data distribution by minimizing the Maximum Mean Discrepancy (MMD) between the policy function and the approximated behavior policy function \(\pi_\beta(a|s)\), which is optimized through an L2 loss.

\[L(\beta) = \mathbb{E}_{s_t, a_t \sim D, a \sim \pi_\beta(\cdot|s_t)} [(a - a_t)^2]\]

The policy objective is a combination of SAC’s objective and MMD penalty.

\[J(\phi) = J_{SAC}(\phi) - \mathbb{E}_{s_t \sim D} \alpha ( \text{MMD}(\pi_\beta(\cdot|s_t), \pi_\phi(\cdot|s_t)) - \epsilon)\]

where MMD is computed as follows.

\[\text{MMD}(x, y) = \frac{1}{N^2} \sum_{i, i'} k(x_i, x_{i'}) - \frac{2}{NM} \sum_{i, j} k(x_i, y_j) + \frac{1}{M^2} \sum_{j, j'} k(y_j, y_{j'})\]

where \(k(x, y)\) is a Gaussian kernel \(k(x, y) = \exp(-\|x - y\|^2 / (2 \sigma^2))\).

\(\alpha\) is also adjustable through dual gradient descent, where \(\alpha\) becomes smaller if MMD is smaller than the threshold \(\epsilon\).

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • imitator_learning_rate (float) – Learning rate for behavior policy function.

  • temp_learning_rate (float) – Learning rate for temperature parameter.

  • alpha_learning_rate (float) – Learning rate for \(\alpha\).

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • imitator_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the behavior policy.

  • temp_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the temperature.

  • alpha_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for \(\alpha\).

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • imitator_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the behavior policy.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • initial_temperature (float) – Initial temperature value.

  • initial_alpha (float) – Initial \(\alpha\) value.

  • alpha_threshold (float) – Threshold value described as \(\epsilon\).

  • lam (float) – Weight for critic ensemble.

  • n_action_samples (int) – Number of action samples to compute the best action.

  • n_target_samples (int) – Number of action samples to compute BCQ-like target value.

  • n_mmd_action_samples (int) – Number of action samples to compute MMD.

  • mmd_kernel (str) – MMD kernel function. The available options are ['gaussian', 'laplacian'].

  • mmd_sigma (float) – \(\sigma\) for gaussian kernel in MMD calculation.

  • vae_kl_weight (float) – Constant weight to scale KL term for behavior policy training.

  • warmup_steps (int) – Number of steps to warmup the policy function.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.bear.BEAR

class d3rlpy.algos.BEAR(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.bear_impl.BEARImpl, d3rlpy.algos.qlearning.bear.BEARConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

CRR

class d3rlpy.algos.CRRConfig(batch_size=100, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, beta=1.0, n_action_samples=4, advantage_type='mean', weight_type='exp', max_weight=20.0, n_critics=1, target_update_type='hard', tau=0.005, target_update_interval=100, update_actor_interval=1)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Critic Regularized Regression algorithm.

CRR is a simple offline RL method similar to AWAC.

The policy is trained as a supervised regression.

\[J(\phi) = \mathbb{E}_{s_t, a_t \sim D} [\log \pi_\phi(a_t|s_t) f(Q_\theta, \pi_\phi, s_t, a_t)]\]

where \(f\) is a filter function, which has several options. The first option is the binary function.

\[f := \mathbb{1} [A_\theta(s, a) > 0]\]

The other is the exponential function.

\[f := \exp(A(s, a) / \beta)\]

\(A(s, a)\) is an advantage function, which also has several options. The first option is mean.

\[A(s, a) = Q_\theta (s, a) - \frac{1}{m} \sum^m_j Q(s, a_j)\]

The other one is max.

\[A(s, a) = Q_\theta (s, a) - \max^m_j Q(s, a_j)\]

where \(a_j \sim \pi_\phi(s)\).

In evaluation, the action is determined by the Critic Weighted Policy (CWP). In CWP, several actions are sampled from the policy function, and the final action is re-sampled according to the estimated action-value distribution.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • beta (float) – Temperature value defined as \(\beta\) above.

  • n_action_samples (int) – Number of sampled actions to calculate \(A(s, a)\) and for CWP.

  • advantage_type (str) – Advantage function type. The available options are ['mean', 'max'].

  • weight_type (str) – Filter function type. The available options are ['binary', 'exp'].

  • max_weight (float) – Maximum weight for cross-entropy loss.

  • n_critics (int) – Number of Q functions for ensemble.

  • target_update_type (str) – Target update type. The available options are ['hard', 'soft'].

  • tau (float) – Target network synchronization coefficient used with soft target update.

  • update_actor_interval (int) – Interval to update policy function used with hard target update.

  • target_update_interval (int) – Interval to synchronize the target network used with hard target update.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is a boolean and True, cuda:0 will be used. If the value is an integer, cuda:<device> will be used. If the value is a string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.crr.CRR

class d3rlpy.algos.CRR(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.crr_impl.CRRImpl, d3rlpy.algos.qlearning.crr.CRRConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
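
A minimal usage sketch is shown below. The dataset helper and hyperparameter values are illustrative, not recommendations; any ReplayBuffer created from your own data works the same way.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

# exponential filter with mean-based advantage, as described above
crr = d3rlpy.algos.CRRConfig(
    weight_type="exp",
    advantage_type="mean",
    beta=1.0,
    n_action_samples=4,
).create(device="cuda:0")

crr.fit(dataset, n_steps=100000, n_steps_per_epoch=1000)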

CQL

class d3rlpy.algos.CQLConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0001, critic_learning_rate=0.0003, temp_learning_rate=0.0001, alpha_learning_rate=0.0001, actor_optim_factory=<factory>, critic_optim_factory=<factory>, temp_optim_factory=<factory>, alpha_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, initial_temperature=1.0, initial_alpha=1.0, alpha_threshold=10.0, conservative_weight=5.0, n_action_samples=10, soft_q_backup=False)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Conservative Q-Learning algorithm.

CQL is a SAC-based data-driven deep reinforcement learning algorithm, which achieves state-of-the-art performance in offline RL problems.

CQL mitigates overestimation error by minimizing action-values under the current policy while maximizing action-values under the data distribution to counteract underestimation.

\[L(\theta_i) = \alpha\, \mathbb{E}_{s_t \sim D} \left[\log{\sum_a \exp{Q_{\theta_i}(s_t, a)}} - \mathbb{E}_{a \sim D} \big[Q_{\theta_i}(s_t, a)\big] - \tau\right] + L_\mathrm{SAC}(\theta_i)\]

where \(\alpha\) is an automatically adjusted value via Lagrangian dual gradient descent and \(\tau\) is a threshold value. If the action-value difference is smaller than \(\tau\), \(\alpha\) becomes smaller. Otherwise, \(\alpha\) becomes larger to penalize action-values more aggressively.

In continuous control, \(\log{\sum_a \exp{Q(s, a)}}\) is computed as follows.

\[\log{\sum_a \exp{Q(s, a)}} \approx \log{\left( \frac{1}{2N} \sum_{a_i \sim \text{Unif}(a)}^N \left[\frac{\exp{Q(s, a_i)}}{\text{Unif}(a)}\right] + \frac{1}{2N} \sum_{a_i \sim \pi_\phi(a|s)}^N \left[\frac{\exp{Q(s, a_i)}}{\pi_\phi(a_i|s)}\right]\right)}\]

where \(N\) is the number of sampled actions.

The rest of the optimization is exactly the same as d3rlpy.algos.SAC.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • temp_learning_rate (float) – Learning rate for temperature parameter of SAC.

  • alpha_learning_rate (float) – Learning rate for \(\alpha\).

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • temp_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the temperature.

  • alpha_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for \(\alpha\).

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • initial_temperature (float) – Initial temperature value.

  • initial_alpha (float) – Initial \(\alpha\) value.

  • alpha_threshold (float) – Threshold value described as \(\tau\).

  • conservative_weight (float) – Constant weight to scale conservative loss.

  • n_action_samples (int) – Number of sampled actions to compute \(\log{\sum_a \exp{Q(s, a)}}\).

  • soft_q_backup (bool) – Flag to use SAC-style backup.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.cql.CQL

class d3rlpy.algos.CQL(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.cql_impl.CQLImpl, d3rlpy.algos.qlearning.cql.CQLConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
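
A minimal usage sketch follows; the dataset helper and hyperparameter values are illustrative only.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

# conservative_weight scales the conservative loss and
# alpha_threshold is the threshold tau described above
cql = d3rlpy.algos.CQLConfig(
    conservative_weight=5.0,
    alpha_threshold=10.0,
    n_action_samples=10,
).create(device="cuda:0")

cql.fit(dataset, n_steps=100000, n_steps_per_epoch=1000)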

DiscreteCQL

class d3rlpy.algos.DiscreteCQLConfig(batch_size=32, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, learning_rate=6.25e-05, optim_factory=<factory>, encoder_factory=<factory>, q_func_factory=<factory>, n_critics=1, target_update_interval=8000, alpha=1.0)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Discrete version of Conservative Q-Learning algorithm.

The discrete version of CQL is a DoubleDQN-based data-driven deep reinforcement learning algorithm (the original paper uses DQN), which achieves state-of-the-art performance in offline RL problems.

CQL mitigates overestimation error by minimizing action-values under the current policy while maximizing action-values under the data distribution to counteract underestimation.

\[L(\theta) = \alpha \mathbb{E}_{s_t \sim D} [\log{\sum_a \exp{Q_{\theta}(s_t, a)}} - \mathbb{E}_{a \sim D} [Q_{\theta}(s, a)]] + L_{DoubleDQN}(\theta)\]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • learning_rate (float) – Learning rate.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • n_critics (int) – Number of Q functions for ensemble.

  • target_update_interval (int) – Interval to synchronize the target network.

  • alpha (float) – \(\alpha\) value above.

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.cql.DiscreteCQL

class d3rlpy.algos.DiscreteCQL(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.cql_impl.DiscreteCQLImpl, d3rlpy.algos.qlearning.cql.DiscreteCQLConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
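
A minimal usage sketch with a discrete-action dataset; get_cartpole is used here only as an example dataset helper and the values are illustrative.

import d3rlpy

dataset, env = d3rlpy.datasets.get_cartpole()

discrete_cql = d3rlpy.algos.DiscreteCQLConfig(
    alpha=1.0,
    target_update_interval=8000,
).create(device="cpu:0")

discrete_cql.fit(dataset, n_steps=100000, n_steps_per_epoch=1000)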

AWAC

class d3rlpy.algos.AWACConfig(batch_size=1024, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, lam=1.0, n_action_samples=1, n_critics=2)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Advantage Weighted Actor-Critic algorithm.

AWAC is a TD3-based actor-critic algorithm that enables efficient fine-tuning: the policy is pretrained on offline datasets and then deployed to online training.

The policy is trained as a supervised regression.

\[J(\phi) = \mathbb{E}_{s_t, a_t \sim D} [\log \pi_\phi(a_t|s_t) \exp(\frac{1}{\lambda} A^\pi (s_t, a_t))]\]

where \(A^\pi (s_t, a_t) = Q_\theta(s_t, a_t) - Q_\theta(s_t, a'_t)\) and \(a'_t \sim \pi_\phi(\cdot|s_t)\)

The key difference from AWR is that AWAC uses a Q-function trained via TD learning for better sample-efficiency.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • lam (float) – \(\lambda\) for weight calculation.

  • n_action_samples (int) – Number of sampled actions to calculate \(A^\pi(s_t, a_t)\).

  • n_critics (int) – Number of Q functions for ensemble.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.awac.AWAC

class d3rlpy.algos.AWAC(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.awac_impl.AWACImpl, d3rlpy.algos.qlearning.awac.AWACConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
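
The offline-to-online workflow can be sketched as follows. The buffer helper (create_fifo_replay_buffer) and the step counts are assumptions for illustration, not tuned settings.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

awac = d3rlpy.algos.AWACConfig(lam=1.0).create(device="cuda:0")

# offline pretraining
awac.fit(dataset, n_steps=100000, n_steps_per_epoch=1000)

# online fine-tuning with a FIFO replay buffer
buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)
awac.fit_online(env, buffer, n_steps=100000)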

PLAS

class d3rlpy.algos.PLASConfig(batch_size=100, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0001, critic_learning_rate=0.001, imitator_learning_rate=0.0001, actor_optim_factory=<factory>, critic_optim_factory=<factory>, imitator_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, imitator_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, lam=0.75, warmup_steps=500000, beta=0.5)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Policy in Latent Action Space algorithm.

PLAS is an offline deep reinforcement learning algorithm whose policy function is trained in the latent space of a Conditional VAE. Unlike other algorithms, PLAS can achieve good performance with its less constrained policy function.

\[a \sim p_\beta (a|s, z=\pi_\phi(s))\]

where \(\beta\) is a parameter of the decoder in Conditional VAE.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • imitator_learning_rate (float) – Learning rate for Conditional VAE.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • imitator_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the conditional VAE.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • imitator_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the conditional VAE.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • lam (float) – Weight factor for critic ensemble.

  • warmup_steps (int) – Number of steps to warmup the VAE.

  • beta (float) – KL regularization term for Conditional VAE.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.plas.PLAS

class d3rlpy.algos.PLAS(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.plas_impl.PLASImpl, d3rlpy.algos.qlearning.plas.PLASConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
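
A minimal usage sketch; warmup_steps controls how long the Conditional VAE is pretrained before the policy and critic updates start. The values are illustrative.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

plas = d3rlpy.algos.PLASConfig(
    warmup_steps=100000,
    lam=0.75,
    beta=0.5,
).create(device="cuda:0")

plas.fit(dataset, n_steps=500000, n_steps_per_epoch=1000)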

PLAS+P

class d3rlpy.algos.PLASWithPerturbationConfig(batch_size=100, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0001, critic_learning_rate=0.001, imitator_learning_rate=0.0001, actor_optim_factory=<factory>, critic_optim_factory=<factory>, imitator_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, imitator_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, lam=0.75, warmup_steps=500000, beta=0.5, action_flexibility=0.05)[source]

Bases: d3rlpy.algos.qlearning.plas.PLASConfig

Config of Policy in Latent Action Space algorithm with perturbation layer.

The perturbation layer enables PLAS to output out-of-distribution actions.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • imitator_learning_rate (float) – Learning rate for Conditional VAE.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • imitator_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the conditional VAE.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • imitator_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the conditional VAE.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • update_actor_interval (int) – Interval to update policy function.

  • lam (float) – Weight factor for critic ensemble.

  • action_flexibility (float) – Output scale of perturbation layer.

  • warmup_steps (int) – Number of steps to warmup the VAE.

  • beta (float) – KL regularization term for Conditional VAE.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.plas.PLASWithPerturbation

class d3rlpy.algos.PLASWithPerturbation(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.plas_impl.PLASImpl, d3rlpy.algos.qlearning.plas.PLASConfig]
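
The perturbation variant is configured the same way as PLAS; action_flexibility bounds the perturbation output. A sketch with illustrative values:

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

plas_p = d3rlpy.algos.PLASWithPerturbationConfig(
    action_flexibility=0.05,
    warmup_steps=100000,
).create(device="cuda:0")

plas_p.fit(dataset, n_steps=500000, n_steps_per_epoch=1000)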

TD3+BC

class d3rlpy.algos.TD3PlusBCConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, q_func_factory=<factory>, tau=0.005, n_critics=2, target_smoothing_sigma=0.2, target_smoothing_clip=0.5, alpha=2.5, update_actor_interval=2)[source]

Bases: d3rlpy.base.LearnableConfig

Config of TD3+BC algorithm.

TD3+BC is a simple offline RL algorithm built on top of TD3. TD3+BC introduces a BC-regularized policy objective function.

\[J(\phi) = \mathbb{E}_{s,a \sim D} [\lambda Q(s, \pi(s)) - (a - \pi(s))^2]\]

where

\[\lambda = \frac{\alpha}{\frac{1}{N} \sum_{(s_i, a_i)} |Q(s_i, a_i)|}\]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for a policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • q_func_factory (d3rlpy.models.q_functions.QFunctionFactory) – Q function factory.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • target_smoothing_sigma (float) – Standard deviation for target noise.

  • target_smoothing_clip (float) – Clipping range for target noise.

  • alpha (float) – \(\alpha\) value.

  • update_actor_interval (int) – Interval to update policy function described as delayed policy update in the paper.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.td3_plus_bc.TD3PlusBC

class d3rlpy.algos.TD3PlusBC(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.td3_plus_bc_impl.TD3PlusBCImpl, d3rlpy.algos.qlearning.td3_plus_bc.TD3PlusBCConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
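
A minimal usage sketch; alpha trades off the Q-term against the BC term as in the objective above. The values are illustrative.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

td3_plus_bc = d3rlpy.algos.TD3PlusBCConfig(
    alpha=2.5,
    update_actor_interval=2,
).create(device="cuda:0")

td3_plus_bc.fit(dataset, n_steps=100000, n_steps_per_epoch=1000)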

IQL

class d3rlpy.algos.IQLConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=<factory>, critic_optim_factory=<factory>, actor_encoder_factory=<factory>, critic_encoder_factory=<factory>, value_encoder_factory=<factory>, tau=0.005, n_critics=2, expectile=0.7, weight_temp=3.0, max_weight=100.0)[source]

Bases: d3rlpy.base.LearnableConfig

Config of Implicit Q-Learning algorithm.

IQL is an offline RL algorithm that avoids ever querying values of unseen actions while still being able to perform multi-step dynamic programming updates.

There are three functions to train in IQL. First, the state-value function is trained via expectile regression.

\[L_V(\psi) = \mathbb{E}_{(s, a) \sim D} [L_2^\tau (Q_\theta (s, a) - V_\psi (s))]\]

where \(L_2^\tau (u) = |\tau - \mathbb{1}(u < 0)|u^2\).

The Q-function is trained with the state-value function to avoid querying values of unseen actions.

\[L_Q(\theta) = \mathbb{E}_{(s, a, r, s') \sim D} [(r + \gamma V_\psi(s') - Q_\theta(s, a))^2]\]

Finally, the policy function is trained by using advantage weighted regression.

\[L_\pi (\phi) = \mathbb{E}_{(s, a) \sim D} [\exp(\beta (Q_\theta(s, a) - V_\psi(s))) \log \pi_\phi(a|s)]\]

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • actor_learning_rate (float) – Learning rate for policy function.

  • critic_learning_rate (float) – Learning rate for Q functions.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the critic.

  • value_encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory for the value function.

  • batch_size (int) – Mini-batch size.

  • gamma (float) – Discount factor.

  • tau (float) – Target network synchronization coefficient.

  • n_critics (int) – Number of Q functions for ensemble.

  • expectile (float) – Expectile value for value function training.

  • weight_temp (float) – Inverse temperature value represented as \(\beta\).

  • max_weight (float) – Maximum advantage weight value to clip.

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.iql.IQL

class d3rlpy.algos.IQL(config, device, impl=None)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.torch.iql_impl.IQLImpl, d3rlpy.algos.qlearning.iql.IQLConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
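
A minimal usage sketch; expectile sets the expectile regression target and weight_temp is the inverse temperature \(\beta\) above. The values are illustrative.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

iql = d3rlpy.algos.IQLConfig(
    expectile=0.7,
    weight_temp=3.0,
    max_weight=100.0,
).create(device="cuda:0")

iql.fit(dataset, n_steps=100000, n_steps_per_epoch=1000)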

RandomPolicy

class d3rlpy.algos.RandomPolicyConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, distribution='uniform', normal_std=1.0)[source]

Bases: d3rlpy.base.LearnableConfig

Random Policy for continuous control algorithm.

This is designed for data collection and lightweight interaction tests. fit and fit_online methods will raise exceptions.

Parameters
  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • distribution (str) – Random distribution. Available options are ['uniform', 'normal'].

  • normal_std (float) – Standard deviation of the normal distribution. This is only used when distribution='normal'.

  • batch_size (int) –

  • gamma (float) –

  • observation_scaler (Optional[d3rlpy.preprocessing.observation_scalers.ObservationScaler]) –

  • reward_scaler (Optional[d3rlpy.preprocessing.reward_scalers.RewardScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.random_policy.RandomPolicy

class d3rlpy.algos.RandomPolicy(config)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[None, d3rlpy.algos.qlearning.random_policy.RandomPolicyConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

predict(x)[source]

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters

x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations

Returns

Greedy actions

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

predict_value(x, action)[source]

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
Parameters
Returns

Predicted action-values

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

sample_action(x)[source]

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.

Parameters

x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations.

Returns

Sampled actions.

Return type

numpy.ndarray[Any, numpy.dtype[Any]]
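
A sketch of using RandomPolicy for data collection. build_with_env is used here to register the action size before sampling; the dataset helper and the observation batch are purely illustrative.

import numpy as np

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

random_policy = d3rlpy.algos.RandomPolicyConfig(distribution="uniform").create()
random_policy.build_with_env(env)

# sample random actions for a batch of observations
observations = np.stack([env.observation_space.sample() for _ in range(10)])
actions = random_policy.sample_action(observations)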

DiscreteRandomPolicy

class d3rlpy.algos.DiscreteRandomPolicyConfig(batch_size=256, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None)[source]

Bases: d3rlpy.base.LearnableConfig

Random Policy for discrete control algorithm.

This is designed for data collection and lightweight interaction tests. fit and fit_online methods will raise exceptions.

Parameters
  • batch_size (int) –

  • gamma (float) –

  • observation_scaler (Optional[d3rlpy.preprocessing.observation_scalers.ObservationScaler]) –

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

  • reward_scaler (Optional[d3rlpy.preprocessing.reward_scalers.RewardScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.qlearning.random_policy.DiscreteRandomPolicy

class d3rlpy.algos.DiscreteRandomPolicy(config)[source]

Bases: d3rlpy.algos.qlearning.base.QLearningAlgoBase[None, d3rlpy.algos.qlearning.random_policy.DiscreteRandomPolicyConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

predict(x)[source]

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters

x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations

Returns

Greedy actions

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

predict_value(x, action)[source]

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
Parameters
Returns

Predicted action-values

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

sample_action(x)[source]

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.

Parameters

x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations.

Returns

Sampled actions.

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

Decision Transformer

Decision Transformer-based algorithms usually require tricky interaction code for evaluation. In d3rlpy, these algorithms provide an as_stateful_wrapper method to easily integrate them into your system.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

dt = d3rlpy.algos.DecisionTransformerConfig().create(device="cuda:0")

# offline training
dt.fit(
   dataset,
   n_steps=100000,
   n_steps_per_epoch=1000,
   eval_env=env,
   eval_target_return=0,  # specify target environment return
)

# wrap as stateful actor for interaction
actor = dt.as_stateful_wrapper(target_return=0)

# interaction
observation, _ = env.reset()
reward = 0.0
while True:
    action = actor.predict(observation, reward)
    observation, reward, done, truncated, _ = env.step(action)
    if done or truncated:
        break

# reset history
actor.reset()

TransformerAlgoBase

class d3rlpy.algos.TransformerAlgoBase(config, device, impl=None)[source]

Bases: Generic[d3rlpy.algos.transformer.base.TTransformerImpl, d3rlpy.algos.transformer.base.TTransformerConfig], d3rlpy.base.LearnableBase[d3rlpy.algos.transformer.base.TTransformerImpl, d3rlpy.algos.transformer.base.TTransformerConfig]

as_stateful_wrapper(target_return, action_sampler=None)[source]

Returns a wrapped Transformer algorithm for stateful decision making.

Parameters
Returns

StatefulTransformerWrapper object.

Return type

d3rlpy.algos.transformer.base.StatefulTransformerWrapper[d3rlpy.algos.transformer.base.TTransformerImpl, d3rlpy.algos.transformer.base.TTransformerConfig]

fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, eval_env=None, eval_target_return=None, eval_action_sampler=None, save_interval=1, callback=None)[source]

Trains with given dataset.

Parameters
  • dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – Offline dataset to train.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • eval_env (Optional[Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]]) – Evaluation environment.

  • eval_target_return (Optional[float]) – Evaluation return target.

  • eval_action_sampler (Optional[d3rlpy.algos.transformer.action_samplers.TransformerActionSampler]) – Action sampler used in evaluation.

  • save_interval (int) – Interval to save parameters.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

Return type

None

predict(inpt)[source]

Returns action.

This is for internal use. For evaluation, use StatefulTransformerWrapper instead.

Parameters

inpt (d3rlpy.algos.transformer.inputs.TransformerInput) – Sequence input.

Returns

Action.

Return type

numpy.ndarray[Any, numpy.dtype[Any]]

update(batch)[source]

Update parameters with mini-batch of data.

Parameters

batch (d3rlpy.dataset.mini_batch.TrajectoryMiniBatch) – Mini-batch data.

Returns

Dictionary of metrics.

Return type

Dict[str, float]

DecisionTransformer

class d3rlpy.algos.DecisionTransformerConfig(batch_size=64, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, context_size=20, max_timestep=1000, learning_rate=0.0001, encoder_factory=<factory>, optim_factory=<factory>, num_heads=1, num_layers=3, attn_dropout=0.1, resid_dropout=0.1, embed_dropout=0.1, activation_type='relu', position_encoding_type=<PositionEncodingType.SIMPLE: 'simple'>, warmup_steps=10000, clip_grad_norm=0.25, compile=False)[source]

Bases: d3rlpy.algos.transformer.base.TransformerConfig

Config of Decision Transformer.

Decision Transformer solves decision-making problems as a sequence modeling problem.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • action_scaler (d3rlpy.preprocessing.ActionScaler) – Action preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • context_size (int) – Prior sequence length.

  • max_timestep (int) – Maximum environmental timestep.

  • batch_size (int) – Mini-batch size.

  • learning_rate (float) – Learning rate.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • num_heads (int) – Number of attention heads.

  • num_layers (int) – Number of attention blocks.

  • attn_dropout (float) – Dropout probability for attentions.

  • resid_dropout (float) – Dropout probability for residual connection.

  • embed_dropout (float) – Dropout probability for embeddings.

  • activation_type (str) – Type of activation function.

  • position_encoding_type (d3rlpy.PositionEncodingType) – Type of positional encoding (SIMPLE or GLOBAL).

  • warmup_steps (int) – Warmup steps for learning rate scheduler.

  • clip_grad_norm (float) – Norm of gradient clipping.

  • compile (bool) – (experimental) Flag to enable JIT compilation.

  • gamma (float) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.transformer.decision_transformer.DecisionTransformer

class d3rlpy.algos.DecisionTransformer(config, device, impl=None)[source]

Bases: d3rlpy.algos.transformer.base.TransformerAlgoBase[d3rlpy.algos.transformer.torch.decision_transformer_impl.DecisionTransformerImpl, d3rlpy.algos.transformer.decision_transformer.DecisionTransformerConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace
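
A configuration sketch using the parameters above; the values shown mirror the defaults and are illustrative, not recommendations.

import d3rlpy

dt = d3rlpy.algos.DecisionTransformerConfig(
    context_size=20,
    num_heads=1,
    num_layers=3,
    learning_rate=1e-4,
    warmup_steps=10000,
).create(device="cuda:0")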

DiscreteDecisionTransformer

class d3rlpy.algos.DiscreteDecisionTransformerConfig(batch_size=128, gamma=0.99, observation_scaler=None, action_scaler=None, reward_scaler=None, context_size=20, max_timestep=1000, learning_rate=0.0006, encoder_factory=<factory>, optim_factory=<factory>, num_heads=8, num_layers=6, attn_dropout=0.1, resid_dropout=0.1, embed_dropout=0.1, activation_type='gelu', embed_activation_type='tanh', position_encoding_type=<PositionEncodingType.GLOBAL: 'global'>, warmup_tokens=10240, final_tokens=30000000, clip_grad_norm=1.0, compile=False)[source]

Bases: d3rlpy.algos.transformer.base.TransformerConfig

Config of Decision Transformer for discrete action-space.

Decision Transformer solves decision-making problems as a sequence modeling problem.

References

Parameters
  • observation_scaler (d3rlpy.preprocessing.ObservationScaler) – Observation preprocessor.

  • reward_scaler (d3rlpy.preprocessing.RewardScaler) – Reward preprocessor.

  • context_size (int) – Prior sequence length.

  • max_timestep (int) – Maximum environmental timestep.

  • batch_size (int) – Mini-batch size.

  • learning_rate (float) – Learning rate.

  • encoder_factory (d3rlpy.models.encoders.EncoderFactory) – Encoder factory.

  • optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – Optimizer factory.

  • num_heads (int) – Number of attention heads.

  • num_layers (int) – Number of attention blocks.

  • attn_dropout (float) – Dropout probability for attentions.

  • resid_dropout (float) – Dropout probability for residual connection.

  • embed_dropout (float) – Dropout probability for embeddings.

  • activation_type (str) – Type of activation function.

  • embed_activation_type (str) – Type of activation function applied to embeddings.

  • position_encoding_type (d3rlpy.PositionEncodingType) – Type of positional encoding (SIMPLE or GLOBAL).

  • warmup_tokens (int) – Number of tokens to warmup learning rate scheduler.

  • final_tokens (int) – Final number of tokens for learning rate scheduler.

  • clip_grad_norm (float) – Norm of gradient clipping.

  • compile (bool) – (experimental) Flag to enable JIT compilation.

  • gamma (float) –

  • action_scaler (Optional[d3rlpy.preprocessing.action_scalers.ActionScaler]) –

Return type

None

create(device=False)[source]

Returns algorithm object.

Parameters

device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

d3rlpy.algos.transformer.decision_transformer.DiscreteDecisionTransformer

class d3rlpy.algos.DiscreteDecisionTransformer(config, device, impl=None)[source]

Bases: d3rlpy.algos.transformer.base.TransformerAlgoBase[d3rlpy.algos.transformer.torch.decision_transformer_impl.DiscreteDecisionTransformerImpl, d3rlpy.algos.transformer.decision_transformer.DiscreteDecisionTransformerConfig]

get_action_type()[source]

Returns action type (continuous or discrete).

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

TransformerActionSampler

TransformerActionSampler is an interface for sampling actions from Decision Transformer outputs. The default action-sampler is used unless you explicitly specify one.

import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

dt = d3rlpy.algos.DecisionTransformerConfig().create(device="cuda:0")

# offline training
dt.fit(
   dataset,
   n_steps=100000,
   n_steps_per_epoch=1000,
   eval_env=env,
   eval_target_return=0,
   # manually specify action-sampler
   eval_action_sampler=d3rlpy.algos.IdentityTransformerActionSampler(),
)

# wrap as stateful actor for interaction with manually specified action-sampler
actor = dt.as_stateful_wrapper(
    target_return=0,
    action_sampler=d3rlpy.algos.IdentityTransformerActionSampler(),
)

d3rlpy.algos.TransformerActionSampler

Interface of TransformerActionSampler.

d3rlpy.algos.SoftmaxTransformerActionSampler

Softmax action-sampler.

d3rlpy.algos.GreedyTransformerActionSampler

Greedy action-sampler.