d3rlpy.algos.TD3¶
- class d3rlpy.algos.TD3(*, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), critic_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), actor_encoder_factory='default', critic_encoder_factory='default', q_func_factory='mean', batch_size=100, n_frames=1, n_steps=1, gamma=0.99, tau=0.005, n_critics=2, target_reduction_type='min', target_smoothing_sigma=0.2, target_smoothing_clip=0.5, update_actor_interval=2, use_gpu=False, scaler=None, action_scaler=None, reward_scaler=None, impl=None, **kwargs)[source]¶
Twin Delayed Deep Deterministic Policy Gradients algorithm.
TD3 is an improved DDPG-based algorithm. Major differences from DDPG are as follows.
TD3 has twin Q functions to reduce overestimation bias in TD learning. The number of Q functions can be designated by n_critics.
TD3 adds noise to target value estimation to avoid overfitting with the deterministic policy.
TD3 updates the policy function after several Q function updates in order to reduce variance of action-value estimation. The interval of the policy function update can be designated by update_actor_interval.
\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma \min_j Q_{\theta_j'}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) - Q_{\theta_i}(s_t, a_t))^2]\]
\[J(\phi) = \mathbb{E}_{s_t \sim D} [\min_i Q_{\theta_i}(s_t, \pi_\phi(s_t))]\]
where \(\epsilon \sim \mathrm{clip}(N(0, \sigma), -c, c)\).
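A minimal usage sketch with the bundled pendulum dataset (get_pendulum is one of d3rlpy's built-in dataset helpers; any continuous-action MDPDataset works):

import d3rlpy

# load a small continuous-control dataset and its environment
dataset, env = d3rlpy.datasets.get_pendulum()

# the three TD3-specific knobs described above
td3 = d3rlpy.algos.TD3(
    n_critics=2,                 # twin Q functions
    target_smoothing_sigma=0.2,  # noise added to target actions
    update_actor_interval=2,     # delayed policy updates
)
td3.fit(dataset, n_steps=100000)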
References
Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods. https://arxiv.org/abs/1802.09477
- Parameters
actor_learning_rate (float) – learning rate for a policy function.
critic_learning_rate (float) – learning rate for Q functions.
actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory for the actor.
critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory for the critic.
actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory for the actor.
critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory for the critic.
q_func_factory (d3rlpy.models.q_functions.QFunctionFactory or str) – Q function factory.
batch_size (int) – mini-batch size.
n_frames (int) – the number of frames to stack for image observation.
n_steps (int) – N-step TD calculation.
gamma (float) – discount factor.
tau (float) – target network synchronization coefficient.
n_critics (int) – the number of Q functions for ensemble.
target_reduction_type (str) – ensemble reduction method at target value estimation. The available options are ['min', 'max', 'mean', 'mix', 'none'].
target_smoothing_sigma (float) – standard deviation for target noise.
target_smoothing_clip (float) – clipping range for target noise.
update_actor_interval (int) – interval to update policy function described as delayed policy update in the paper.
use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are ['pixel', 'min_max', 'standard'].
action_scaler (d3rlpy.preprocessing.ActionScaler or str) – action preprocessor. The available options are ['min_max'].
reward_scaler (d3rlpy.preprocessing.RewardScaler or str) – reward preprocessor. The available options are ['clip', 'min_max', 'standard'].
impl (d3rlpy.algos.torch.td3_impl.TD3Impl) – algorithm implementation.
kwargs (Any) –
Methods
- build_with_dataset(dataset)¶
Instantiate implementation object with MDPDataset object.
- Parameters
dataset (d3rlpy.dataset.MDPDataset) – dataset.
- Return type
None
- build_with_env(env)¶
Instantiate implementation object with OpenAI Gym object.
- Parameters
env (gym.core.Env) – gym-like environment.
- Return type
None
- collect(env, buffer=None, explorer=None, n_steps=1000000, show_progress=True, timelimit_aware=True)¶
Collects data via interaction with environment.
If buffer is not given, a ReplayBuffer will be internally created.
- Parameters
env (gym.core.Env) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_steps (int) – the number of total steps to collect.
show_progress (bool) – flag to show progress bar for iterations.
timelimit_aware (bool) – flag to turn the terminal flag False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
- Returns
replay buffer with the collected data.
- Return type
d3rlpy.online.buffers.Buffer
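Continuing the sketch above, a rough example of data collection with Gaussian exploration noise (the environment name, buffer size, and noise scale are placeholder choices):

import gym
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import NormalNoise

env = gym.make('Pendulum-v0')
buffer = ReplayBuffer(maxlen=100000, env=env)
explorer = NormalNoise(mean=0.0, std=0.1)

# interact with the environment and store transitions in the buffer
buffer = td3.collect(env, buffer=buffer, explorer=explorer, n_steps=10000)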
- copy_policy_from(algo)¶
Copies policy parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
- Parameters
algo (d3rlpy.algos.base.AlgoBase) – algorithm object.
- Return type
None
- copy_q_function_from(algo)¶
Copies Q-function parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
- Parameters
algo (d3rlpy.algos.base.AlgoBase) – algorithm object.
- Return type
None
- create_impl(observation_shape, action_size)¶
Instantiate implementation objects with the dataset shapes.
This method will be used internally when the fit method is called.
- Parameters
observation_shape (Sequence[int]) – observation shape.
action_size (int) – dimension of action-space.
- Return type
None
- fit(dataset, n_epochs=None, n_steps=None, n_steps_per_epoch=10000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, eval_episodes=None, save_interval=1, scorers=None, shuffle=True, callback=None)¶
Trains with the given dataset.
algo.fit(episodes, n_steps=1000000)
- Parameters
dataset (Union[List[d3rlpy.dataset.Episode], d3rlpy.dataset.MDPDataset]) – list of episodes to train.
n_epochs (Optional[int]) – the number of epochs to train.
n_steps (Optional[int]) – the number of steps to train.
n_steps_per_epoch (int) – the number of steps per epoch. This value will be ignored when n_steps is None.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
callback (Optional[Callable[[d3rlpy.base.LearnableBase, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called every step.
- Returns
list of result tuples (epoch, metrics) per epoch.
- Return type
List[Tuple[int, Dict[str, float]]]
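Continuing the sketch above, an offline training run with evaluation scorers (the train/test split point is arbitrary):

from d3rlpy.metrics.scorer import td_error_scorer
from d3rlpy.metrics.scorer import average_value_estimation_scorer

# hold out some episodes for evaluation
train_episodes = dataset.episodes[:100]
test_episodes = dataset.episodes[100:]

results = td3.fit(
    train_episodes,
    n_steps=100000,
    eval_episodes=test_episodes,
    scorers={
        'td_error': td_error_scorer,
        'value_scale': average_value_estimation_scorer,
    },
)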
- fit_batch_online(env, buffer=None, explorer=None, n_epochs=1000, n_steps_per_epoch=1000, n_updates_per_epoch=1000, eval_interval=10, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, timelimit_aware=True, callback=None)¶
Start training loop of batch online deep reinforcement learning.
- Parameters
env (d3rlpy.envs.batch.BatchEnv) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.BatchBuffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_epochs (int) – the number of epochs to train.
n_steps_per_epoch (int) – the number of steps per epoch.
n_updates_per_epoch (int) – the number of updates per epoch.
eval_interval (int) – the number of epochs before evaluation.
eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
save_interval (int) – the number of epochs before saving models.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
timelimit_aware (bool) – flag to turn the terminal flag False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
callback (Optional[Callable[[d3rlpy.online.iterators.AlgoProtocol, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called at the end of epochs.
- Return type
None
- fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, timelimit_aware=True, callback=None)¶
Start training loop of online deep reinforcement learning.
- Parameters
env (gym.core.Env) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_steps (int) – the number of total steps to train.
n_steps_per_epoch (int) – the number of steps per epoch.
update_interval (int) – the number of steps per update.
update_start_step (int) – the steps before starting updates.
eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
save_interval (int) – the number of epochs before saving models.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
timelimit_aware (bool) – flag to turn the terminal flag False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
callback (Optional[Callable[[d3rlpy.online.iterators.AlgoProtocol, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called at the end of epochs.
- Return type
None
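Continuing the sketch above, an online training run (the environment name, buffer size, and noise scale are placeholder choices):

import gym
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import NormalNoise

env = gym.make('Pendulum-v0')
eval_env = gym.make('Pendulum-v0')

td3.fit_online(
    env,
    buffer=ReplayBuffer(maxlen=100000, env=env),
    explorer=NormalNoise(mean=0.0, std=0.1),
    n_steps=100000,
    eval_env=eval_env,
)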
- fitter(dataset, n_epochs=None, n_steps=None, n_steps_per_epoch=10000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, eval_episodes=None, save_interval=1, scorers=None, shuffle=True, callback=None)¶
- Iterate over epochs and steps to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.
for epoch, metrics in algo.fitter(episodes):
    my_plot(metrics)
    algo.save_model(my_path)
- Parameters
dataset (Union[List[d3rlpy.dataset.Episode], d3rlpy.dataset.MDPDataset]) – list of episodes to train.
n_epochs (Optional[int]) – the number of epochs to train.
n_steps (Optional[int]) – the number of steps to train.
n_steps_per_epoch (int) – the number of steps per epoch. This value will be ignored when n_steps is None.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
callback (Optional[Callable[[d3rlpy.base.LearnableBase, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called every step.
- Returns
iterator yielding current epoch and metrics dict.
- Return type
Generator[Tuple[int, Dict[str, float]], None, None]
- classmethod from_json(fname, use_gpu=False)¶
Returns algorithm configured with json file.
The JSON file should be the one saved during fitting.
from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
- generate_new_data(transitions)¶
Returns generated transitions for data augmentation.
This method is for model-based RL algorithms.
- Parameters
transitions (List[d3rlpy.dataset.Transition]) – list of transitions.
- Returns
list of new transitions.
- Return type
Optional[List[d3rlpy.dataset.Transition]]
- get_action_type()[source]¶
Returns action type (continuous or discrete).
- Returns
action type.
- Return type
d3rlpy.constants.ActionSpace
- get_params(deep=True)¶
Returns all attributes.
This method returns all attributes, including ones in subclasses. Some scikit-learn utilities use this method.
params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
- load_model(fname)¶
Loads neural network parameters.
algo.load_model('model.pt')
- predict(x)¶
Returns greedy actions.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.predict(x)

# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observations
- Returns
greedy actions
- Return type
np.ndarray
- predict_value(x, action, with_std=False)¶
Returns predicted action-values.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape == (100,)
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observations
action (Union[numpy.ndarray, List[Any]]) – actions
with_std (bool) – flag to return standard deviation of ensemble estimation. This deviation reflects uncertainty for the given observations. This uncertainty will be more accurate if you enable the bootstrap flag and increase the n_critics value.
- Returns
predicted action-values
- Return type
Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]
- sample_action(x)¶
Returns sampled actions.
The sampled actions are identical to the output of predict method if the policy is deterministic.
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observations.
- Returns
sampled actions.
- Return type
np.ndarray
- save_model(fname)¶
Saves neural network parameters.
algo.save_model('model.pt')
- save_params(logger)¶
Saves configurations as params.json.
- Parameters
logger (d3rlpy.logger.D3RLPyLogger) – logger object.
- Return type
None
- save_policy(fname, as_onnx=False)¶
Save the greedy-policy computational graph as TorchScript or ONNX.
# save as TorchScript algo.save_policy('policy.pt') # save as ONNX algo.save_policy('policy.onnx', as_onnx=True)
The artifacts saved with this method will work without d3rlpy. This method is especially useful to deploy the learned policy to production environments or embedded systems.
See also
https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html (for Python).
https://pytorch.org/tutorials/advanced/cpp_export.html (for C++).
https://onnx.ai (for ONNX)
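As a sketch, the exported TorchScript file can then be loaded with plain PyTorch (the observation shape (10,) is hypothetical):

import torch

# load the greedy policy without any d3rlpy dependency
policy = torch.jit.load('policy.pt')

with torch.no_grad():
    observation = torch.rand(1, 10)
    action = policy(observation)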
- set_active_logger(logger)¶
Sets the active D3RLPyLogger object.
- Parameters
logger (d3rlpy.logger.D3RLPyLogger) – logger object.
- Return type
None
- set_grad_step(grad_step)¶
Set total gradient step counter.
This method can be used to resume training from an arbitrary gradient step counter, which affects periodic functions such as the target update.
- Parameters
grad_step (int) – total gradient step counter.
- Return type
None
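A sketch of resuming from a saved checkpoint (the checkpoint path and step count are hypothetical):

algo.build_with_dataset(dataset)
algo.load_model('d3rlpy_logs/run/model_50000.pt')  # hypothetical checkpoint
algo.set_grad_step(50000)  # keep periodic updates (e.g. target sync) aligned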
- set_params(**params)¶
Sets the given arguments to the attributes if they exist.
This method sets the given values to the attributes, including ones in subclasses. Values that don't exist as attributes are ignored. Some scikit-learn utilities use this method.
algo.set_params(batch_size=100)
- Parameters
params (Any) – arbitrary inputs to set as attributes.
- Returns
itself.
- Return type
d3rlpy.base.LearnableBase
- update(batch)¶
Updates parameters with a mini-batch of data.
- Parameters
batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.
- Returns
dictionary of metrics.
- Return type
Dict[str, float]
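A sketch of a hand-rolled training step built from a dataset's transitions (the batch construction is deliberately naive; in practice fit handles sampling and shuffling):

from d3rlpy.dataset import TransitionMiniBatch

td3.build_with_dataset(dataset)

# flatten episodes into transitions and take one mini-batch
transitions = [t for episode in dataset.episodes for t in episode.transitions]
batch = TransitionMiniBatch(transitions[:100])
metrics = td3.update(batch)  # e.g. {'critic_loss': ..., 'actor_loss': ...}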
Attributes
- action_scaler¶
Preprocessing action scaler.
- Returns
preprocessing action scaler.
- Return type
Optional[ActionScaler]
- active_logger¶
Active D3RLPyLogger object.
This is only available during training.
- Returns
logger object.
- grad_step¶
Total gradient step counter.
This value will keep counting after fit and fit_online methods finish.
- Returns
total gradient step counter.
- impl¶
Implementation object.
- Returns
implementation object.
- Return type
Optional[ImplBase]
- n_frames¶
Number of frames to stack.
This is only for image observation.
- Returns
number of frames to stack.
- Return type
int
- observation_shape¶
Observation shape.
- Returns
observation shape.
- Return type
Optional[Sequence[int]]
- reward_scaler¶
Preprocessing reward scaler.
- Returns
preprocessing reward scaler.
- Return type
Optional[RewardScaler]
- scaler¶
Preprocessing scaler.
- Returns
preprocessing scaler.
- Return type
Optional[Scaler]