d3rlpy.algos.TD3¶
- class d3rlpy.algos.TD3(*, actor_learning_rate=0.0003, critic_learning_rate=0.0003, actor_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), critic_optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), actor_encoder_factory='default', critic_encoder_factory='default', q_func_factory='mean', batch_size=100, n_frames=1, n_steps=1, gamma=0.99, tau=0.005, n_critics=2, target_reduction_type='min', target_smoothing_sigma=0.2, target_smoothing_clip=0.5, update_actor_interval=2, use_gpu=False, scaler=None, action_scaler=None, reward_scaler=None, impl=None, **kwargs)[source]¶
Twin Delayed Deep Deterministic Policy Gradients algorithm.
TD3 is an improved DDPG-based algorithm. Major differences from DDPG are as follows.
TD3 has twin Q functions to reduce overestimation bias in TD learning. The number of Q functions can be designated by n_critics.
TD3 adds noise to target value estimation to avoid overfitting with the deterministic policy.
TD3 updates the policy function after several Q function updates in order to reduce variance of action-value estimation. The interval of the policy function update can be designated by update_actor_interval.
\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma \min_j Q_{\theta_j'}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) - Q_{\theta_i}(s_t, a_t))^2]\]
\[J(\phi) = \mathbb{E}_{s_t \sim D} [\min_i Q_{\theta_i}(s_t, \pi_\phi(s_t))]\]
where \(\epsilon \sim \mathrm{clip}(N(0, \sigma), -c, c)\).
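A minimal usage sketch with the bundled pendulum dataset (get_pendulum is one of d3rlpy's built-in dataset helpers; any continuous-action MDPDataset works):

import d3rlpy

# load a small continuous-control dataset and its environment
dataset, env = d3rlpy.datasets.get_pendulum()

# the three TD3-specific knobs described above
td3 = d3rlpy.algos.TD3(
    n_critics=2,                 # twin Q functions
    target_smoothing_sigma=0.2,  # noise added to target actions
    update_actor_interval=2,     # delayed policy updates
)
td3.fit(dataset, n_steps=100000)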
References
Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods. https://arxiv.org/abs/1802.09477
- Parameters
actor_learning_rate (float) – learning rate for a policy function.
critic_learning_rate (float) – learning rate for Q functions.
actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory for the actor.
critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory for the critic.
actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory for the actor.
critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory for the critic.
q_func_factory (d3rlpy.models.q_functions.QFunctionFactory or str) – Q function factory.
batch_size (int) – mini-batch size.
n_frames (int) – the number of frames to stack for image observation.
n_steps (int) – N-step TD calculation.
gamma (float) – discount factor.
tau (float) – target network synchronization coefficient.
n_critics (int) – the number of Q functions for ensemble.
target_reduction_type (str) – ensemble reduction method at target value estimation. The available options are ['min', 'max', 'mean', 'mix', 'none'].
target_smoothing_sigma (float) – standard deviation for target noise.
target_smoothing_clip (float) – clipping range for target noise.
update_actor_interval (int) – interval to update policy function described as delayed policy update in the paper.
use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are ['pixel', 'min_max', 'standard'].
action_scaler (d3rlpy.preprocessing.ActionScaler or str) – action preprocessor. The available options are ['min_max'].
reward_scaler (d3rlpy.preprocessing.RewardScaler or str) – reward preprocessor. The available options are ['clip', 'min_max', 'standard'].
impl (d3rlpy.algos.torch.td3_impl.TD3Impl) – algorithm implementation.
kwargs (Any) –
Methods
- build_with_dataset(dataset)¶
Instantiate implementation object with MDPDataset object.
- Parameters
dataset (d3rlpy.dataset.MDPDataset) – dataset.
- Return type
None
- build_with_env(env)¶
Instantiate implementation object with OpenAI Gym object.
- Parameters
env (gym.core.Env) – gym-like environment.
- Return type
None
- collect(env, buffer=None, explorer=None, n_steps=1000000, show_progress=True, timelimit_aware=True)¶
Collects data via interaction with environment.
If buffer is not given, a ReplayBuffer will be internally created.
- Parameters
env (gym.core.Env) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_steps (int) – the number of total steps to collect.
show_progress (bool) – flag to show progress bar for iterations.
timelimit_aware (bool) – flag to turn the terminal flag False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
- Returns
replay buffer with the collected data.
- Return type
d3rlpy.online.buffers.Buffer
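Continuing the sketch above, a rough example of data collection with Gaussian exploration noise (the environment name, buffer size, and noise scale are placeholder choices):

import gym
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import NormalNoise

env = gym.make('Pendulum-v0')
buffer = ReplayBuffer(maxlen=100000, env=env)
explorer = NormalNoise(mean=0.0, std=0.1)

# interact with the environment and store transitions in the buffer
buffer = td3.collect(env, buffer=buffer, explorer=explorer, n_steps=10000)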
- copy_policy_from(algo)¶
Copies policy parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
- Parameters
algo (d3rlpy.algos.base.AlgoBase) – algorithm object.
- Return type
None
- copy_q_function_from(algo)¶
Copies Q-function parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
- Parameters
algo (d3rlpy.algos.base.AlgoBase) – algorithm object.
- Return type
None
- create_impl(observation_shape, action_size)¶
Instantiate implementation objects with the dataset shapes.
This method will be used internally when the fit method is called.
- Parameters
observation_shape (Sequence[int]) – observation shape.
action_size (int) – dimension of action-space.
- Return type
None
- fit(dataset, n_epochs=None, n_steps=None, n_steps_per_epoch=10000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, eval_episodes=None, save_interval=1, scorers=None, shuffle=True, callback=None)¶
Trains with the given dataset.
algo.fit(episodes, n_steps=1000000)
- Parameters
dataset (Union[List[d3rlpy.dataset.Episode], d3rlpy.dataset.MDPDataset]) – list of episodes to train.
n_epochs (Optional[int]) – the number of epochs to train.
n_steps (Optional[int]) – the number of steps to train.
n_steps_per_epoch (int) – the number of steps per epoch. This value will be ignored when n_steps is None.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
callback (Optional[Callable[[d3rlpy.base.LearnableBase, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called every step.
- Returns
list of result tuples (epoch, metrics) per epoch.
- Return type
List[Tuple[int, Dict[str, float]]]
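Continuing the sketch above, an offline training run with evaluation scorers (the train/test split point is arbitrary):

from d3rlpy.metrics.scorer import td_error_scorer
from d3rlpy.metrics.scorer import average_value_estimation_scorer

# hold out some episodes for evaluation
train_episodes = dataset.episodes[:100]
test_episodes = dataset.episodes[100:]

results = td3.fit(
    train_episodes,
    n_steps=100000,
    eval_episodes=test_episodes,
    scorers={
        'td_error': td_error_scorer,
        'value_scale': average_value_estimation_scorer,
    },
)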
- fit_batch_online(env, buffer=None, explorer=None, n_epochs=1000, n_steps_per_epoch=1000, n_updates_per_epoch=1000, eval_interval=10, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, timelimit_aware=True, callback=None)¶
Start training loop of batch online deep reinforcement learning.
- Parameters
env (d3rlpy.envs.batch.BatchEnv) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.BatchBuffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_epochs (int) – the number of epochs to train.
n_steps_per_epoch (int) – the number of steps per epoch.
n_updates_per_epoch (int) – the number of updates per epoch.
eval_interval (int) – the number of epochs before evaluation.
eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
save_interval (int) – the number of epochs before saving models.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
timelimit_aware (bool) – flag to turn the terminal flag False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
callback (Optional[Callable[[d3rlpy.online.iterators.AlgoProtocol, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called at the end of epochs.
- Return type
None
- fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, timelimit_aware=True, callback=None)¶
Start training loop of online deep reinforcement learning.
- Parameters
env (gym.core.Env) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_steps (int) – the number of total steps to train.
n_steps_per_epoch (int) – the number of steps per epoch.
update_interval (int) – the number of steps per update.
update_start_step (int) – the steps before starting updates.
eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
save_interval (int) – the number of epochs before saving models.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
timelimit_aware (bool) – flag to turn the terminal flag False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
callback (Optional[Callable[[d3rlpy.online.iterators.AlgoProtocol, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called at the end of epochs.
- Return type
None
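Continuing the sketch above, an online training run (the environment name, buffer size, and noise scale are placeholder choices):

import gym
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import NormalNoise

env = gym.make('Pendulum-v0')
eval_env = gym.make('Pendulum-v0')

td3.fit_online(
    env,
    buffer=ReplayBuffer(maxlen=100000, env=env),
    explorer=NormalNoise(mean=0.0, std=0.1),
    n_steps=100000,
    eval_env=eval_env,
)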
- fitter(dataset, n_epochs=None, n_steps=None, n_steps_per_epoch=10000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, eval_episodes=None, save_interval=1, scorers=None, shuffle=True, callback=None)¶
- Iterate over epochs and steps to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.
for epoch, metrics in algo.fitter(episodes):
    my_plot(metrics)
    algo.save_model(my_path)
- Parameters
dataset (Union[List[d3rlpy.dataset.Episode], d3rlpy.dataset.MDPDataset]) – list of episodes to train.
n_epochs (Optional[int]) – the number of epochs to train.
n_steps (Optional[int]) – the number of steps to train.
n_steps_per_epoch (int) – the number of steps per epoch. This value will be ignored when n_steps is None.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information in tensorboard (additional to the csv data). If None, the directory will not be created.
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
callback (Optional[Callable[[d3rlpy.base.LearnableBase, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called every step.
- Returns
iterator yielding current epoch and metrics dict.
- Return type
Generator[Tuple[int, Dict[str, float]], None, None]
- classmethod from_json(fname, use_gpu=False)¶
Returns algorithm configured with json file.
The JSON file should be the one saved during fitting.
from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
- generate_new_data(transitions)¶
Returns generated transitions for data augmentation.
This method is for model-based RL algorithms.
- Parameters
transitions (List[d3rlpy.dataset.Transition]) – list of transitions.
- Returns
list of new transitions.
- Return type
Optional[List[d3rlpy.dataset.Transition]]
- get_action_type()[source]¶
Returns action type (continuous or discrete).
- Returns
action type.
- Return type
d3rlpy.constants.ActionSpace
- get_params(deep=True)¶
Returns all attributes.
This method returns all attributes, including ones in subclasses. Some scikit-learn utilities use this method.
params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
- load_model(fname)¶
Loads neural network parameters.
algo.load_model('model.pt')
- predict(x)¶
Returns greedy actions.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.predict(x)

# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observations
- Returns
greedy actions
- Return type
np.ndarray
- predict_value(x, action, with_std=False)¶
Returns predicted action-values.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape == (100,)
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observations
action (Union[numpy.ndarray, List[Any]]) – actions
with_std (bool) – flag to return standard deviation of ensemble estimation. This deviation reflects uncertainty for the given observations. This uncertainty will be more accurate if you enable the bootstrap flag and increase the n_critics value.
- Returns
predicted action-values
- Return type
Union[numpy.ndarray, Tuple[numpy.ndarray, numpy.ndarray]]
- sample_action(x)¶
Returns sampled actions.
The sampled actions are identical to the output of predict method if the policy is deterministic.
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observations.
- Returns
sampled actions.
- Return type
np.ndarray
- save_model(fname)¶
Saves neural network parameters.
algo.save_model('model.pt')
- save_params(logger)¶
Saves configurations as params.json.
- Parameters
logger (d3rlpy.logger.D3RLPyLogger) – logger object.
- Return type
None
- save_policy(fname, as_onnx=False)¶
Save the greedy-policy computational graph as TorchScript or ONNX.
# save as TorchScript algo.save_policy('policy.pt') # save as ONNX algo.save_policy('policy.onnx', as_onnx=True)
The artifacts saved with this method will work without d3rlpy. This method is especially useful to deploy the learned policy to production environments or embedded systems.
See also
https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html (for Python).
https://pytorch.org/tutorials/advanced/cpp_export.html (for C++).
https://onnx.ai (for ONNX)
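As a sketch, the exported TorchScript file can then be loaded with plain PyTorch (the observation shape (10,) is hypothetical):

import torch

# load the greedy policy without any d3rlpy dependency
policy = torch.jit.load('policy.pt')

with torch.no_grad():
    observation = torch.rand(1, 10)
    action = policy(observation)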
- set_active_logger(logger)¶
Sets the active D3RLPyLogger object.
- Parameters
logger (d3rlpy.logger.D3RLPyLogger) – logger object.
- Return type
None
- set_grad_step(grad_step)¶
Set total gradient step counter.
This method can be used to resume training from an arbitrary gradient step counter, which affects periodic functions such as the target update.
- Parameters
grad_step (int) – total gradient step counter.
- Return type
None
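A sketch of resuming from a saved checkpoint (the checkpoint path and step count are hypothetical):

algo.build_with_dataset(dataset)
algo.load_model('d3rlpy_logs/run/model_50000.pt')  # hypothetical checkpoint
algo.set_grad_step(50000)  # keep periodic updates (e.g. target sync) aligned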
- set_params(**params)¶
Sets the given arguments to the attributes if they exist.
This method sets the given values to the attributes, including ones in subclasses. Values that don't exist as attributes are ignored. Some scikit-learn utilities use this method.
algo.set_params(batch_size=100)
- Parameters
params (Any) – arbitrary inputs to set as attributes.
- Returns
itself.
- Return type
d3rlpy.base.LearnableBase
- update(batch)¶
Updates parameters with a mini-batch of data.
- Parameters
batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.
- Returns
dictionary of metrics.
- Return type
Dict[str, float]
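A sketch of a hand-rolled training step built from a dataset's transitions (the batch construction is deliberately naive; in practice fit handles sampling and shuffling):

from d3rlpy.dataset import TransitionMiniBatch

td3.build_with_dataset(dataset)

# flatten episodes into transitions and take one mini-batch
transitions = [t for episode in dataset.episodes for t in episode.transitions]
batch = TransitionMiniBatch(transitions[:100])
metrics = td3.update(batch)  # e.g. {'critic_loss': ..., 'actor_loss': ...}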
Attributes
- action_scaler¶
Preprocessing action scaler.
- Returns
preprocessing action scaler.
- Return type
Optional[ActionScaler]
- active_logger¶
Active D3RLPyLogger object.
This is only available during training.
- Returns
logger object.
- grad_step¶
Total gradient step counter.
This value will keep counting after fit and fit_online methods finish.
- Returns
total gradient step counter.
- impl¶
Implementation object.
- Returns
implementation object.
- Return type
Optional[ImplBase]
- n_frames¶
Number of frames to stack.
This is only for image observation.
- Returns
number of frames to stack.
- Return type
int
- observation_shape¶
Observation shape.
- Returns
observation shape.
- Return type
Optional[Sequence[int]]
- reward_scaler¶
Preprocessing reward scaler.
- Returns
preprocessing reward scaler.
- Return type
Optional[RewardScaler]
- scaler¶
Preprocessing scaler.
- Returns
preprocessing scaler.
- Return type
Optional[Scaler]