d3rlpy.algos.TD3¶
class d3rlpy.algos.TD3(actor_learning_rate=0.0003, critic_learning_rate=0.0003, batch_size=100, n_frames=1, gamma=0.99, tau=0.005, reguralizing_rate=0.0, n_critics=2, bootstrap=False, share_encoder=False, target_smoothing_sigma=0.2, target_smoothing_clip=0.5, update_actor_interval=2, eps=1e-08, use_batch_norm=False, q_func_type='mean', n_epochs=1000, use_gpu=False, scaler=None, augmentation=[], n_augmentations=1, encoder_params={}, dynamics=None, impl=None, **kwargs)[source]¶
Twin Delayed Deep Deterministic Policy Gradients algorithm.
TD3 is an improved DDPG-based algorithm. The major differences from DDPG are as follows.
- TD3 has twin Q functions to reduce overestimation bias in TD learning. The number of Q functions can be designated by n_critics.
- TD3 adds clipped noise to the target value estimation to avoid overfitting to the deterministic policy.
- TD3 updates the policy function only after several Q function updates in order to reduce the variance of action-value estimates. The interval of the policy function updates can be designated by update_actor_interval.
\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} \big[ (r_{t+1} + \gamma \min_j Q_{\theta_j'}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) - Q_{\theta_i}(s_t, a_t))^2 \big]\]
\[J(\phi) = \mathbb{E}_{s_t \sim D} \big[ \min_i Q_{\theta_i}(s_t, \pi_\phi(s_t)) \big]\]
where \(\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)\).
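To make the first equation concrete, here is a short PyTorch-style sketch of the smoothed target computation; compute_td3_target and the network callables are hypothetical names for illustration, not part of the d3rlpy API.
import torch

def compute_td3_target(reward, next_obs, policy_target, q1_target, q2_target,
                       gamma=0.99, sigma=0.2, c=0.5):
    """Sketch of the smoothed TD3 target from the equation above."""
    action = policy_target(next_obs)
    # target policy smoothing: epsilon ~ clip(N(0, sigma), -c, c)
    noise = (torch.randn_like(action) * sigma).clamp(-c, c)
    smoothed_action = (action + noise).clamp(-1.0, 1.0)
    # clipped double Q-learning: minimum over the twin critics
    target_q = torch.min(q1_target(next_obs, smoothed_action),
                         q2_target(next_obs, smoothed_action))
    return reward + gamma * target_q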
References
- Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods. https://arxiv.org/abs/1802.09477
Parameters: - actor_learning_rate (float) – learning rate for a policy function.
- critic_learning_rate (float) – learning rate for Q functions.
- batch_size (int) – mini-batch size.
- n_frames (int) – the number of frames to stack for image observation.
- gamma (float) – discount factor.
- tau (float) – target network synchronization coefficient.
- reguralizing_rate (float) – regularization term for the policy function.
- n_critics (int) – the number of Q functions for ensemble.
- bootstrap (bool) – flag to bootstrap Q functions.
- share_encoder (bool) – flag to share encoder network.
- target_smoothing_sigma (float) – standard deviation for target noise.
- target_smoothing_clip (float) – clipping range for target noise.
- update_actor_interval (int) – interval to update policy function described as delayed policy update in the paper.
- eps (float) – \(\epsilon\) for Adam optimizer.
- use_batch_norm (bool) – flag to insert batch normalization layers.
- q_func_type (str) – type of Q function. Available options are [‘mean’, ‘qr’, ‘iqn’, ‘fqf’].
- n_epochs (int) – the number of epochs to train.
- use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
- scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’].
- augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
- n_augmentations (int) – the number of data augmentations to update.
- encoder_params (dict) – optional arguments for encoder setup. If the observation is pixel, you can pass filters with a list of tuples consisting of (filter_size, kernel_size, stride) and feature_size with an integer scalar for the last linear layer size. If the observation is vector, you can pass hidden_units with a list of hidden unit sizes.
- dynamics (d3rlpy.dynamics.base.DynamicsBase) – dynamics model for data augmentation.
- impl (d3rlpy.algos.torch.td3_impl.TD3Impl) – algorithm implementation.
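As an illustration of encoder_params, a hypothetical pixel-encoder setup could look like this; the filter and feature sizes are example values, not defaults.
from d3rlpy.algos import TD3

# hypothetical encoder configuration for pixel observations:
# tuples of (filter_size, kernel_size, stride) plus the final linear layer size
algo = TD3(n_frames=4,
           scaler='pixel',
           encoder_params={
               'filters': [(32, 8, 4), (64, 4, 2), (64, 3, 1)],
               'feature_size': 512,
           })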
-
share_encoder
¶ flag to share encoder network.
Type: bool
-
update_actor_interval
¶ interval to update policy function described as delayed policy update in the paper.
Type: int
-
use_gpu
¶ GPU device.
Type: d3rlpy.gpu.Device
-
scaler
¶ preprocessor.
Type: d3rlpy.preprocessing.Scaler
-
augmentation
¶ augmentation pipeline.
Type: d3rlpy.augmentation.AugmentationPipeline
-
dynamics
¶ dynamics model.
Type: d3rlpy.dynamics.base.DynamicsBase
-
impl
¶ algorithm implementation.
Type: d3rlpy.algos.torch.td3_impl.TD3Impl
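Before the method reference, here is a minimal usage sketch. The dataset below is random toy data purely for illustration; any d3rlpy.dataset.MDPDataset works the same way.
import numpy as np
from d3rlpy.algos import TD3
from d3rlpy.dataset import MDPDataset

# toy continuous-control dataset (random values, purely illustrative)
observations = np.random.random((1000, 10)).astype('f4')
actions = np.random.random((1000, 2)).astype('f4')
rewards = np.random.random(1000).astype('f4')
terminals = np.random.randint(2, size=1000)
dataset = MDPDataset(observations, actions, rewards, terminals)

# twin critics and delayed policy updates as described above
td3 = TD3(n_critics=2, update_actor_interval=2, n_epochs=10)
td3.fit(dataset.episodes)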
Methods
-
create_impl
(observation_shape, action_size)[source]¶ Instantiate implementation objects with the dataset shapes.
This method is used internally when the fit method is called.
Parameters: - observation_shape (tuple) – observation shape.
- action_size (int) – dimension of action-space.
-
fit
(episodes, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None)¶ Trains with the given dataset.
algo.fit(episodes)
Parameters: - episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
- experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
- with_timestamp (bool) – flag to add timestamp string to the last of directory name.
- logdir (str) – root directory name to save logs.
- verbose (bool) – flag to show logged information on stdout.
- show_progress (bool) – flag to show progress bar for iterations.
- tensorboard (bool) – flag to save logged information in tensorboard (in addition to the CSV data).
- eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
- save_interval (int) – interval to save parameters.
- scorers (list(callable)) – list of scorer functions used with eval_episodes.
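For example, a common pattern is to hold out some episodes for evaluation; the 80/20 split below is illustrative.
# illustrative train/test split over a dataset's episodes
train_episodes = dataset.episodes[:80]
test_episodes = dataset.episodes[80:]

algo.fit(train_episodes,
         experiment_name='td3_example',
         eval_episodes=test_episodes,
         save_interval=10)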
-
classmethod
from_json
(fname, use_gpu=False)¶ Returns algorithm configured with json file.
The JSON file should be the one saved during fitting.
from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters: - fname (str) – file path to params.json.
- use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
Returns: algorithm.
Return type: d3rlpy.base.LearnableBase
-
get_params
(deep=True)¶ Returns all attributes.
This method returns all attributes, including ones in subclasses. Some scikit-learn utilities rely on this method.
params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters: deep (bool) – flag to deeply copy objects such as impl.
Returns: attribute values in dictionary.
Return type: dict
-
load_model
(fname)¶ Load neural network parameters.
algo.load_model('model.pt')
Parameters: fname (str) – source file path.
-
predict
(x)¶ Returns greedy actions.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.predict(x)

# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters: x (numpy.ndarray) – observations
Returns: greedy actions
Return type: numpy.ndarray
-
predict_value
(x, action, with_std=False)¶ Returns predicted action-values.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape == (100,)
Parameters: - x (numpy.ndarray) – observations
- action (numpy.ndarray) – actions
- with_std (bool) – flag to return standard deviation of ensemble estimation. This deviation reflects uncertainty for the given observations. This uncertainty will be more accurate if you enable bootstrap flag and increase n_critics value.
Returns: predicted action-values
Return type: numpy.ndarray
-
sample_action
(x)¶ Returns sampled actions.
The sampled actions are identical to the output of predict method if the policy is deterministic.
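A short sketch mirroring the predict example above:
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.sample_action(x)

# for TD3's deterministic policy, this matches algo.predict(x)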
Parameters: x (numpy.ndarray) – observations.
Returns: sampled actions.
Return type: numpy.ndarray
-
save_model
(fname)¶ Saves neural network parameters.
algo.save_model('model.pt')
Parameters: fname (str) – destination file path.
-
save_policy
(fname, as_onnx=False)¶ Save the greedy-policy computational graph as TorchScript or ONNX.
# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)
The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.
See also
- https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html (for Python).
- https://pytorch.org/tutorials/advanced/cpp_export.html (for C++).
- https://onnx.ai (for ONNX)
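As a deployment sketch, an exported TorchScript policy can be loaded with plain PyTorch; the observation shape below is hypothetical.
import torch

# load the exported greedy policy without any d3rlpy dependency
policy = torch.jit.load('policy.pt')

with torch.no_grad():
    observation = torch.rand(1, 10)  # hypothetical observation batch
    action = policy(observation)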
Parameters: - fname (str) – destination file path.
- as_onnx (bool) – flag to save as ONNX format.
-
set_params
(**params)¶ Sets the given arguments to the attributes if they exist.
This method sets the given values to attributes, including ones in subclasses. If a passed value does not exist as an attribute, it is ignored. Some scikit-learn utilities rely on this method.
algo.set_params(n_epochs=10, batch_size=100)
Parameters: **params – arbitrary inputs to set as attributes.
Returns: itself.
Return type: d3rlpy.algos.base.AlgoBase