d3rlpy.algos.TD3¶
class d3rlpy.algos.TD3(actor_learning_rate=0.0003, critic_learning_rate=0.0003, batch_size=100, n_frames=1, gamma=0.99, tau=0.005, reguralizing_rate=0.0, n_critics=2, bootstrap=False, share_encoder=False, target_smoothing_sigma=0.2, target_smoothing_clip=0.5, update_actor_interval=2, eps=1e-08, use_batch_norm=False, q_func_type='mean', n_epochs=1000, use_gpu=False, scaler=None, augmentation=[], n_augmentations=1, encoder_params={}, dynamics=None, impl=None, **kwargs)[source]¶
Twin Delayed Deep Deterministic Policy Gradients algorithm.
TD3 is an improved DDPG-based algorithm. The major differences from DDPG are as follows.
- TD3 has twin Q functions to reduce overestimation bias in TD learning. The number of Q functions can be designated by n_critics.
- TD3 adds clipped noise to the target value estimation to avoid overfitting to the deterministic policy.
- TD3 updates the policy function only after several Q function updates in order to reduce the variance of action-value estimates. The interval of the policy function updates can be designated by update_actor_interval.
\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} \big[ (r_{t+1} + \gamma \min_j Q_{\theta_j'}(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \epsilon) - Q_{\theta_i}(s_t, a_t))^2 \big]\]
\[J(\phi) = \mathbb{E}_{s_t \sim D} \big[ \min_i Q_{\theta_i}(s_t, \pi_\phi(s_t)) \big]\]
where \(\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)\).
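To make the first equation concrete, here is a short PyTorch-style sketch of the smoothed target computation; compute_td3_target and the network callables are hypothetical names for illustration, not part of the d3rlpy API.
import torch

def compute_td3_target(reward, next_obs, policy_target, q1_target, q2_target,
                       gamma=0.99, sigma=0.2, c=0.5):
    """Sketch of the smoothed TD3 target from the equation above."""
    action = policy_target(next_obs)
    # target policy smoothing: epsilon ~ clip(N(0, sigma), -c, c)
    noise = (torch.randn_like(action) * sigma).clamp(-c, c)
    smoothed_action = (action + noise).clamp(-1.0, 1.0)
    # clipped double Q-learning: minimum over the twin critics
    target_q = torch.min(q1_target(next_obs, smoothed_action),
                         q2_target(next_obs, smoothed_action))
    return reward + gamma * target_q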
References
- Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods. https://arxiv.org/abs/1802.09477
Parameters: - actor_learning_rate (float) – learning rate for a policy function.
- critic_learning_rate (float) – learning rate for Q functions.
- batch_size (int) – mini-batch size.
- n_frames (int) – the number of frames to stack for image observation.
- gamma (float) – discount factor.
- tau (float) – target network synchronization coefficient.
- reguralizing_rate (float) – regularization term for the policy function.
- n_critics (int) – the number of Q functions for ensemble.
- bootstrap (bool) – flag to bootstrap Q functions.
- share_encoder (bool) – flag to share encoder network.
- target_smoothing_sigma (float) – standard deviation for target noise.
- target_smoothing_clip (float) – clipping range for target noise.
- update_actor_interval (int) – interval to update policy function described as delayed policy update in the paper.
- eps (float) – \(\epsilon\) for Adam optimizer.
- use_batch_norm (bool) – flag to insert batch normalization layers.
- q_func_type (str) – type of Q function. Available options are [‘mean’, ‘qr’, ‘iqn’, ‘fqf’].
- n_epochs (int) – the number of epochs to train.
- use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
- scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’].
- augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
- n_augmentations (int) – the number of data augmentations to update.
- encoder_params (dict) – optional arguments for encoder setup. If the observation is pixel, you can pass filters with a list of tuples consisting of (filter_size, kernel_size, stride) and feature_size with an integer scalar for the last linear layer size. If the observation is vector, you can pass hidden_units with a list of hidden unit sizes.
- dynamics (d3rlpy.dynamics.base.DynamicsBase) – dynamics model for data augmentation.
- impl (d3rlpy.algos.torch.td3_impl.TD3Impl) – algorithm implementation.
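As an illustration of encoder_params, a hypothetical pixel-encoder setup could look like this; the filter and feature sizes are example values, not defaults.
from d3rlpy.algos import TD3

# hypothetical encoder configuration for pixel observations:
# tuples of (filter_size, kernel_size, stride) plus the final linear layer size
algo = TD3(n_frames=4,
           scaler='pixel',
           encoder_params={
               'filters': [(32, 8, 4), (64, 4, 2), (64, 3, 1)],
               'feature_size': 512,
           })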
-
share_encoder
¶ flag to share encoder network.
Type: bool
-
update_actor_interval
¶ interval to update policy function described as delayed policy update in the paper.
Type: int
-
use_gpu
¶ GPU device.
Type: d3rlpy.gpu.Device
-
scaler
¶ preprocessor.
Type: d3rlpy.preprocessing.Scaler
-
augmentation
¶ augmentation pipeline.
Type: d3rlpy.augmentation.AugmentationPipeline
-
dynamics
¶ dynamics model.
Type: d3rlpy.dynamics.base.DynamicsBase
-
impl
¶ algorithm implementation.
Type: d3rlpy.algos.torch.td3_impl.TD3Impl
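Before the method reference, here is a minimal usage sketch. The dataset below is random toy data purely for illustration; any d3rlpy.dataset.MDPDataset works the same way.
import numpy as np
from d3rlpy.algos import TD3
from d3rlpy.dataset import MDPDataset

# toy continuous-control dataset (random values, purely illustrative)
observations = np.random.random((1000, 10)).astype('f4')
actions = np.random.random((1000, 2)).astype('f4')
rewards = np.random.random(1000).astype('f4')
terminals = np.random.randint(2, size=1000)
dataset = MDPDataset(observations, actions, rewards, terminals)

# twin critics and delayed policy updates as described above
td3 = TD3(n_critics=2, update_actor_interval=2, n_epochs=10)
td3.fit(dataset.episodes)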
Methods
-
create_impl
(observation_shape, action_size)[source]¶ Instantiate implementation objects with the dataset shapes.
This method is used internally when the fit method is called.
Parameters: - observation_shape (tuple) – observation shape.
- action_size (int) – dimension of action-space.
-
fit
(episodes, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None)¶ Trains with the given dataset.
algo.fit(episodes)
Parameters: - episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
- experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
- with_timestamp (bool) – flag to add timestamp string to the last of directory name.
- logdir (str) – root directory name to save logs.
- verbose (bool) – flag to show logged information on stdout.
- show_progress (bool) – flag to show progress bar for iterations.
- tensorboard (bool) – flag to save logged information in tensorboard (in addition to the CSV data).
- eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
- save_interval (int) – interval to save parameters.
- scorers (list(callable)) – list of scorer functions used with eval_episodes.
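For example, a common pattern is to hold out some episodes for evaluation; the 80/20 split below is illustrative.
# illustrative train/test split over a dataset's episodes
train_episodes = dataset.episodes[:80]
test_episodes = dataset.episodes[80:]

algo.fit(train_episodes,
         experiment_name='td3_example',
         eval_episodes=test_episodes,
         save_interval=10)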
-
classmethod
from_json
(fname, use_gpu=False)¶ Returns algorithm configured with json file.
The JSON file should be the one saved during fitting.
from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters: - fname (str) – file path to params.json.
- use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
Returns: algorithm.
Return type: d3rlpy.base.LearnableBase
-
get_params
(deep=True)¶ Returns all attributes.
This method returns all attributes, including ones in subclasses. Some scikit-learn utilities rely on this method.
params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters: deep (bool) – flag to deeply copy objects such as impl.
Returns: attribute values in dictionary.
Return type: dict
-
load_model
(fname)¶ Load neural network parameters.
algo.load_model('model.pt')
Parameters: fname (str) – source file path.
-
predict
(x)¶ Returns greedy actions.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.predict(x)

# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters: x (numpy.ndarray) – observations
Returns: greedy actions
Return type: numpy.ndarray
-
predict_value
(x, action, with_std=False)¶ Returns predicted action-values.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape == (100,)
Parameters: - x (numpy.ndarray) – observations
- action (numpy.ndarray) – actions
- with_std (bool) – flag to return standard deviation of ensemble estimation. This deviation reflects uncertainty for the given observations. This uncertainty will be more accurate if you enable bootstrap flag and increase n_critics value.
Returns: predicted action-values
Return type: numpy.ndarray
-
sample_action
(x)¶ Returns sampled actions.
The sampled actions are identical to the output of predict method if the policy is deterministic.
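A short sketch mirroring the predict example above:
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.sample_action(x)

# for TD3's deterministic policy, this matches algo.predict(x)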
Parameters: x (numpy.ndarray) – observations.
Returns: sampled actions.
Return type: numpy.ndarray
-
save_model
(fname)¶ Saves neural network parameters.
algo.save_model('model.pt')
Parameters: fname (str) – destination file path.
-
save_policy
(fname, as_onnx=False)¶ Save the greedy-policy computational graph as TorchScript or ONNX.
# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)
The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.
See also
- https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html (for Python).
- https://pytorch.org/tutorials/advanced/cpp_export.html (for C++).
- https://onnx.ai (for ONNX)
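As a deployment sketch, an exported TorchScript policy can be loaded with plain PyTorch; the observation shape below is hypothetical.
import torch

# load the exported greedy policy without any d3rlpy dependency
policy = torch.jit.load('policy.pt')

with torch.no_grad():
    observation = torch.rand(1, 10)  # hypothetical observation batch
    action = policy(observation)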
Parameters: - fname (str) – destination file path.
- as_onnx (bool) – flag to save as ONNX format.
-
set_params
(**params)¶ Sets the given arguments to the attributes if they exist.
This method sets the given values to attributes, including ones in subclasses. If a passed value does not exist as an attribute, it is ignored. Some scikit-learn utilities rely on this method.
algo.set_params(n_epochs=10, batch_size=100)
Parameters: **params – arbitrary inputs to set as attributes.
Returns: itself.
Return type: d3rlpy.algos.base.AlgoBase