d3rlpy.dynamics.mopo.MOPO

class d3rlpy.dynamics.mopo.MOPO(n_epochs=30, batch_size=100, n_frames=1, learning_rate=0.001, eps=1e-08, weight_decay=0.0001, n_ensembles=5, n_transitions=400, horizon=5, lam=1.0, use_batch_norm=False, discrete_action=False, scaler=None, augmentation=[], use_gpu=False, impl=None, **kwargs)

Model-based Offline Policy Optimization.

MOPO is a model-based RL approach for offline policy optimization. MOPO leverages a probabilistic ensemble dynamics model to generate new dynamics data with uncertainty penalties.

The ensemble dynamics model consists of \(N\) probabilistic models \(\{T_{\theta_i}\}_{i=1}^N\). At each epoch, new transitions are generated via a randomly picked dynamics model \(T_\theta\).

\[s_{t+1}, r_{t+1} \sim T_\theta(s_t, a_t)\]

where \(s_t \sim D\) for the first step, otherwise \(s_t\) is the previously generated observation, and \(a_t \sim \pi(\cdot|s_t)\). The generated \(r_{t+1}\) would be far from the ground truth if the actions sampled from the policy function are out-of-distribution. Thus, the uncertainty penalty regularizes this bias.

\[\tilde{r_{t+1}} = r_{t+1} - \lambda \max_{i=1}^N || \Sigma_i (s_t, a_t) ||\]

where \(\Sigma_i(s_t, a_t)\) is the variance estimated by the \(i\)-th model.
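The penalty can be illustrated with a minimal sketch in plain Python. The function name `penalized_reward` and the list-of-vectors representation of the ensemble variances are assumptions made for illustration; they are not part of the d3rlpy API, and the Euclidean norm is assumed for \(||\cdot||\).

```python
import math


def penalized_reward(reward, ensemble_variances, lam=1.0):
    """Return r - lam * max_i ||Sigma_i|| for a single (s_t, a_t) pair.

    ensemble_variances holds one variance vector per ensemble member
    (an assumed representation, not d3rlpy's internal format).
    """
    # Euclidean norm of each member's variance vector
    norms = [math.sqrt(sum(v * v for v in var)) for var in ensemble_variances]
    # the most uncertain ensemble member determines the penalty
    return reward - lam * max(norms)


# three ensemble members with variance norms of roughly 0.5, 0.1 and 1.0
variances = [[0.3, 0.4], [0.0, 0.1], [0.6, 0.8]]
print(penalized_reward(2.0, variances, lam=1.0))
```

A larger `lam` makes the generated rewards more pessimistic wherever the ensemble members report high variance.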

Finally, the generated transitions \((s_t, a_t, \tilde{r_{t+1}}, s_{t+1})\) are appended to dataset \(D\).

The generation process starts from n_transitions randomly sampled transitions and rolls each of them out for horizon steps.
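The rollout procedure described above can be sketched as follows. This is a simplified illustration, not the library's implementation: `generate_rollouts`, `policy`, `models` and `penalty` are hypothetical names, and picking an ensemble member at every step is an assumption.

```python
import random


def generate_rollouts(start_states, policy, models, horizon, penalty,
                      lam=1.0, rng=random):
    """Roll out `horizon` steps from each start state with an ensemble.

    policy(s) -> a, model(s, a) -> (next_state, reward) and
    penalty(s, a) -> uncertainty are hypothetical callables standing in
    for the learned components.
    """
    generated = []
    for state in start_states:
        for _ in range(horizon):
            action = policy(state)
            model = rng.choice(models)  # randomly picked ensemble member
            next_state, reward = model(state, action)
            reward -= lam * penalty(state, action)  # uncertainty penalty
            generated.append((state, action, reward, next_state))
            state = next_state  # continue from the generated observation
    return generated


# toy ensemble: both members add the action to the state and return reward 1
models = [lambda s, a: (s + a, 1.0), lambda s, a: (s + a, 1.0)]
rollouts = generate_rollouts([0.0, 10.0], lambda s: 1.0, models,
                             horizon=3, penalty=lambda s, a: 0.25)
print(len(rollouts))
```

With two start states and a horizon of 3, the sketch yields six transitions; in MOPO terms the number of start states corresponds to n_transitions.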

Note

Currently, MOPO only supports vector observations.

Parameters:
  • n_epochs (int) – the number of epochs to train.
  • batch_size (int) – mini-batch size.
  • n_frames (int) – the number of frames to stack for image observation.
  • learning_rate (float) – learning rate for dynamics model.
  • eps (float) – \(\epsilon\) for Adam optimizer.
  • weight_decay (float) – weight decay rate.
  • n_ensembles (int) – the number of dynamics models in the ensemble.
  • n_transitions (int) – the number of parallel trajectories to generate.
  • horizon (int) – the number of steps to generate.
  • lam (float) – \(\lambda\) for uncertainty penalties.
  • use_batch_norm (bool) – flag to insert batch normalization layers.
  • discrete_action (bool) – flag to take discrete actions.
  • scaler (d3rlpy.preprocessing.scalers.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’].
  • augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
  • use_gpu (bool or d3rlpy.gpu.Device) – flag to use GPU or device.
  • impl (d3rlpy.dynamics.base.DynamicsImplBase) – dynamics implementation.
n_epochs

the number of epochs to train.

Type:int
batch_size

mini-batch size.

Type:int
n_frames

the number of frames to stack for image observation.

Type:int
learning_rate

learning rate for dynamics model.

Type:float
eps

\(\epsilon\) for Adam optimizer.

Type:float
weight_decay

weight decay rate.

Type:float
n_ensembles

the number of dynamics models in the ensemble.

Type:int
n_transitions

the number of parallel trajectories to generate.

Type:int
horizon

the number of steps to generate.

Type:int
lam

\(\lambda\) for uncertainty penalties.

Type:float
use_batch_norm

flag to insert batch normalization layers.

Type:bool
discrete_action

flag to take discrete actions.

Type:bool
scaler

preprocessor.

Type:d3rlpy.preprocessing.scalers.Scaler
augmentation

augmentation pipeline.

Type:d3rlpy.augmentation.AugmentationPipeline
use_gpu

flag to use GPU or device.

Type:d3rlpy.gpu.Device
impl

dynamics implementation.

Type:d3rlpy.dynamics.base.DynamicsImplBase
eval_results_

evaluation results.

Type:dict

Methods

create_impl(observation_shape, action_size)

Instantiate implementation objects with the dataset shapes.

This method is used internally when the fit method is called.

Parameters:
  • observation_shape (tuple) – observation shape.
  • action_size (int) – dimension of action-space.
fit(episodes, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None)

Trains with the given dataset.

algo.fit(episodes)
Parameters:
  • episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
  • experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
  • with_timestamp (bool) – flag to append a timestamp string to the end of the directory name.
  • logdir (str) – root directory name to save logs.
  • verbose (bool) – flag to show logged information on stdout.
  • show_progress (bool) – flag to show progress bar for iterations.
  • tensorboard (bool) – flag to save logged information in TensorBoard (in addition to the CSV data).
  • eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
  • save_interval (int) – interval to save parameters.
  • scorers (list(callable)) – list of scorer functions used with eval_episodes.
classmethod from_json(fname, use_gpu=False)

Returns an algorithm configured with a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters:
  • fname (str) – file path to params.json.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
Returns:

algorithm.

Return type:

d3rlpy.base.LearnableBase

generate(algo, transitions)

Returns new transitions for data augmentation.

Parameters:
Returns:

list of generated transitions.

Return type:

list(d3rlpy.dataset.Transition)

get_params(deep=True)

Returns all attributes.

This method returns all attributes, including ones in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters:deep (bool) – flag to deeply copy objects such as impl.
Returns:attribute values in dictionary.
Return type:dict
load_model(fname)

Loads neural network parameters.

algo.load_model('model.pt')
Parameters:fname (str) – source file path.
predict(x, action, with_variance=False)

Returns predicted observation and reward.

Parameters:
Returns:

tuple of predicted observation and reward.

Return type:

tuple

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters:fname (str) – destination file path.
set_params(**params)

Sets the given arguments to the attributes if they exist.

This method sets the given values to the attributes, including ones in subclasses. Values that do not exist as attributes are ignored. Some scikit-learn utilities use this method.

algo.set_params(n_epochs=10, batch_size=100)
Parameters:**params – arbitrary inputs to set as attributes.
Returns:itself.
Return type:d3rlpy.algos.base.AlgoBase
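The get_params and set_params semantics described above follow the scikit-learn convention. The toy class below is a hypothetical stand-in, not a d3rlpy class; it only sketches the documented behaviour, assuming that unknown keys are skipped silently and that set_params returns the object itself.

```python
class ToyEstimator:
    """Hypothetical stand-in used for illustration; not part of d3rlpy."""

    def __init__(self, n_epochs=30, batch_size=100):
        self.n_epochs = n_epochs
        self.batch_size = batch_size

    def get_params(self):
        # return a copy of all attributes as a dictionary
        return dict(vars(self))

    def set_params(self, **params):
        for key, value in params.items():
            if hasattr(self, key):  # keys that are not attributes are ignored
                setattr(self, key, value)
        return self  # returning itself allows call chaining


toy = ToyEstimator().set_params(n_epochs=10, not_an_attribute=1)
# the returned dictionary can be used to instantiate a new object
clone = ToyEstimator(**toy.get_params())
print(clone.get_params())
```

Returning the object from set_params is what lets utilities clone and reconfigure estimators in a single expression.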
update(epoch, total_step, batch)

Update parameters with mini-batch of data.

Parameters:
Returns:

loss values.

Return type:

list