d3rlpy.dynamics.mopo.MOPO
- class d3rlpy.dynamics.mopo.MOPO(*, learning_rate=0.001, optim_factory=<d3rlpy.models.optimizers.AdamFactory object>, encoder_factory='default', batch_size=100, n_frames=1, n_ensembles=5, n_transitions=400, horizon=5, lam=1.0, discrete_action=False, scaler=None, action_scaler=None, use_gpu=False, impl=None, **kwargs)
Model-based Offline Policy Optimization.
MOPO is a model-based RL approach for offline policy optimization. MOPO uses a probabilistic ensemble dynamics model to generate new transitions, and penalizes their rewards according to the model's uncertainty.
The ensemble dynamics model consists of \(N\) probabilistic models \(\{T_{\theta_i}\}_{i=1}^N\). At each epoch, new transitions are generated via a randomly picked dynamics model \(T_\theta\).
\[s_{t+1}, r_{t+1} \sim T_\theta(s_t, a_t)\]
where \(s_t \sim D\) for the first step, otherwise \(s_t\) is the previously generated observation, and \(a_t \sim \pi(\cdot|s_t)\). The generated \(r_{t+1}\) would be far from the ground truth if the actions sampled from the policy function are out-of-distribution. Thus, the uncertainty penalty regularizes this bias.
\[\tilde{r}_{t+1} = r_{t+1} - \lambda \max_{i=1,\dots,N} \| \Sigma_i (s_t, a_t) \|\]
where \(\Sigma_i(s_t, a_t)\) is the variance estimated by the \(i\)-th model.
Finally, the generated transitions \((s_t, a_t, \tilde{r}_{t+1}, s_{t+1})\) are appended to dataset \(D\).
The generation process starts from n_transitions randomly sampled transitions and rolls each of them out for horizon steps.
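The generation procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not d3rlpy's implementation: `penalized_reward`, `generate_transitions`, `make_model`, and the toy ensemble are hypothetical names, each model is assumed to return a `(next_state, reward, variance)` tuple, and a member is picked at random at every step (the library may pick models differently).

```python
import random


def penalized_reward(reward, variances, lam):
    """Subtract lam times the largest variance norm across ensemble members."""
    norms = [sum(v * v for v in var) ** 0.5 for var in variances]
    return reward - lam * max(norms)


def generate_transitions(start_states, ensemble, policy, horizon, lam, rng):
    """Roll out `horizon` steps from each start state (hypothetical sketch)."""
    generated = []
    for state in start_states:
        for _ in range(horizon):
            action = policy(state)
            # Query every member so the penalty can use the worst-case variance.
            outputs = [model(state, action) for model in ensemble]
            # The next state and reward come from one randomly picked member.
            next_state, reward, _ = rng.choice(outputs)
            variances = [var for _, _, var in outputs]
            reward_tilde = penalized_reward(reward, variances, lam)
            generated.append((state, action, reward_tilde, next_state))
            # The generated observation becomes the input of the next step.
            state = next_state
    return generated


def make_model(bias, var):
    """Build a toy dynamics model with a fixed bias and constant variance."""
    def model(state, action):
        next_state = [s + action + bias for s in state]
        reward = -sum(abs(s) for s in next_state)
        return next_state, reward, [var] * len(state)
    return model


def policy(state):
    """Toy policy that pushes the first state dimension toward zero."""
    return -0.5 * state[0]


ensemble = [make_model(0.0, 0.01), make_model(0.1, 0.04), make_model(-0.1, 0.02)]
transitions = generate_transitions(
    [[1.0, 2.0], [0.5, -0.5]], ensemble, policy,
    horizon=5, lam=1.0, rng=random.Random(0),
)
```

With two start states and `horizon=5`, the sketch yields ten transitions, each a `(state, action, penalized reward, next state)` tuple ready to be appended to the dataset.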
Note
Currently, MOPO only supports vector observations.
- Parameters
learning_rate (float) – learning rate for dynamics model.
optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory.
encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory.
batch_size (int) – mini-batch size.
n_frames (int) – the number of frames to stack for image observations.
n_ensembles (int) – the number of dynamics models in the ensemble.
n_transitions (int) – the number of parallel trajectories to generate.
horizon (int) – the number of steps to generate.
lam (float) – \(\lambda\) for uncertainty penalties.
discrete_action (bool) – flag to take discrete actions.
scaler (d3rlpy.preprocessing.scalers.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’].
action_scaler (d3rlpy.preprocessing.ActionScaler or str) – action preprocessor. The available options are ['min_max'].
use_gpu (bool or d3rlpy.gpu.Device) – flag to use GPU or device.
impl (d3rlpy.dynamics.torch.MOPOImpl) – dynamics implementation.
Methods
- build_with_dataset(dataset)
Instantiate implementation object with MDPDataset object.
- Parameters
dataset (d3rlpy.dataset.MDPDataset) – dataset.
- Return type
- build_with_env(env)
Instantiate implementation object with OpenAI Gym object.
- Parameters
env (gym.core.Env) – gym-like environment.
- Return type
- create_impl(observation_shape, action_size)
Instantiate implementation objects with the dataset shapes.
This method is used internally when the fit method is called.
- fit(episodes, n_epochs=1000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None, shuffle=True)
Trains with the given dataset.
algo.fit(episodes)
- Parameters
episodes (List[d3rlpy.dataset.Episode]) – list of episodes to train.
n_epochs (int) – the number of epochs to train.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to append a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard (bool) – flag to save logged information in TensorBoard (in addition to the CSV data).
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions, keyed by name, used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
- Return type
- classmethod from_json(fname, use_gpu=False)
Returns algorithm configured with JSON file.
The JSON file should be the one saved during fitting.
from d3rlpy.algos import Algo
# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')
# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')
# ready to predict
algo.predict(...)
- generate(algo, transitions)
Returns new transitions for data augmentation.
- Parameters
algo (d3rlpy.algos.base.AlgoBase) – algorithm.
transitions (List[d3rlpy.dataset.Transition]) – list of transitions.
- Returns
list of generated transitions.
- Return type
- get_params(deep=True)
Returns all attributes.
This method returns all attributes, including those defined in subclasses. Some scikit-learn utilities use this method.
params = algo.get_params(deep=True)
# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
- load_model(fname)
Load neural network parameters.
algo.load_model('model.pt')
- predict(x, action, with_variance=False)
Returns predicted observation and reward.
- Parameters
x (Union[numpy.ndarray, List[Any]]) – observation
action (Union[numpy.ndarray, List[Any]]) – action
with_variance (bool) – flag to return prediction variance.
- Returns
tuple of predicted observation and reward; the prediction variance is included as a third element when with_variance is True.
- Return type
Union[Tuple[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]]
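Because the length of the returned tuple depends on with_variance, callers need to unpack it accordingly. The sketch below uses a hypothetical stand-in, `toy_predict`, that only mirrors the return convention documented above; it works on single Python lists rather than NumPy batches, and its arithmetic is made up for illustration, not taken from d3rlpy.

```python
def toy_predict(x, action, with_variance=False):
    """Hypothetical stand-in that mimics the documented return convention."""
    next_obs = [xi + ai for xi, ai in zip(x, action)]
    reward = -sum(abs(v) for v in next_obs)
    if with_variance:
        # Made-up uncertainty value, present only to show the third element.
        variance = 0.01 * sum(a * a for a in action)
        return next_obs, reward, variance
    return next_obs, reward


# Default call: two values come back.
obs, rew = toy_predict([0.0, 1.0], [0.5, -0.5])

# With the flag set: three values come back.
obs_v, rew_v, var_v = toy_predict([0.0, 1.0], [0.5, -0.5], with_variance=True)
```

Unpacking two names from a call made with `with_variance=True` would raise a `ValueError`, so the flag and the unpacking pattern should always be changed together.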
- save_model(fname)
Saves neural network parameters.
algo.save_model('model.pt')
- save_params(logger)
Saves configurations as params.json.
- Parameters
logger (d3rlpy.logger.D3RLPyLogger) – logger object.
- Return type
- set_params(**params)
Sets the given arguments to the attributes if they exist.
This method sets the given values to the attributes, including those defined in subclasses. Values passed for attributes that do not exist are ignored. Some scikit-learn utilities use this method.
algo.set_params(batch_size=100)
- Parameters
params (Any) – arbitrary inputs to set as attributes.
- Returns
itself.
- Return type
d3rlpy.base.LearnableBase
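The get_params and set_params contract described above (round-tripping attributes, ignoring unknown names, returning the object itself) can be sketched with a small stand-in class. `ToyLearnable` is a hypothetical illustration of the convention, not d3rlpy's base class, and it tracks only two attributes.

```python
class ToyLearnable:
    """Hypothetical stand-in showing the scikit-learn-style parameter API."""

    def __init__(self, batch_size=100, learning_rate=1e-3):
        self.batch_size = batch_size
        self.learning_rate = learning_rate

    def get_params(self, deep=True):
        # `deep` is accepted for scikit-learn compatibility; unused in this toy.
        return {
            "batch_size": self.batch_size,
            "learning_rate": self.learning_rate,
        }

    def set_params(self, **params):
        for key, value in params.items():
            # Only existing instance attributes are updated; others are ignored.
            if key in vars(self):
                setattr(self, key, value)
        return self


algo = ToyLearnable()
algo.set_params(batch_size=32, unknown_option=1)

# The returned dictionary can be used to build an equivalent object.
algo2 = ToyLearnable(**algo.get_params())
```

Returning `self` from `set_params` allows calls to be chained, which is the behavior scikit-learn utilities such as parameter search expect.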
- update(epoch, total_step, batch)
Update parameters with mini-batch of data.
- Parameters
epoch (int) – the current number of epochs.
total_step (int) – the current number of total iterations.
batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.
- Returns
loss values.
- Return type
Attributes
- action_scaler
Preprocessing action scaler.
- Returns
preprocessing action scaler.
- Return type
Optional[ActionScaler]
- horizon
- impl
Implementation object.
- Returns
implementation object.
- Return type
Optional[ImplBase]
- n_frames
Number of frames to stack.
This applies only to image observations.
- Returns
number of frames to stack.
- Return type
- n_transitions
- observation_shape
Observation shape.
- Returns
observation shape.
- Return type
Optional[Sequence[int]]
- scaler
Preprocessing scaler.
- Returns
preprocessing scaler.
- Return type
Optional[Scaler]