d3rlpy.algos.DiscreteAWR

class d3rlpy.algos.DiscreteAWR(*, actor_learning_rate=5e-05, critic_learning_rate=0.0001, actor_optim_factory=<d3rlpy.models.optimizers.SGDFactory object>, critic_optim_factory=<d3rlpy.models.optimizers.SGDFactory object>, actor_encoder_factory='default', critic_encoder_factory='default', batch_size=2048, n_frames=1, gamma=0.99, batch_size_per_update=256, n_actor_updates=1000, n_critic_updates=200, lam=0.95, beta=1.0, max_weight=20.0, use_gpu=False, scaler=None, action_scaler=None, augmentation=None, generator=None, impl=None, **kwargs)[source]

Discrete version of the Advantage-Weighted Regression algorithm.

AWR is an actor-critic algorithm that trains both its value function and its policy via supervised regression, and it has shown strong performance in both online and offline settings.

The value function is trained as a supervised regression problem.

\[L(\theta) = \mathbb{E}_{s_t, R_t \sim D} [(R_t - V(s_t|\theta))^2]\]

where \(R_t\) is approximated using TD(\(\lambda\)) to mitigate the high-variance issue.
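
For reference, the TD(\(\lambda\)) return blends \(n\)-step returns with exponentially decaying weights (this is the standard definition rather than a detail specific to this implementation):

\[R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}, \quad R_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}|\theta)\]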

The policy function is also trained as a supervised regression problem.

\[J(\phi) = \mathbb{E}_{s_t, a_t, R_t \sim D} [\log \pi(a_t|s_t, \phi) \exp (\frac{1}{B} (R_t - V(s_t|\theta)))]\]

where \(B\) is a constant scale factor. In practice, the exponentiated advantage weight is clipped at \(w_{\text{max}}\) (the max_weight parameter) to keep it bounded.

References

  • Peng et al., Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. https://arxiv.org/abs/1910.00177

Parameters
  • actor_learning_rate (float) – learning rate for policy function.

  • critic_learning_rate (float) – learning rate for value function.

  • actor_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory for the actor.

  • critic_optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory for the critic.

  • actor_encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory for the actor.

  • critic_encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory for the critic.

  • batch_size (int) – batch size per iteration.

  • n_frames (int) – the number of frames to stack for image observation.

  • gamma (float) – discount factor.

  • batch_size_per_update (int) – mini-batch size.

  • n_actor_updates (int) – actor gradient steps per iteration.

  • n_critic_updates (int) – critic gradient steps per iteration.

  • lam (float) – \(\lambda\) for TD(\(\lambda\)).

  • beta (float) – \(B\) for weight scale.

  • max_weight (float) – \(w_{\text{max}}\) for weight clipping.

  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.

  • scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’].

  • augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.

  • generator (d3rlpy.algos.base.DataGenerator) – dynamic dataset generator (e.g. model-based RL).

  • impl (d3rlpy.algos.torch.awr_impl.DiscreteAWRImpl) – algorithm implementation.
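
A minimal usage sketch for offline training (the d3rlpy.datasets.get_cartpole helper and the hyperparameter values are purely illustrative, not recommendations):

from d3rlpy.algos import DiscreteAWR
from d3rlpy.datasets import get_cartpole

# discrete-action offline dataset with a matching environment
dataset, env = get_cartpole()

# illustrative hyperparameters; the defaults are usually a reasonable start
awr = DiscreteAWR(batch_size=2048,
                  batch_size_per_update=256,
                  lam=0.95,
                  beta=1.0)

# offline training on the recorded episodes
awr.fit(dataset.episodes, n_epochs=10)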

Methods

build_with_dataset(dataset)

Instantiates the implementation object from an MDPDataset object.

Parameters

dataset (d3rlpy.dataset.MDPDataset) – dataset.

Return type

None
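
A short sketch of building the implementation from a dataset without starting training (the get_cartpole helper is used only for illustration):

from d3rlpy.algos import DiscreteAWR
from d3rlpy.datasets import get_cartpole

dataset, _ = get_cartpole()

awr = DiscreteAWR()
# builds the internal networks from the dataset's observation and action shapes
awr.build_with_dataset(dataset)

# the model can now be saved or loaded before any fitting
awr.save_model('awr_init.pt')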

build_with_env(env)

Instantiates the implementation object from an OpenAI Gym environment.

Parameters

env (gym.core.Env) – gym-like environment.

Return type

None
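
Similarly, the implementation can be built directly from a gym-like environment (a sketch; CartPole-v0 is just an example of a discrete-action environment):

import gym
from d3rlpy.algos import DiscreteAWR

env = gym.make('CartPole-v0')

awr = DiscreteAWR()
# builds the internal networks from the environment's observation and action spaces
awr.build_with_env(env)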

create_impl(observation_shape, action_size)[source]

Instantiates the implementation object from the dataset shapes.

This method is used internally when the fit method is called.

Parameters
  • observation_shape (Sequence[int]) – observation shape.

  • action_size (int) – dimension of action-space.

Return type

None

fit(episodes, n_epochs=1000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None, shuffle=True)

Trains with the given dataset.

algo.fit(episodes)
Parameters
  • episodes (List[d3rlpy.dataset.Episode]) – list of episodes to train.

  • n_epochs (int) – the number of epochs to train.

  • save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.

  • experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – flag to append a timestamp string to the directory name.

  • logdir (str) – root directory name to save logs.

  • verbose (bool) – flag to show logged information on stdout.

  • show_progress (bool) – flag to show progress bar for iterations.

  • tensorboard (bool) – flag to save logged information to TensorBoard (in addition to the CSV data).

  • eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.

  • save_interval (int) – interval to save parameters.

  • scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.

  • shuffle (bool) – flag to shuffle transitions on each epoch.

Return type

None
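
A sketch of fitting with held-out evaluation episodes (the train/test split and the evaluate_on_environment scorer are illustrative assumptions; see d3rlpy.metrics for the available scorers):

from d3rlpy.algos import DiscreteAWR
from d3rlpy.datasets import get_cartpole
from d3rlpy.metrics.scorer import evaluate_on_environment

dataset, env = get_cartpole()

# simple episode-level train/test split
train_episodes = dataset.episodes[:-10]
test_episodes = dataset.episodes[-10:]

awr = DiscreteAWR()
awr.fit(train_episodes,
        n_epochs=10,
        eval_episodes=test_episodes,
        scorers={'environment': evaluate_on_environment(env)})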

fit_batch_online(env, buffer=None, explorer=None, n_epochs=1000, n_steps_per_epoch=1000, n_updates_per_epoch=1000, eval_interval=10, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, timelimit_aware=True)

Starts the training loop of batch online deep reinforcement learning.

Parameters
  • env (d3rlpy.envs.batch.BatchEnv) – gym-like environment.

  • buffer (Optional[d3rlpy.online.buffers.BatchBuffer]) – replay buffer.

  • explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.

  • n_epochs (int) – the number of epochs to train.

  • n_steps_per_epoch (int) – the number of steps per epoch.

  • n_updates_per_epoch (int) – the number of updates per epoch.

  • eval_interval (int) – the number of epochs before evaluation.

  • eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.

  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.

  • save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.

  • save_interval (int) – the number of epochs before saving models.

  • experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.

  • with_timestamp (bool) – flag to append a timestamp string to the directory name.

  • logdir (str) – root directory name to save logs.

  • verbose (bool) – flag to show logged information on stdout.

  • show_progress (bool) – flag to show progress bar for iterations.

  • tensorboard (bool) – flag to save logged information to TensorBoard (in addition to the CSV data).

  • timelimit_aware (bool) – flag to set the terminal flag to False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.

Return type

None

fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, timelimit_aware=True)

Starts the training loop of online deep reinforcement learning.

Parameters
  • env (gym.core.Env) – gym-like environment.

  • buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.

  • explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.

  • n_steps (int) – the number of total steps to train.

  • n_steps_per_epoch (int) – the number of steps per epoch.

  • update_interval (int) – the number of steps per update.

  • update_start_step (int) – the number of steps before starting updates.

  • eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.

  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.

  • save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.

  • save_interval (int) – the number of epochs before saving models.

  • experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.

  • with_timestamp (bool) – flag to append a timestamp string to the directory name.

  • logdir (str) – root directory name to save logs.

  • verbose (bool) – flag to show logged information on stdout.

  • show_progress (bool) – flag to show progress bar for iterations.

  • tensorboard (bool) – flag to save logged information to TensorBoard (in addition to the CSV data).

  • timelimit_aware (bool) – flag to set the terminal flag to False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.

Return type

None
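
A sketch of online training (the ReplayBuffer construction, the CartPole-v0 environment, and the step counts are illustrative):

import gym
from d3rlpy.algos import DiscreteAWR
from d3rlpy.online.buffers import ReplayBuffer

env = gym.make('CartPole-v0')
eval_env = gym.make('CartPole-v0')

# experience replay buffer for the collected transitions
buffer = ReplayBuffer(maxlen=100000, env=env)

awr = DiscreteAWR()
awr.fit_online(env,
               buffer=buffer,
               eval_env=eval_env,
               n_steps=100000,
               n_steps_per_epoch=1000)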

classmethod from_json(fname, use_gpu=False)

Returns an algorithm configured with a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters
  • fname (str) – file path to params.json.

  • use_gpu (Optional[Union[bool, int, d3rlpy.gpu.Device]]) – flag to use GPU, device ID or device.

Returns

algorithm.

Return type

d3rlpy.base.LearnableBase

get_loss_labels()

Returns the labels of the loss values reported by the update method.

Return type

List[str]

get_params(deep=True)

Returns all attributes.

This method returns all attributes, including those defined in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters

deep (bool) – flag to deeply copy objects such as impl.

Returns

attribute values in dictionary.

Return type

Dict[str, Any]

load_model(fname)

Loads neural network parameters.

algo.load_model('model.pt')
Parameters

fname (str) – source file path.

Return type

None

predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters

x (Union[numpy.ndarray, List[Any]]) – observations.

Returns

greedy actions

Return type

numpy.ndarray

predict_value(x, *args, **kwargs)

Returns predicted state values.

Parameters
  • x (Union[numpy.ndarray, List[Any]]) – observations.

  • args (Any) –

  • kwargs (Any) –

Returns

predicted state values.

Return type

numpy.ndarray
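
A short sketch (assuming a trained DiscreteAWR instance awr; the observation shape is arbitrary). Unlike Q-learning algorithms, AWR's critic is a state-value function, so no actions are passed:

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# one predicted state value per observation
values = awr.predict_value(x)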

sample_action(x)

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.

Parameters

x (Union[numpy.ndarray, List[Any]]) – observations.

Returns

sampled actions.

Return type

numpy.ndarray
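
A short sketch (assuming a trained DiscreteAWR instance awr; the observation shape is arbitrary):

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# stochastically sampled discrete actions, one per observation
actions = awr.sample_action(x)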

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters

fname (str) – destination file path.

Return type

None

save_params(logger)

Saves configurations as params.json.

Parameters

logger (d3rlpy.logger.D3RLPyLogger) – logger object.

Return type

None

save_policy(fname, as_onnx=False)

Saves the greedy-policy computational graph as TorchScript or ONNX.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)

The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.

Parameters
  • fname (str) – destination file path.

  • as_onnx (bool) – flag to save as ONNX format.

Return type

None

set_params(**params)

Sets the given arguments as attributes if they exist.

This method sets the given values on attributes, including those defined in subclasses. Values that do not correspond to existing attributes are ignored. Some scikit-learn utilities use this method.

algo.set_params(batch_size=100)
Parameters

params (Any) – arbitrary inputs to set as attributes.

Returns

itself.

Return type

d3rlpy.base.LearnableBase

update(epoch, total_step, batch)

Updates parameters with a mini-batch of data.

Parameters
  • epoch (int) – the current number of epochs.

  • total_step (int) – the current number of total steps.

  • batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.

Returns

loss values.

Return type

list

Attributes

action_scaler

Preprocessing action scaler.

Returns

preprocessing action scaler.

Return type

Optional[ActionScaler]

action_size

Action size.

Returns

action size.

Return type

Optional[int]

batch_size

Batch size to train.

Returns

batch size.

Return type

int

gamma

Discount factor.

Returns

discount factor.

Return type

float

impl

Implementation object.

Returns

implementation object.

Return type

Optional[ImplBase]

n_frames

Number of frames to stack.

This is only used for image observations.

Returns

number of frames to stack.

Return type

int

n_steps

N-step TD backup.

Returns

N-step TD backup.

Return type

int

observation_shape

Observation shape.

Returns

observation shape.

Return type

Optional[Sequence[int]]

scaler

Preprocessing scaler.

Returns

preprocessing scaler.

Return type

Optional[Scaler]