d3rlpy.algos.DiscreteBC

class d3rlpy.algos.DiscreteBC(*, learning_rate=0.001, optim_factory=d3rlpy.models.optimizers.AdamFactory(optim_cls='Adam', betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False), encoder_factory='default', batch_size=100, n_frames=1, beta=0.5, use_gpu=False, scaler=None, impl=None, **kwargs)

Behavior Cloning algorithm for discrete control.

Behavior Cloning (BC) imitates the actions in the dataset via supervised learning. Because BC only imitates the action distribution of the dataset, its performance will be close to the mean of the dataset, even though BC mostly works better than online RL algorithms.

\[L(\theta) = \mathbb{E}_{a_t, s_t \sim D} [-\sum_a p(a|s_t) \log \pi_\theta(a|s_t)]\]

where \(p(a|s_t)\) is implemented as a one-hot vector.
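
As a minimal illustration of the loss above, the following plain-Python sketch evaluates it for a small batch. Because \(p(a|s_t)\) is one-hot, the inner sum collapses to the negative log-probability of the dataset action. The names `softmax` and `bc_loss` and the example logits are assumptions made for illustration; they are not part of d3rlpy.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution pi_theta(.|s)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bc_loss(batch_logits, batch_actions):
    """Mean cross-entropy between one-hot dataset actions and the policy."""
    losses = []
    for logits, action in zip(batch_logits, batch_actions):
        probs = softmax(logits)
        # One-hot target: only the term for the dataset action survives.
        losses.append(-math.log(probs[action]))
    return sum(losses) / len(losses)


# Hypothetical policy outputs for two states with three discrete actions.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
actions = [0, 2]
loss = bc_loss(logits, actions)
```

The loss shrinks as the policy puts more probability on the actions that appear in the dataset.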

Parameters

learning_rate (float) – learning rate.
optim_factory (d3rlpy.models.optimizers.OptimizerFactory) – optimizer factory.
encoder_factory (d3rlpy.models.encoders.EncoderFactory or str) – encoder factory.
batch_size (int) – mini-batch size.
n_frames (int) – the number of frames to stack for image observation.
beta (float) – regularization factor.
use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are ['pixel', 'min_max', 'standard'].
impl (d3rlpy.algos.torch.bc_impl.DiscreteBCImpl) – implementation of the algorithm.
kwargs (Any) –
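
The three scaler names can be pictured as the element-wise transforms below. This is a plain-Python sketch of the usual definitions (pixel values divided by 255, min-max scaling to [0, 1], standardization to zero mean and unit variance). The function names are hypothetical, and how d3rlpy computes the statistics from a dataset is assumed here rather than taken from this page.

```python
import statistics


def pixel_scale(values):
    """Map 8-bit pixel intensities from [0, 255] to [0, 1]."""
    return [v / 255.0 for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; constant inputs would need special handling.
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Assumes std > 0; constant inputs would need special handling.
    return [(v - mean) / std for v in values]
```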

Methods

build_with_dataset(dataset)

Instantiate implementation object with MDPDataset object.

Parameters: dataset (d3rlpy.dataset.MDPDataset) – dataset.
Return type: None

build_with_env(env)

Instantiate implementation object with OpenAI Gym object.

Parameters: env (gym.core.Env) – gym-like environment.
Return type: None

collect(env, buffer=None, explorer=None, n_steps=1000000, show_progress=True, timelimit_aware=True)

Collects data via interaction with environment.

If buffer is not given, ReplayBuffer will be internally created.

Parameters

env (gym.core.Env) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_steps (int) – the total number of steps to collect.
show_progress (bool) – flag to show progress bar for iterations.
timelimit_aware (bool) – flag to set the terminal flag to False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.

Returns

replay buffer with the collected data.

Return type

d3rlpy.online.buffers.Buffer
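
The effect of timelimit_aware can be sketched as follows. The helper name `resolve_terminal` is hypothetical, and the sketch only restates the rule described above: when the step-limit wrapper reports a truncation through the `TimeLimit.truncated` entry of the `info` dictionary, the stored terminal flag becomes False.

```python
def resolve_terminal(done, info, timelimit_aware=True):
    """Return the terminal flag to store for a transition."""
    truncated = info.get("TimeLimit.truncated", False)
    if timelimit_aware and truncated:
        # The episode was cut off by the step limit, not ended by the task.
        return False
    return bool(done)
```

Treating a truncated episode as non-terminal avoids teaching the agent that the final state of a timed-out episode has no future return.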

copy_policy_from(algo)

Copies policy parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)

Parameters: algo (d3rlpy.algos.base.AlgoBase) – algorithm object.
Return type: None

copy_q_function_from(algo)

Copies Q-function parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)

Parameters: algo (d3rlpy.algos.base.AlgoBase) – algorithm object.
Return type: None

create_impl(observation_shape, action_size)

Instantiate implementation objects with the dataset shapes.

This method is used internally when the fit method is called.

Parameters

observation_shape (Sequence[int]) – observation shape.
action_size (int) – dimension of action-space.

Return type

None

fit(dataset, n_epochs=None, n_steps=None, n_steps_per_epoch=10000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, eval_episodes=None, save_interval=1, scorers=None, shuffle=True, callback=None)

Trains with the given dataset.

algo.fit(episodes, n_steps=1000000)

Parameters

dataset (Union[List[d3rlpy.dataset.Episode], d3rlpy.dataset.MDPDataset]) – list of episodes to train.
n_epochs (Optional[int]) – the number of epochs to train.
n_steps (Optional[int]) – the number of steps to train.
n_steps_per_epoch (int) – the number of steps per epoch. This value will be ignored when n_steps is None.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to append a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information for TensorBoard (in addition to the CSV data). If None, the directory will not be created.
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
callback (Optional[Callable[[d3rlpy.base.LearnableBase, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called every step.

Returns

list of result tuples (epoch, metrics) per epoch.

Return type

List[Tuple[int, Dict[str, float]]]
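
The interplay of n_epochs, n_steps and n_steps_per_epoch can be sketched with a small planning helper. The helper `plan_training` is hypothetical. It assumes that exactly one of n_epochs and n_steps is passed and that a step budget is split into `n_steps // n_steps_per_epoch` epochs; neither assumption is stated on this page.

```python
def plan_training(n_epochs=None, n_steps=None, n_steps_per_epoch=10000):
    """Describe how many epochs a call would run (illustrative only)."""
    if (n_epochs is None) == (n_steps is None):
        raise ValueError("pass exactly one of n_epochs or n_steps")
    if n_steps is not None:
        # Step-based training: the budget is divided into fixed-size epochs.
        return {
            "mode": "steps",
            "n_epochs": n_steps // n_steps_per_epoch,
            "steps_per_epoch": n_steps_per_epoch,
        }
    # Epoch-based training: n_steps_per_epoch is ignored in this mode.
    return {"mode": "epochs", "n_epochs": n_epochs, "steps_per_epoch": None}
```

Under these assumptions, `plan_training(n_steps=1000000)` describes 100 epochs of 10000 steps each.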

fit_batch_online(env, buffer=None, explorer=None, n_epochs=1000, n_steps_per_epoch=1000, n_updates_per_epoch=1000, eval_interval=10, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, timelimit_aware=True, callback=None)

Start training loop of batch online deep reinforcement learning.

Parameters

env (d3rlpy.envs.batch.BatchEnv) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.BatchBuffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_epochs (int) – the number of epochs to train.
n_steps_per_epoch (int) – the number of steps per epoch.
update_interval – the number of steps per update.
n_updates_per_epoch (int) – the number of updates per epoch.
eval_interval (int) – the number of epochs before evaluation.
eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
save_interval (int) – the number of epochs before saving models.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – flag to append a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information for TensorBoard (in addition to the CSV data). If None, the directory will not be created.
timelimit_aware (bool) – flag to set the terminal flag to False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
callback (Optional[Callable[[d3rlpy.online.iterators.AlgoProtocol, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called at the end of epochs.

Return type

None
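
The eval_epsilon parameter refers to \(\epsilon\)-greedy action selection, which can be sketched in a few lines. The function name `epsilon_greedy` and its arguments are assumptions for illustration and do not correspond to a d3rlpy function.

```python
import random


def epsilon_greedy(greedy_action, action_size, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(action_size)
    # Otherwise keep the action proposed by the greedy policy.
    return greedy_action
```

With the default eval_epsilon=0.0 the greedy action is always kept, so evaluation is deterministic with respect to the policy.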

fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, eval_env=None, eval_epsilon=0.0, save_metrics=True, save_interval=1, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, timelimit_aware=True, callback=None)

Start training loop of online deep reinforcement learning.

Parameters

env (gym.core.Env) – gym-like environment.
buffer (Optional[d3rlpy.online.buffers.Buffer]) – replay buffer.
explorer (Optional[d3rlpy.online.explorers.Explorer]) – action explorer.
n_steps (int) – the number of total steps to train.
n_steps_per_epoch (int) – the number of steps per epoch.
update_interval (int) – the number of steps per update.
update_start_step (int) – the steps before starting updates.
eval_env (Optional[gym.core.Env]) – gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
save_interval (int) – the number of epochs before saving models.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – flag to append a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information for TensorBoard (in addition to the CSV data). If None, the directory will not be created.
timelimit_aware (bool) – flag to set the terminal flag to False when the TimeLimit.truncated flag is True, which is designed to work with gym.wrappers.TimeLimit.
callback (Optional[Callable[[d3rlpy.online.iterators.AlgoProtocol, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called at the end of epochs.

Return type

None
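
How update_interval and update_start_step might combine can be sketched by listing the environment steps at which a parameter update would run. The helper `update_steps` is hypothetical, and the exact rule (update when the step count exceeds update_start_step and is a multiple of update_interval) is an assumption; the real training loop may apply further conditions, such as the buffer holding enough samples.

```python
def update_steps(n_steps, update_interval=1, update_start_step=0):
    """List the 1-based environment steps that would trigger an update."""
    return [
        step
        for step in range(1, n_steps + 1)
        if step > update_start_step and step % update_interval == 0
    ]
```

Under this assumed rule, `update_steps(10, update_interval=2, update_start_step=4)` yields steps 6, 8 and 10.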

fitter(dataset, n_epochs=None, n_steps=None, n_steps_per_epoch=10000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard_dir=None, eval_episodes=None, save_interval=1, scorers=None, shuffle=True, callback=None)

Iterate over epochs/steps to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.

for epoch, metrics in algo.fitter(episodes):
    my_plot(metrics)
    algo.save_model(my_path)

Parameters

dataset (Union[List[d3rlpy.dataset.Episode], d3rlpy.dataset.MDPDataset]) – list of episodes to train.
n_epochs (Optional[int]) – the number of epochs to train.
n_steps (Optional[int]) – the number of steps to train.
n_steps_per_epoch (int) – the number of steps per epoch. This value will be ignored when n_steps is None.
save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
experiment_name (Optional[str]) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to append a timestamp string to the end of the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard_dir (Optional[str]) – directory to save logged information for TensorBoard (in addition to the CSV data). If None, the directory will not be created.
eval_episodes (Optional[List[d3rlpy.dataset.Episode]]) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (Optional[Dict[str, Callable[[Any, List[d3rlpy.dataset.Episode]], float]]]) – dictionary of scorer functions used with eval_episodes.
shuffle (bool) – flag to shuffle transitions on each epoch.
callback (Optional[Callable[[d3rlpy.base.LearnableBase, int, int], None]]) – callable function that takes (algo, epoch, total_step), which is called every step.

Returns

iterator yielding current epoch and metrics dict.

Return type

Generator[Tuple[int, Dict[str, float]], None, None]
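
Because the training loop is exposed as a generator of (epoch, metrics) pairs, the caller can stop it early. The sketch below shows the pattern with a stand-in generator, so it runs without d3rlpy. `stand_in_fitter`, the metric key "loss", the loss values, and epoch numbering starting at 1 are all assumptions made for illustration.

```python
def stand_in_fitter(losses):
    """Hypothetical stand-in that yields (epoch, metrics) pairs."""
    for epoch, loss in enumerate(losses, start=1):
        yield epoch, {"loss": loss}


def train_with_early_stopping(epoch_iter, patience=2, metric="loss"):
    """Consume an epoch iterator until the metric stops improving."""
    best, best_epoch, stale = float("inf"), None, 0
    for epoch, metrics in epoch_iter:
        if metrics[metric] < best:
            best, best_epoch, stale = metrics[metric], epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # leaving the loop stops further training
    return best_epoch, best


# Synthetic loss values, used only to exercise the loop.
best_epoch, best_loss = train_with_early_stopping(
    stand_in_fitter([1.0, 0.5, 0.6, 0.7, 0.1])
)
```

With a real algorithm object, the same loop body would receive the pairs yielded by fitter instead of the stand-in.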

classmethod from_json(fname, use_gpu=False)

Returns an algorithm configured with a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)

Parameters

fname (str) – file path to params.json.
use_gpu (Optional[Union[bool, int, d3rlpy.gpu.Device]]) – flag to use GPU, device ID or device.

Returns

algorithm.

Return type

d3rlpy.base.LearnableBase

generate_new_data(transitions)

Returns generated transitions for data augmentation.

This method is for model-based RL algorithms.

Parameters: transitions (List[d3rlpy.dataset.Transition]) – list of transitions.
Returns: list of new transitions.
Return type: Optional[List[d3rlpy.dataset.Transition]]

get_action_type()

Returns action type (continuous or discrete).

Returns: action type.
Return type: d3rlpy.constants.ActionSpace

get_params(deep=True)

Returns all attributes.

This method returns all attributes, including ones in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)

Parameters: deep (bool) – flag to deeply copy objects such as impl.
Returns: attribute values in dictionary.
Return type: Dict[str, Any]
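
The round trip described above (read the parameters, then build a new object from them) follows the scikit-learn estimator convention. The sketch below shows that convention with a hypothetical stand-in class `TinyAlgo`, which is not a d3rlpy class; only the parameter names are borrowed from the constructor documented on this page.

```python
import copy


class TinyAlgo:
    """Hypothetical stand-in that mimics the get_params convention."""

    def __init__(self, learning_rate=1e-3, batch_size=100, beta=0.5):
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.beta = beta

    def get_params(self, deep=True):
        params = {
            "learning_rate": self.learning_rate,
            "batch_size": self.batch_size,
            "beta": self.beta,
        }
        # deep=True returns copies so callers cannot mutate internal state.
        return copy.deepcopy(params) if deep else params


algo = TinyAlgo(learning_rate=3e-4)
# The returned dictionary is enough to build an equivalent new object.
clone = TinyAlgo(**algo.get_params())
```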

load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')

Parameters: fname (str) – source file path.
Return type: None

predict(x)

Returns greedy actions.

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control

Parameters: x (Union[numpy.ndarray, List[Any]]) – observations
Returns: greedy actions
Return type: numpy.ndarray
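
For discrete control, a greedy action is the index of the highest-scoring action for each observation, which is why the result has one entry per observation. The plain-Python sketch below illustrates this; `greedy_actions` and the example scores are hypothetical and stand in for whatever per-action scores the policy network produces.

```python
def greedy_actions(batch_scores):
    """Return the index of the highest score for each observation."""
    return [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in batch_scores
    ]


# Hypothetical per-action scores for two observations and three actions.
scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]]
actions = greedy_actions(scores)
```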

predict_value(x, action, with_std=False)

Value prediction is not supported by BC algorithms.

Parameters

x (Union[numpy.ndarray, List[Any]]) –
action (Union[numpy.ndarray, List[Any]]) –
with_std (bool) –

Return type

numpy.ndarray

sample_action(x)

Sampling actions is not supported by BC algorithms.

Parameters: x (Union[numpy.ndarray, List[Any]]) –
Return type: None

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')

Parameters: fname (str) – destination file path.
Return type: None

save_params(logger)

Saves configurations as params.json.

Parameters: logger (d3rlpy.logger.D3RLPyLogger) – logger object.
Return type: None

save_policy(fname, as_onnx=False)

Save the greedy-policy computational graph as TorchScript or ONNX.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)

The artifacts saved with this method work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.