d3rlpy.algos.BCQ

class d3rlpy.algos.BCQ(actor_learning_rate=0.001, critic_learning_rate=0.001, imitator_learning_rate=0.001, batch_size=100, n_frames=1, gamma=0.99, tau=0.005, n_critics=2, bootstrap=False, share_encoder=False, update_actor_interval=1, lam=0.75, n_action_samples=100, action_flexibility=0.05, rl_start_epoch=0, latent_size=32, beta=0.5, eps=1e-08, use_batch_norm=False, q_func_type='mean', n_epochs=1000, use_gpu=False, scaler=None, augmentation=[], n_augmentations=1, encoder_params={}, dynamics=None, impl=None, **kwargs)[source]

Batch-Constrained Q-learning algorithm.

BCQ is the very first practical data-driven deep reinforcement learning algorithm. The major difference from DDPG is that the policy function is represented as a combination of a conditional VAE and a perturbation function in order to remedy the extrapolation error emerging from target value estimation.

The encoder and the decoder of the conditional VAE are represented as \(E_\omega\) and \(D_\omega\) respectively.

\[L(\omega) = \mathbb{E}_{s_t, a_t \sim D} [(a_t - \tilde{a})^2 + D_{KL}(N(\mu, \sigma) \| N(0, 1))]\]

where \(\mu, \sigma = E_\omega(s_t, a_t)\), \(\tilde{a} = D_\omega(s_t, z)\) and \(z \sim N(\mu, \sigma)\).
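
As a rough illustration, the reconstruction and KL terms can be computed as follows. This is a minimal PyTorch sketch with hypothetical encoder and decoder modules, not d3rlpy's actual implementation; the beta argument corresponds to the KL regularization weight described in the parameters below.

import torch

def conditional_vae_loss(encoder, decoder, s, a, beta=0.5):
    # hypothetical encoder: returns mean and log-std of the latent distribution
    mu, logstd = encoder(s, a)
    std = logstd.exp()
    # reparameterization trick: z ~ N(mu, sigma)
    z = mu + std * torch.randn_like(std)
    # hypothetical decoder: reconstructs the action from the state and latent
    recon = decoder(s, z)
    recon_loss = ((a - recon) ** 2).sum(dim=1).mean()
    # closed-form KL(N(mu, sigma) || N(0, 1))
    kl_loss = -0.5 * (1 + 2 * logstd - mu ** 2 - std ** 2).sum(dim=1).mean()
    return recon_loss + beta * kl_loss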

The policy function is represented as a residual function of an action sampled from the VAE and the perturbation function \(\xi_\phi (s, a)\).

\[\pi(s, a) = a + \Phi \xi_\phi (s, a)\]

where \(a = D_\omega (s, z)\), \(z \sim N(0, 0.5)\) and \(\Phi\) is the perturbation scale designated by action_flexibility. Although the policy is trained to stay close to the data distribution, the perturbation function can lead it to higher-reward states.
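
In code, the perturbed action can be formed like this, continuing the sketch above with a hypothetical perturbation module whose output is assumed to be bounded in [-1, 1] (e.g. by tanh):

def perturbed_action(decoder, perturbation, s, latent_size=32, action_flexibility=0.05):
    # latent vectors from the prior N(0, 0.5)
    z = 0.5 * torch.randn(s.shape[0], latent_size, device=s.device)
    # action close to the data distribution
    a = decoder(s, z)
    # add a small residual bounded by the scale Phi (action_flexibility)
    return a + action_flexibility * perturbation(s, a)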

BCQ also leverages twin Q functions and computes a weighted average of their maximum and minimum values.

\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(y - Q_{\theta_i}(s_t, a_t))^2]\]
\[y = r_{t+1} + \gamma \max_{a_i} [ \lambda \min_j Q_{\theta_j'}(s_{t+1}, a_i) + (1 - \lambda) \max_j Q_{\theta_j'}(s_{t+1}, a_i)]\]

where \(\{a_i \sim D_\omega(s_{t+1}, z), z \sim N(0, 0.5)\}_{i=1}^n\). The number of sampled actions is designated by n_action_samples.
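
A sketch of this target computation, continuing the hypothetical modules above; here candidates holds the n_action_samples perturbed candidate actions per next state, q_targets is a list of target Q networks, and terminal masking is omitted:

def bcq_target(q_targets, s_next, candidates, rewards, gamma=0.99, lam=0.75):
    # candidates: (batch, n_action_samples, action_size)
    batch_size, n, _ = candidates.shape
    s_rep = s_next.unsqueeze(1).expand(-1, n, -1).reshape(batch_size * n, -1)
    a_rep = candidates.reshape(batch_size * n, -1)
    # evaluate every candidate action with every target Q function
    values = torch.stack([q(s_rep, a_rep).reshape(batch_size, n) for q in q_targets])
    # weighted mix of pessimistic (min) and optimistic (max) ensemble estimates
    mixed = lam * values.min(dim=0).values + (1 - lam) * values.max(dim=0).values
    # the best candidate per state forms the TD target
    return rewards + gamma * mixed.max(dim=1).values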

Finally, the perturbation function is trained just like DDPG’s policy function.

\[J(\phi) = \mathbb{E}_{s_t \sim D, a_t \sim D_\omega(s_t, z), z \sim N(0, 0.5)} [Q_{\theta_1} (s_t, \pi(s_t, a_t))]\]
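
The corresponding training step minimizes the negative Q value, again with the hypothetical modules above; q1 denotes the first Q function:

def perturbation_loss(q1, decoder, perturbation, s, latent_size=32, action_flexibility=0.05):
    z = 0.5 * torch.randn(s.shape[0], latent_size, device=s.device)
    # actions sampled from the VAE are not differentiated through
    a = decoder(s, z).detach()
    perturbed = a + action_flexibility * perturbation(s, a)
    # gradient ascent on Q via minimizing the negative value
    return -q1(s, perturbed).mean()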

At inference time, n_action_samples action candidates are sampled, and the action with the highest value estimate is taken.

\[\pi'(s) = \text{argmax}_{\pi(s, a_i)} Q_{\theta_1} (s, \pi(s, a_i))\]
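
A sketch of this selection step for a single observation s with shape (1, observation_size), using the same hypothetical modules as above:

def select_action(q1, decoder, perturbation, s, n_action_samples=100,
                  latent_size=32, action_flexibility=0.05):
    # repeat the observation for every candidate
    s_rep = s.repeat(n_action_samples, 1)
    z = 0.5 * torch.randn(n_action_samples, latent_size, device=s.device)
    a = decoder(s_rep, z)
    candidates = a + action_flexibility * perturbation(s_rep, a)
    # pick the candidate with the highest estimated value
    values = q1(s_rep, candidates).flatten()
    return candidates[values.argmax()]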

Note

The greedy action is not deterministic because the action candidates are always randomly sampled. This might affect the save_policy method and performance in production.
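
A minimal usage sketch with the public API documented below; the random dataset here is only for illustration:

import numpy as np
from d3rlpy.algos import BCQ
from d3rlpy.dataset import MDPDataset

# toy continuous-control dataset: 1000 steps with 10-dim observations and 2-dim actions
observations = np.random.random((1000, 10)).astype('float32')
actions = np.random.random((1000, 2)).astype('float32')
rewards = np.random.random(1000).astype('float32')
terminals = np.random.randint(2, size=1000)
dataset = MDPDataset(observations, actions, rewards, terminals)

bcq = BCQ(n_epochs=1, use_gpu=False)
bcq.fit(dataset.episodes)

# greedy (but stochastic, see the note above) actions
greedy_actions = bcq.predict(observations[:10])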

References

  • Fujimoto et al., Off-Policy Deep Reinforcement Learning without Exploration. https://arxiv.org/abs/1812.02900

Parameters:
  • actor_learning_rate (float) – learning rate for policy function.
  • critic_learning_rate (float) – learning rate for Q functions.
  • imitator_learning_rate (float) – learning rate for Conditional VAE.
  • batch_size (int) – mini-batch size.
  • n_frames (int) – the number of frames to stack for image observation.
  • gamma (float) – discount factor.
  • tau (float) – target network synchronization coefficient.
  • n_critics (int) – the number of Q functions for ensemble.
  • bootstrap (bool) – flag to bootstrap Q functions.
  • share_encoder (bool) – flag to share encoder network.
  • update_actor_interval (int) – interval to update policy function.
  • lam (float) – weight factor for critic ensemble.
  • n_action_samples (int) – the number of action samples to estimate action-values.
  • action_flexibility (float) – output scale of perturbation function represented as \(\Phi\).
  • rl_start_epoch (int) – epoch at which to start updating the policy function and Q functions. Setting this large can make RL training more stable.
  • latent_size (int) – size of latent vector for Conditional VAE.
  • beta (float) – KL regularization term for Conditional VAE.
  • eps (float) – \(\epsilon\) for Adam optimizer.
  • use_batch_norm (bool) – flag to insert batch normalization layers.
  • q_func_type (str) – type of Q function. Available options are [‘mean’, ‘qr’, ‘iqn’, ‘fqf’].
  • n_epochs (int) – the number of epochs to train.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
  • scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’]
  • augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
  • n_augmentations (int) – the number of data augmentations to update.
  • encoder_params (dict) – optional arguments for encoder setup. If the observation is pixel, you can pass filters as a list of (filter_size, kernel_size, stride) tuples and feature_size as an integer for the last linear layer size. If the observation is vector, you can pass hidden_units as a list of hidden unit sizes.
  • dynamics (d3rlpy.dynamics.base.DynamicsBase) – dynamics model for data augmentation.
  • impl (d3rlpy.algos.torch.bcq_impl.BCQImpl) – algorithm implementation.
actor_learning_rate

learning rate for policy function.

Type:float
critic_learning_rate

learning rate for Q functions.

Type:float
imitator_learning_rate

learning rate for Conditional VAE.

Type:float
batch_size

mini-batch size.

Type:int
n_frames

the number of frames to stack for image observation.

Type:int
gamma

discount factor.

Type:float
tau

target network synchronization coefficient.

Type:float
n_critics

the number of Q functions for ensemble.

Type:int
bootstrap

flag to bootstrap Q functions.

Type:bool
share_encoder

flag to share encoder network.

Type:bool
update_actor_interval

interval to update policy function.

Type:int
lam

weight factor for critic ensemble.

Type:float
n_action_samples

the number of action samples to estimate action-values.

Type:int
action_flexibility

output scale of perturbation function.

Type:float
rl_start_epoch

epoch to start to update policy function and Q functions.

Type:int
latent_size

size of latent vector for Conditional VAE.

Type:int
beta

KL regularization term for Conditional VAE.

Type:float
eps

\(\epsilon\) for Adam optimizer.

Type:float
use_batch_norm

flag to insert batch normalization layers.

Type:bool
q_func_type

type of Q function.

Type:str
n_epochs

the number of epochs to train.

Type:int
use_gpu

GPU device.

Type:d3rlpy.gpu.Device
scaler

preprocessor.

Type:d3rlpy.preprocessing.Scaler
augmentation

augmentation pipeline.

Type:d3rlpy.augmentation.AugmentationPipeline
n_augmentations

the number of data augmentations to update.

Type:int
encoder_params

optional arguments for encoder setup.

Type:dict
dynamics

dynamics model.

Type:d3rlpy.dynamics.base.DynamicsBase
impl

algorithm implementation.

Type:d3rlpy.algos.torch.bcq_impl.BCQImpl
eval_results_

evaluation results.

Type:dict

Methods

create_impl(observation_shape, action_size)[source]

Instantiate implementation objects with the dataset shapes.

This method will be used internally when the fit method is called.

Parameters:
  • observation_shape (tuple) – observation shape.
  • action_size (int) – dimension of action-space.
fit(episodes, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None)

Trains with the given dataset.

algo.fit(episodes)
Parameters:
  • episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
  • experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
  • with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
  • logdir (str) – root directory name to save logs.
  • verbose (bool) – flag to show logged information on stdout.
  • show_progress (bool) – flag to show progress bar for iterations.
  • tensorboard (bool) – flag to save logged information in tensorboard (in addition to the csv data).
  • eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
  • save_interval (int) – interval to save parameters.
  • scorers (list(callable)) – list of scorer functions used with eval_episodes.
classmethod from_json(fname, use_gpu=False)

Returns an algorithm configured with a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters:
  • fname (str) – file path to params.json.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
Returns:

algorithm.

Return type:

d3rlpy.base.LearnableBase

get_params(deep=True)

Returns all attributes.

This method returns all attributes, including ones defined in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters:deep (bool) – flag to deeply copy objects such as impl.
Returns:attribute values in dictionary.
Return type:dict
load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')
Parameters:fname (str) – source file path.
predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters:x (numpy.ndarray) – observations
Returns:greedy actions
Return type:numpy.ndarray
predict_value(x, action, with_std=False)

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape  == (100,)
Parameters:
  • x (numpy.ndarray) – observations
  • action (numpy.ndarray) – actions
  • with_std (bool) – flag to return the standard deviation of the ensemble estimation. This deviation reflects the uncertainty for the given observations. This uncertainty will be more accurate if you enable the bootstrap flag and increase the n_critics value.
Returns:

predicted action-values

Return type:

numpy.ndarray

sample_action(x)[source]

BCQ does not support sampling actions.

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters:fname (str) – destination file path.
save_policy(fname, as_onnx=False)

Save the greedy-policy computational graph as TorchScript or ONNX.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)

The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.

Parameters:
  • fname (str) – destination file path.
  • as_onnx (bool) – flag to save as ONNX format.
set_params(**params)

Sets the given arguments to the attributes if they exist.

This method sets the given values to attributes, including ones defined in subclasses. Values that don't correspond to existing attributes are ignored. Some scikit-learn utilities use this method.

algo.set_params(n_epochs=10, batch_size=100)
Parameters:**params – arbitrary inputs to set as attributes.
Returns:itself.
Return type:d3rlpy.algos.base.AlgoBase
update(epoch, total_step, batch)[source]

Update parameters with mini-batch of data.

Parameters:
  • epoch (int) – the current number of epochs.
  • total_step (int) – the current number of total steps.
  • batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.
Returns:

loss values.

Return type:

list