d3rlpy.algos.BCQ

class d3rlpy.algos.BCQ(actor_learning_rate=0.001, critic_learning_rate=0.001, imitator_learning_rate=0.001, batch_size=100, n_frames=1, gamma=0.99, tau=0.005, n_critics=2, bootstrap=False, share_encoder=False, update_actor_interval=1, lam=0.75, n_action_samples=100, action_flexibility=0.05, rl_start_epoch=0, latent_size=32, beta=0.5, eps=1e-08, use_batch_norm=False, q_func_type='mean', n_epochs=1000, use_gpu=False, scaler=None, augmentation=[], n_augmentations=1, encoder_params={}, dynamics=None, impl=None, **kwargs)[source]

Batch-Constrained Q-learning algorithm.

BCQ is the very first practical data-driven deep reinforcement learning algorithm. The major difference from DDPG is that the policy function is represented as a combination of a conditional VAE and a perturbation function in order to remedy the extrapolation error emerging from target value estimation.

The encoder and the decoder of the conditional VAE are represented as \(E_\omega\) and \(D_\omega\) respectively.

\[L(\omega) = \mathbb{E}_{s_t, a_t \sim D} [(a_t - \tilde{a})^2 + D_{KL}(N(\mu, \sigma) \| N(0, 1))]\]

where \(\mu, \sigma = E_\omega(s_t, a_t)\), \(\tilde{a} = D_\omega(s_t, z)\) and \(z \sim N(\mu, \sigma)\).
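
As a rough illustration, the reconstruction and KL terms can be computed as follows. This is a minimal PyTorch sketch with hypothetical encoder and decoder modules, not d3rlpy's actual implementation; the beta argument corresponds to the KL regularization weight described in the parameters below.

import torch

def conditional_vae_loss(encoder, decoder, s, a, beta=0.5):
    # hypothetical encoder: returns mean and log-std of the latent distribution
    mu, logstd = encoder(s, a)
    std = logstd.exp()
    # reparameterization trick: z ~ N(mu, sigma)
    z = mu + std * torch.randn_like(std)
    # hypothetical decoder: reconstructs the action from the state and latent
    recon = decoder(s, z)
    recon_loss = ((a - recon) ** 2).sum(dim=1).mean()
    # closed-form KL(N(mu, sigma) || N(0, 1))
    kl_loss = -0.5 * (1 + 2 * logstd - mu ** 2 - std ** 2).sum(dim=1).mean()
    return recon_loss + beta * kl_loss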

The policy function is represented as a residual function of an action sampled from the VAE and the perturbation function \(\xi_\phi (s, a)\).

\[\pi(s, a) = a + \Phi \xi_\phi (s, a)\]

where \(a = D_\omega (s, z)\), \(z \sim N(0, 0.5)\) and \(\Phi\) is the perturbation scale designated by action_flexibility. Although the policy is trained to stay close to the data distribution, the perturbation function can lead it to higher-reward states.
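
In code, the perturbed action can be formed like this, continuing the sketch above with a hypothetical perturbation module whose output is assumed to be bounded in [-1, 1] (e.g. by tanh):

def perturbed_action(decoder, perturbation, s, latent_size=32, action_flexibility=0.05):
    # latent vectors from the prior N(0, 0.5)
    z = 0.5 * torch.randn(s.shape[0], latent_size, device=s.device)
    # action close to the data distribution
    a = decoder(s, z)
    # add a small residual bounded by the scale Phi (action_flexibility)
    return a + action_flexibility * perturbation(s, a)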

BCQ also leverages twin Q functions and computes a weighted average of their maximum and minimum values.

\[L(\theta_i) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(y - Q_{\theta_i}(s_t, a_t))^2]\]
\[y = r_{t+1} + \gamma \max_{a_i} [ \lambda \min_j Q_{\theta_j'}(s_{t+1}, a_i) + (1 - \lambda) \max_j Q_{\theta_j'}(s_{t+1}, a_i)]\]

where \(\{a_i \sim D_\omega(s_{t+1}, z), z \sim N(0, 0.5)\}_{i=1}^n\). The number of sampled actions is designated by n_action_samples.
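
A sketch of this target computation, continuing the hypothetical modules above; here candidates holds the n_action_samples perturbed candidate actions per next state, q_targets is a list of target Q networks, and terminal masking is omitted:

def bcq_target(q_targets, s_next, candidates, rewards, gamma=0.99, lam=0.75):
    # candidates: (batch, n_action_samples, action_size)
    batch_size, n, _ = candidates.shape
    s_rep = s_next.unsqueeze(1).expand(-1, n, -1).reshape(batch_size * n, -1)
    a_rep = candidates.reshape(batch_size * n, -1)
    # evaluate every candidate action with every target Q function
    values = torch.stack([q(s_rep, a_rep).reshape(batch_size, n) for q in q_targets])
    # weighted mix of pessimistic (min) and optimistic (max) ensemble estimates
    mixed = lam * values.min(dim=0).values + (1 - lam) * values.max(dim=0).values
    # the best candidate per state forms the TD target
    return rewards + gamma * mixed.max(dim=1).values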

Finally, the perturbation function is trained just like DDPG’s policy function.

\[J(\phi) = \mathbb{E}_{s_t \sim D, a_t \sim D_\omega(s_t, z), z \sim N(0, 0.5)} [Q_{\theta_1} (s_t, \pi(s_t, a_t))]\]
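
The corresponding training step minimizes the negative Q value, again with the hypothetical modules above; q1 denotes the first Q function:

def perturbation_loss(q1, decoder, perturbation, s, latent_size=32, action_flexibility=0.05):
    z = 0.5 * torch.randn(s.shape[0], latent_size, device=s.device)
    # actions sampled from the VAE are not differentiated through
    a = decoder(s, z).detach()
    perturbed = a + action_flexibility * perturbation(s, a)
    # gradient ascent on Q via minimizing the negative value
    return -q1(s, perturbed).mean()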

At inference time, n_action_samples action candidates are sampled, and the action with the highest value estimate is taken.

\[\pi'(s) = \text{argmax}_{\pi(s, a_i)} Q_{\theta_1} (s, \pi(s, a_i))\]
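
A sketch of this selection step for a single observation s with shape (1, observation_size), using the same hypothetical modules as above:

def select_action(q1, decoder, perturbation, s, n_action_samples=100,
                  latent_size=32, action_flexibility=0.05):
    # repeat the observation for every candidate
    s_rep = s.repeat(n_action_samples, 1)
    z = 0.5 * torch.randn(n_action_samples, latent_size, device=s.device)
    a = decoder(s_rep, z)
    candidates = a + action_flexibility * perturbation(s_rep, a)
    # pick the candidate with the highest estimated value
    values = q1(s_rep, candidates).flatten()
    return candidates[values.argmax()]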

Note

The greedy action is not deterministic because the action candidates are always randomly sampled. This might affect the save_policy method and performance in production.
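
A minimal usage sketch with the public API documented below; the random dataset here is only for illustration:

import numpy as np
from d3rlpy.algos import BCQ
from d3rlpy.dataset import MDPDataset

# toy continuous-control dataset: 1000 steps with 10-dim observations and 2-dim actions
observations = np.random.random((1000, 10)).astype('float32')
actions = np.random.random((1000, 2)).astype('float32')
rewards = np.random.random(1000).astype('float32')
terminals = np.random.randint(2, size=1000)
dataset = MDPDataset(observations, actions, rewards, terminals)

bcq = BCQ(n_epochs=1, use_gpu=False)
bcq.fit(dataset.episodes)

# greedy (but stochastic, see the note above) actions
greedy_actions = bcq.predict(observations[:10])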

References

  • Fujimoto et al., Off-Policy Deep Reinforcement Learning without Exploration. https://arxiv.org/abs/1812.02900

Parameters:
  • actor_learning_rate (float) – learning rate for policy function.
  • critic_learning_rate (float) – learning rate for Q functions.
  • imitator_learning_rate (float) – learning rate for Conditional VAE.
  • batch_size (int) – mini-batch size.
  • n_frames (int) – the number of frames to stack for image observation.
  • gamma (float) – discount factor.
  • tau (float) – target network synchronization coefficient.
  • n_critics (int) – the number of Q functions for ensemble.
  • bootstrap (bool) – flag to bootstrap Q functions.
  • share_encoder (bool) – flag to share encoder network.
  • update_actor_interval (int) – interval to update policy function.
  • lam (float) – weight factor for critic ensemble.
  • n_action_samples (int) – the number of action samples to estimate action-values.
  • action_flexibility (float) – output scale of perturbation function represented as \(\Phi\).
  • rl_start_epoch (int) – epoch at which to start updating the policy function and Q functions. Setting this large can make RL training more stable.
  • latent_size (int) – size of latent vector for Conditional VAE.
  • beta (float) – KL regularization term for Conditional VAE.
  • eps (float) – \(\epsilon\) for Adam optimizer.
  • use_batch_norm (bool) – flag to insert batch normalization layers.
  • q_func_type (str) – type of Q function. Available options are [‘mean’, ‘qr’, ‘iqn’, ‘fqf’].
  • n_epochs (int) – the number of epochs to train.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
  • scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’]
  • augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
  • n_augmentations (int) – the number of data augmentations to update.
  • encoder_params (dict) – optional arguments for encoder setup. If the observation is pixel, you can pass filters as a list of (filter_size, kernel_size, stride) tuples and feature_size as an integer for the last linear layer size. If the observation is vector, you can pass hidden_units as a list of hidden unit sizes.
  • dynamics (d3rlpy.dynamics.base.DynamicsBase) – dynamics model for data augmentation.
  • impl (d3rlpy.algos.torch.bcq_impl.BCQImpl) – algorithm implementation.
actor_learning_rate

learning rate for policy function.

Type:float
critic_learning_rate

learning rate for Q functions.

Type:float
imitator_learning_rate

learning rate for Conditional VAE.

Type:float
batch_size

mini-batch size.

Type:int
n_frames

the number of frames to stack for image observation.

Type:int
gamma

discount factor.

Type:float
tau

target network synchronization coefficient.

Type:float
n_critics

the number of Q functions for ensemble.

Type:int
bootstrap

flag to bootstrap Q functions.

Type:bool
share_encoder

flag to share encoder network.

Type:bool
update_actor_interval

interval to update policy function.

Type:int
lam

weight factor for critic ensemble.

Type:float
n_action_samples

the number of action samples to estimate action-values.

Type:int
action_flexibility

output scale of perturbation function.

Type:float
rl_start_epoch

epoch to start to update policy function and Q functions.

Type:int
latent_size

size of latent vector for Conditional VAE.

Type:int
beta

KL regularization term for Conditional VAE.

Type:float
eps

\(\epsilon\) for Adam optimizer.

Type:float
use_batch_norm

flag to insert batch normalization layers.

Type:bool
q_func_type

type of Q function.

Type:str
n_epochs

the number of epochs to train.

Type:int
use_gpu

GPU device.

Type:d3rlpy.gpu.Device
scaler

preprocessor.

Type:d3rlpy.preprocessing.Scaler
augmentation

augmentation pipeline.

Type:d3rlpy.augmentation.AugmentationPipeline
n_augmentations

the number of data augmentations to update.

Type:int
encoder_params

optional arguments for encoder setup.

Type:dict
dynamics

dynamics model.

Type:d3rlpy.dynamics.base.DynamicsBase
impl

algorithm implementation.

Type:d3rlpy.algos.torch.bcq_impl.BCQImpl
eval_results_

evaluation results.

Type:dict

Methods

create_impl(observation_shape, action_size)[source]

Instantiate implementation objects with the dataset shapes.

This method will be used internally when the fit method is called.

Parameters:
  • observation_shape (tuple) – observation shape.
  • action_size (int) – dimension of action-space.
fit(episodes, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None)

Trains with the given dataset.

algo.fit(episodes)
Parameters:
  • episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
  • experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
  • with_timestamp (bool) – flag to add a timestamp string to the end of the directory name.
  • logdir (str) – root directory name to save logs.
  • verbose (bool) – flag to show logged information on stdout.
  • show_progress (bool) – flag to show progress bar for iterations.
  • tensorboard (bool) – flag to save logged information in tensorboard (in addition to the csv data).
  • eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
  • save_interval (int) – interval to save parameters.
  • scorers (list(callable)) – list of scorer functions used with eval_episodes.
classmethod from_json(fname, use_gpu=False)

Returns an algorithm configured with a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters:
  • fname (str) – file path to params.json.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
Returns:

algorithm.

Return type:

d3rlpy.base.LearnableBase

get_params(deep=True)

Returns all attributes.

This method returns all attributes, including ones defined in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters:deep (bool) – flag to deeply copy objects such as impl.
Returns:attribute values in dictionary.
Return type:dict
load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')
Parameters:fname (str) – source file path.
predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters:x (numpy.ndarray) – observations
Returns:greedy actions
Return type:numpy.ndarray
predict_value(x, action, with_std=False)

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape  == (100,)
Parameters:
  • x (numpy.ndarray) – observations
  • action (numpy.ndarray) – actions
  • with_std (bool) – flag to return the standard deviation of the ensemble estimation. This deviation reflects the uncertainty for the given observations. This uncertainty will be more accurate if you enable the bootstrap flag and increase the n_critics value.
Returns:

predicted action-values

Return type:

numpy.ndarray

sample_action(x)[source]

BCQ does not support sampling actions.

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters:fname (str) – destination file path.
save_policy(fname, as_onnx=False)

Save the greedy-policy computational graph as TorchScript or ONNX.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)

The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.

Parameters:
  • fname (str) – destination file path.
  • as_onnx (bool) – flag to save as ONNX format.
set_params(**params)

Sets the given arguments to the attributes if they exist.

This method sets the given values to attributes, including ones defined in subclasses. Values that don't correspond to existing attributes are ignored. Some scikit-learn utilities use this method.

algo.set_params(n_epochs=10, batch_size=100)
Parameters:**params – arbitrary inputs to set as attributes.
Returns:itself.
Return type:d3rlpy.algos.base.AlgoBase
update(epoch, total_step, batch)[source]

Update parameters with mini-batch of data.

Parameters:
  • epoch (int) – the current number of epochs.
  • total_step (int) – the current number of total steps.
  • batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.
Returns:

loss values.

Return type:

list