d3rlpy.algos.BEAR¶

class d3rlpy.algos.BEAR(actor_learning_rate=0.0003, critic_learning_rate=0.0003, imitator_learning_rate=0.001, temp_learning_rate=0.0003, alpha_learning_rate=0.001, batch_size=100, gamma=0.99, tau=0.005, n_critics=2, bootstrap=False, share_encoder=False, update_actor_interval=1, initial_temperature=1.0, initial_alpha=1.0, alpha_threshold=0.05, lam=0.75, n_action_samples=4, mmd_sigma=20.0, rl_start_epoch=0, eps=1e-08, use_batch_norm=False, q_func_type='mean', n_epochs=1000, use_gpu=False, scaler=None, augmentation=[], n_augmentations=1, encoder_params={}, dynamics=None, impl=None, **kwargs)[source]¶

Bootstrapping Error Accumulation Reduction algorithm.

BEAR is a SAC-based data-driven deep reinforcement learning algorithm.

BEAR constrains the support of the policy function to the data distribution by minimizing the Maximum Mean Discrepancy (MMD) between the policy function and an approximated behavior policy function \(\pi_\beta(a|s)\), which is trained with an L2 loss.

\[L(\beta) = \mathbb{E}_{s_t, a_t \sim D, a \sim \pi_\beta(\cdot|s_t)} [(a - a_t)^2]\]
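As a rough illustration of the L2 loss above, the following NumPy sketch averages the squared difference between actions sampled from the imitator and the dataset actions. The function name `imitator_l2_loss` and the toy arrays are hypothetical and are not part of d3rlpy; the sketch also assumes the squared error is summed over action dimensions before averaging over the batch.

```python
import numpy as np


def imitator_l2_loss(sampled_actions, dataset_actions):
    """Squared error between imitator samples and dataset actions.

    Both arrays have shape (batch_size, action_size). The squared error is
    summed over action dimensions and averaged over the batch (an assumption
    made for this sketch).
    """
    sampled_actions = np.asarray(sampled_actions, dtype=np.float64)
    dataset_actions = np.asarray(dataset_actions, dtype=np.float64)
    squared_error = np.sum((sampled_actions - dataset_actions) ** 2, axis=1)
    return float(np.mean(squared_error))


# toy batch: a_t from the dataset, a sampled from the imitator
a_t = np.array([[0.0, 1.0], [0.5, -0.5]])
a = np.array([[0.0, 0.0], [0.5, 0.5]])
loss = imitator_l2_loss(a, a_t)
```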

The policy objective is a combination of SAC’s objective and an MMD penalty.

\[J(\phi) = J_{SAC}(\phi) - \mathbb{E}_{s_t \sim D} \alpha ( \text{MMD}(\pi_\beta(\cdot|s_t), \pi_\phi(\cdot|s_t)) - \epsilon)\]

where MMD is computed as follows.

\[\text{MMD}(x, y) = \frac{1}{N^2} \sum_{i, i'} k(x_i, x_{i'}) - \frac{2}{NM} \sum_{i, j} k(x_i, y_j) + \frac{1}{M^2} \sum_{j, j'} k(y_j, y_{j'})\]

where \(k(x, y)\) is a Gaussian kernel \(k(x, y) = \exp{(-(x - y)^2 / (2 \sigma^2))}\).
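The MMD estimate can be sketched in NumPy as follows. This is a minimal illustration rather than the library's implementation: the helper names `gaussian_kernel` and `mmd` are hypothetical, the kernel uses the standard negative exponent, and the squared difference is taken as the squared Euclidean distance between action vectors.

```python
import numpy as np


def gaussian_kernel(x, y, sigma):
    """Pairwise Gaussian kernel exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    diff = x[:, None, :] - y[None, :, :]    # shape (N, M, action_size)
    sq_dist = np.sum(diff ** 2, axis=-1)    # shape (N, M)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd(x, y, sigma=20.0):
    """MMD estimate between samples x (N, action_size) and y (M, action_size)."""
    n, m = x.shape[0], y.shape[0]
    k_xx = gaussian_kernel(x, x, sigma).sum() / (n * n)
    k_xy = gaussian_kernel(x, y, sigma).sum() / (n * m)
    k_yy = gaussian_kernel(y, y, sigma).sum() / (m * m)
    return k_xx - 2.0 * k_xy + k_yy


rng = np.random.default_rng(0)
behavior_actions = rng.normal(0.0, 1.0, size=(4, 2))  # samples from pi_beta
policy_actions = rng.normal(3.0, 1.0, size=(4, 2))    # samples from pi_phi

same = mmd(behavior_actions, behavior_actions)
different = mmd(behavior_actions, policy_actions)
```

Here `same` is zero because both sample sets coincide, while `different` is positive because the two sample sets are drawn around different means. The sample count of 4 and the default `sigma=20.0` mirror the `n_action_samples` and `mmd_sigma` defaults in the signature.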

\(\alpha\) is also adjusted through dual gradient descent: \(\alpha\) becomes smaller when MMD is smaller than the threshold \(\epsilon\).
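The direction of the \(\alpha\) update can be illustrated with a single gradient step. This sketch makes two assumptions that the text does not state: \(\alpha\) is parameterized through its logarithm so that it stays positive, and the step is plain gradient ascent on \(\alpha (\text{MMD} - \epsilon)\) rather than an optimizer such as Adam. The function name `update_log_alpha` is hypothetical.

```python
import math


def update_log_alpha(log_alpha, mmd_value, threshold=0.05, learning_rate=0.001):
    """One gradient-ascent step on alpha * (mmd - threshold) w.r.t. log(alpha)."""
    alpha = math.exp(log_alpha)
    # d/d(log_alpha) [alpha * (mmd - threshold)] = alpha * (mmd - threshold)
    grad = alpha * (mmd_value - threshold)
    return log_alpha + learning_rate * grad


log_alpha = 0.0  # corresponds to initial_alpha = 1.0

# MMD below the threshold: the constraint is satisfied, so alpha shrinks
shrunk = math.exp(update_log_alpha(log_alpha, mmd_value=0.01))

# MMD above the threshold: the constraint is violated, so alpha grows
grown = math.exp(update_log_alpha(log_alpha, mmd_value=0.20))
```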

References

Kumar et al., Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction.

Parameters:

actor_learning_rate (float) – learning rate for policy function.
critic_learning_rate (float) – learning rate for Q functions.
imitator_learning_rate (float) – learning rate for behavior policy function.
temp_learning_rate (float) – learning rate for temperature parameter.
alpha_learning_rate (float) – learning rate for \(\alpha\).
batch_size (int) – mini-batch size.
gamma (float) – discount factor.
tau (float) – target network synchronization coefficient.
n_critics (int) – the number of Q functions for ensemble.
bootstrap (bool) – flag to bootstrap Q functions.
share_encoder (bool) – flag to share encoder network.
update_actor_interval (int) – interval to update policy function.
initial_temperature (float) – initial temperature value.
initial_alpha (float) – initial \(\alpha\) value.
alpha_threshold (float) – threshold value described as \(\epsilon\).
lam (float) – weight for critic ensemble.
n_action_samples (int) – the number of action samples to estimate action-values.
mmd_sigma (float) – \(\sigma\) for gaussian kernel in MMD calculation.
rl_start_epoch (int) – epoch at which updates of the policy function and Q functions start. A larger value tends to make RL training more stable.
eps (float) – \(\epsilon\) for Adam optimizer.
use_batch_norm (bool) – flag to insert batch normalization layers.
q_func_type (str) – type of Q function. Available options are [‘mean’, ‘qr’, ‘iqn’, ‘fqf’].
n_epochs (int) – the number of epochs to train.
use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’].
augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
n_augmentations (int) – the number of data augmentations to update.
encoder_params (dict) – optional arguments for encoder setup. If the observation is pixel, you can pass filters as a list of tuples of (filter_size, kernel_size, stride) and feature_size as an integer for the last linear layer size. If the observation is a vector, you can pass hidden_units as a list of hidden unit sizes.
dynamics (d3rlpy.dynamics.base.DynamicsBase) – dynamics model for data augmentation.
impl (d3rlpy.algos.torch.bear_impl.BEARImpl) – algorithm implementation.
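Two of the parameters above, tau and lam, are easier to read with a small numeric sketch. The code below assumes the conventional meanings: tau as the Polyak-averaging coefficient for target networks, and lam as the weight in a BCQ-style mix of the minimum and maximum over the critic ensemble. Whether the library computes them exactly this way is an assumption, and the function names `soft_update` and `ensemble_target` are hypothetical.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return tau * online_params + (1.0 - tau) * target_params


def ensemble_target(q_values, lam=0.75):
    """Weighted mix over critics: lam * min + (1 - lam) * max.

    q_values has shape (n_critics, batch_size).
    """
    return lam * q_values.min(axis=0) + (1.0 - lam) * q_values.max(axis=0)


# target parameters move a small step toward the online parameters
target = soft_update(np.zeros(3), np.ones(3))

# two critics, batch of two state-action pairs
q = np.array([[1.0, 4.0], [3.0, 2.0]])
mixed = ensemble_target(q)
```

With lam close to 1 the target leans toward the pessimistic minimum over critics, which is the usual motivation for this kind of weighting.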

actor_learning_rate¶

learning rate for policy function.

Type:	float

critic_learning_rate¶

learning rate for Q functions.

Type:	float

imitator_learning_rate¶

learning rate for behavior policy function.

Type:	float

temp_learning_rate¶

learning rate for temperature parameter.

Type:	float

alpha_learning_rate¶

learning rate for \(\alpha\).

Type:	float

batch_size¶

mini-batch size.

Type:	int

gamma¶

discount factor.

Type:	float

tau¶

target network synchronization coefficient.

Type:	float

n_critics¶

the number of Q functions for ensemble.

Type:	int

bootstrap¶

flag to bootstrap Q functions.

Type:	bool

share_encoder¶

flag to share encoder network.

Type:	bool

update_actor_interval¶

interval to update policy function.

Type:	int

initial_temperature¶

initial temperature value.

Type:	float

initial_alpha¶

initial \(\alpha\) value.

Type:	float

alpha_threshold¶

threshold value described as \(\epsilon\).

Type:	float

lam¶

weight for critic ensemble.

Type:	float

n_action_samples¶

the number of action samples to estimate action-values.

Type:	int

mmd_sigma¶

\(\sigma\) for gaussian kernel in MMD calculation.

Type:	float

rl_start_epoch¶

epoch at which updates of the policy function and Q functions start. A larger value tends to make RL training more stable.

Type:	int

eps¶

\(\epsilon\) for Adam optimizer.

Type:	float

use_batch_norm¶

flag to insert batch normalization layers.

Type:	bool

q_func_type¶

type of Q function.

Type:	str

n_epochs¶

the number of epochs to train.

Type:	int

use_gpu¶

GPU device.

Type:	d3rlpy.gpu.Device

scaler¶

preprocessor.

Type:	d3rlpy.preprocessing.Scaler

augmentation¶

augmentation pipeline.

Type:	d3rlpy.augmentation.AugmentationPipeline

n_augmentations¶

the number of data augmentations to update.

Type:	int

encoder_params¶

optional arguments for encoder setup.

Type:	dict
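For illustration, the two dictionary layouts that encoder_params accepts might look like the following. The keys filters, feature_size and hidden_units come from the parameter description; the concrete layer sizes and the variable names are assumptions chosen only for this example.

```python
# Vector observations: sizes of the hidden layers (illustrative values).
vector_encoder_params = {"hidden_units": [256, 256]}

# Pixel observations: one (filter_size, kernel_size, stride) tuple per
# convolutional layer, plus the size of the last linear layer.
pixel_encoder_params = {
    "filters": [(32, 8, 4), (64, 4, 2), (64, 3, 1)],
    "feature_size": 512,
}
```

Either dictionary would then be passed as the encoder_params argument of the constructor.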

dynamics¶

dynamics model.

Type:	d3rlpy.dynamics.base.DynamicsBase

impl¶

algorithm implementation.

Type:	d3rlpy.algos.torch.bear_impl.BEARImpl

Methods

create_impl(observation_shape, action_size)[source]¶

Instantiates implementation objects with the dataset shapes.

This method is used internally when the fit method is called.

Parameters:

observation_shape (tuple) – observation shape.
action_size (int) – dimension of action-space.

fit(episodes, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None)¶

Trains with the given dataset.

algo.fit(episodes)

Parameters:

episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – flag to append a timestamp string to the directory name.
logdir (str) – root directory name to save logs.
verbose (bool) – flag to show logged information on stdout.
show_progress (bool) – flag to show progress bar for iterations.
tensorboard (bool) – flag to save logged information in tensorboard (in addition to the csv data).
eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
save_interval (int) – interval to save parameters.
scorers (list(callable)) – list of scorer functions used with eval_episodes.

classmethod from_json(fname, use_gpu=False)¶

Returns an algorithm configured from a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)

Parameters:

fname (str) – file path to params.json.
use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.

Returns:	algorithm.
Return type:	d3rlpy.base.LearnableBase

get_params(deep=True)¶

Returns all attributes.

This method returns all attributes, including the ones in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)

Parameters:	deep (bool) – flag to deeply copy objects such as impl.
Returns:	attribute values in dictionary.
Return type:	dict

load_model(fname)¶

Load neural network parameters.

algo.load_model('model.pt')

Parameters:	fname (str) – source file path.

predict(x)¶

Returns greedy actions.

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control

Parameters:	x (numpy.ndarray) – observations
Returns:	greedy actions
Return type:	numpy.ndarray

predict_value(x, action, with_std=False)¶

Returns predicted action-values.

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape  == (100,)

Parameters:

x (numpy.ndarray) – observations.
action (numpy.ndarray) – actions.
with_std (bool) – flag to return the standard deviation of the ensemble estimate. This deviation reflects uncertainty for the given observations, and it becomes more accurate if you enable the bootstrap flag and increase the n_critics value.

Returns:	predicted action-values
Return type:	numpy.ndarray
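To show what the ensemble standard deviation conveys, the sketch below computes a mean and a standard deviation across the per-critic estimates of each observation-action pair. How the library actually aggregates the critics is an assumption here; the function name `ensemble_mean_and_std` and the toy values are hypothetical.

```python
import numpy as np


def ensemble_mean_and_std(q_values):
    """q_values has shape (n_critics, batch_size); reduce over the critics."""
    return q_values.mean(axis=0), q_values.std(axis=0)


# two critics, batch of two observation-action pairs
q_values = np.array([[1.0, 5.0], [3.0, 5.0]])
values, stds = ensemble_mean_and_std(q_values)
```

The critics disagree on the first pair, so its deviation is non-zero, and they agree on the second, so its deviation is zero. A larger deviation suggests that the pair is less well covered by the training data.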

sample_action(x)¶

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.

Parameters:	x (numpy.ndarray) – observations.
Returns:	sampled actions.
Return type:	numpy.ndarray

save_model(fname)¶

Saves neural network parameters.

algo.save_model('model.pt')

Parameters:	fname (str) – destination file path.

save_policy(fname, as_onnx=False)¶

Save the greedy-policy computational graph as TorchScript or ONNX.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)

The artifacts saved with this method work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.