d3rlpy.algos.DQN

class d3rlpy.algos.DQN(*, learning_rate=6.25e-05, optim_factory=<d3rlpy.optimizers.AdamFactory object>, encoder_factory='default', q_func_factory='mean', batch_size=32, n_frames=1, n_steps=1, gamma=0.99, n_critics=1, bootstrap=False, share_encoder=False, target_update_interval=8000, use_gpu=False, scaler=None, augmentation=None, dynamics=None, impl=None, **kwargs)[source]

Deep Q-Network algorithm.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(r_{t+1} + \gamma \max_a Q_{\theta'}(s_{t+1}, a) - Q_\theta(s_t, a_t))^2]\]

where \(\theta'\) is the target network parameter. The target network parameter is synchronized every target_update_interval iterations.

References

  • Mnih et al., Human-level control through deep reinforcement learning. https://www.nature.com/articles/nature14236

Parameters:
  • learning_rate (float) – learning rate.
  • optim_factory (d3rlpy.optimizers.OptimizerFactory or str) – optimizer factory.
  • encoder_factory (d3rlpy.encoders.EncoderFactory or str) – encoder factory.
  • q_func_factory (d3rlpy.q_functions.QFunctionFactory or str) – Q function factory.
  • batch_size (int) – mini-batch size.
  • n_frames (int) – the number of frames to stack for image observation.
  • n_steps (int) – N-step TD calculation.
  • gamma (float) – discount factor.
  • n_critics (int) – the number of Q functions for ensemble.
  • bootstrap (bool) – flag to bootstrap Q functions.
  • share_encoder (bool) – flag to share encoder network.
  • target_update_interval (int) – interval to update the target network.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
  • scaler (d3rlpy.preprocessing.Scaler or str) – preprocessor. The available options are [‘pixel’, ‘min_max’, ‘standard’]
  • augmentation (d3rlpy.augmentation.AugmentationPipeline or list(str)) – augmentation pipeline.
  • dynamics (d3rlpy.dynamics.base.DynamicsBase) – dynamics model for data augmentation.
  • impl (d3rlpy.algos.torch.dqn_impl.DQNImpl) – algorithm implementation.
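
A minimal construction sketch; the hyperparameter values below are illustrative choices, not defaults or recommendations:

from d3rlpy.algos import DQN

# configure the algorithm; any omitted argument keeps its default
dqn = DQN(learning_rate=2.5e-05,
          batch_size=32,
          n_frames=4,                    # stack frames for image observations
          target_update_interval=8000,
          scaler='pixel',                # normalize pixel observations
          use_gpu=True)
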
learning_rate

learning rate.

Type:float
optim_factory

optimizer factory.

Type:d3rlpy.optimizers.OptimizerFactory
encoder_factory

encoder factory.

Type:d3rlpy.encoders.EncoderFactory
q_func_factory

Q function factory.

Type:d3rlpy.q_functions.QFunctionFactory
batch_size

mini-batch size.

Type:int
n_frames

the number of frames to stack for image observation.

Type:int
n_steps

N-step TD calculation.

Type:int
gamma

discount factor.

Type:float
n_critics

the number of Q functions for ensemble.

Type:int
bootstrap

flag to bootstrap Q functions.

Type:bool
share_encoder

flag to share encoder network.

Type:bool
target_update_interval

interval to update the target network.

Type:int
use_gpu

GPU device.

Type:d3rlpy.gpu.Device
scaler

preprocessor.

Type:d3rlpy.preprocessing.Scaler
augmentation

augmentation pipeline.

Type:d3rlpy.augmentation.AugmentationPipeline
dynamics

dynamics model.

Type:d3rlpy.dynamics.base.DynamicsBase
impl

algorithm implementation.

Type:d3rlpy.algos.torch.dqn_impl.DQNImpl
eval_results_

evaluation results.

Type:dict

Methods

build_with_dataset(dataset)

Instantiates the implementation object with an MDPDataset object.

Parameters:dataset (d3rlpy.dataset.MDPDataset) – dataset.
build_with_env(env)

Instantiates the implementation object with an OpenAI Gym environment.

Parameters:env (gym.Env) – gym-like environment.
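
A small sketch of building the networks from an environment before loading saved parameters; the environment name and file path are illustrative:

import gym
from d3rlpy.algos import DQN

env = gym.make('CartPole-v0')

dqn = DQN()
# instantiate networks from the environment's observation and action spaces
dqn.build_with_env(env)
# parameters can now be loaded without calling fit
dqn.load_model('model.pt')
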
create_impl(observation_shape, action_size)[source]

Instantiates implementation objects with the dataset shapes.

This method is used internally when the fit method is called.

Parameters:
  • observation_shape (tuple) – observation shape.
  • action_size (int) – dimension of action-space.
fit(episodes, n_epochs=1000, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True, eval_episodes=None, save_interval=1, scorers=None, shuffle=True)

Trains with the given dataset.

algo.fit(episodes)
Parameters:
  • episodes (list(d3rlpy.dataset.Episode)) – list of episodes to train.
  • n_epochs (int) – the number of epochs to train.
  • save_metrics (bool) – flag to record metrics in files. If False, the log directory is not created and the model parameters are not saved during training.
  • experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
  • with_timestamp (bool) – flag to add timestamp string to the last of directory name.
  • logdir (str) – root directory name to save logs.
  • verbose (bool) – flag to show logged information on stdout.
  • show_progress (bool) – flag to show progress bar for iterations.
  • tensorboard (bool) – flag to save logged information in TensorBoard (in addition to the CSV data).
  • eval_episodes (list(d3rlpy.dataset.Episode)) – list of episodes to test.
  • save_interval (int) – interval to save parameters.
  • scorers (list(callable)) – list of scorer functions used with eval_episodes.
  • shuffle (bool) – flag to shuffle transitions on each epoch.
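
A minimal offline-training sketch, assuming the cartpole dataset helper d3rlpy.datasets.get_cartpole is available; the epoch count and experiment name are illustrative:

from d3rlpy.datasets import get_cartpole
from d3rlpy.algos import DQN

# MDPDataset and the environment it was collected from
dataset, env = get_cartpole()

dqn = DQN()
dqn.fit(dataset.episodes,
        n_epochs=10,
        experiment_name='dqn_cartpole')
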
fit_online(env, buffer, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, eval_env=None, eval_epsilon=0.0, save_metrics=True, experiment_name=None, with_timestamp=True, logdir='d3rlpy_logs', verbose=True, show_progress=True, tensorboard=True)

Starts the training loop of online deep reinforcement learning.

This method is a convenient alias to d3rlpy.online.iterators.train.

Parameters:
  • env (gym.Env) – gym-like environment.
  • buffer (d3rlpy.online.buffers.Buffer) – replay buffer.
  • explorer (d3rlpy.online.explorers.Explorer) – action explorer.
  • n_steps (int) – the number of total steps to train.
  • n_steps_per_epoch (int) – the number of steps per epoch.
  • update_interval (int) – the number of steps per update.
  • update_start_step (int) – the steps before starting updates.
  • eval_env (gym.Env) – gym-like environment. If None, evaluation is skipped.
  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
  • save_metrics (bool) – flag to record metrics. If False, the log directory is not created and the model parameters are not saved.
  • experiment_name (str) – experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
  • with_timestamp (bool) – flag to add timestamp string to the last of directory name.
  • logdir (str) – root directory name to save logs.
  • verbose (bool) – flag to show logged information on stdout.
  • show_progress (bool) – flag to show progress bar for iterations.
  • tensorboard (bool) – flag to save logged information in TensorBoard (in addition to the CSV data).
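
A minimal online-training sketch, assuming d3rlpy.online.buffers.ReplayBuffer and d3rlpy.online.explorers.LinearDecayEpsilonGreedy take the constructor arguments shown; the buffer size, epsilon schedule, and step counts are illustrative:

import gym
from d3rlpy.algos import DQN
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import LinearDecayEpsilonGreedy

env = gym.make('CartPole-v0')
eval_env = gym.make('CartPole-v0')

dqn = DQN()

# experience replay buffer tied to the training environment
buffer = ReplayBuffer(maxlen=100000, env=env)

# epsilon-greedy exploration that decays over the first steps
explorer = LinearDecayEpsilonGreedy(start_epsilon=1.0,
                                    end_epsilon=0.1,
                                    duration=10000)

dqn.fit_online(env,
               buffer,
               explorer=explorer,
               n_steps=100000,
               eval_env=eval_env)
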
classmethod from_json(fname, use_gpu=False)

Returns an algorithm configured with a JSON file.

The JSON file should be the one saved during fitting.

from d3rlpy.algos import Algo

# create algorithm with saved configuration
algo = Algo.from_json('d3rlpy_logs/<path-to-json>/params.json')

# ready to load
algo.load_model('d3rlpy_logs/<path-to-model>/model_100.pt')

# ready to predict
algo.predict(...)
Parameters:
  • fname (str) – file path to params.json.
  • use_gpu (bool, int or d3rlpy.gpu.Device) – flag to use GPU, device ID or device.
Returns:

algorithm.

Return type:

d3rlpy.base.LearnableBase

get_params(deep=True)

Returns all attributes.

This method returns all attributes, including ones defined in subclasses. Some scikit-learn utilities use this method.

params = algo.get_params(deep=True)

# the returned values can be used to instantiate the new object.
algo2 = AlgoBase(**params)
Parameters:deep (bool) – flag to deeply copy objects such as impl.
Returns:attribute values in dictionary.
Return type:dict
load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')
Parameters:fname (str) – source file path.
predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters:x (numpy.ndarray) – observations
Returns:greedy actions
Return type:numpy.ndarray
predict_value(x, action, with_std=False)

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)

values, stds = algo.predict_value(x, actions, with_std=True)
# stds.shape  == (100,)
Parameters:
  • x (numpy.ndarray) – observations
  • action (numpy.ndarray) – actions
  • with_std (bool) – flag to return the standard deviation of the ensemble estimation. This deviation reflects the uncertainty for the given observations. This uncertainty will be more accurate if you enable the bootstrap flag and increase the n_critics value.
Returns:

predicted action-values

Return type:

numpy.ndarray

sample_action(x)

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.

Parameters:x (numpy.ndarray) – observations.
Returns:sampled actions.
Return type:numpy.ndarray
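
A small sketch mirroring the predict example above:

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.sample_action(x)
# for a deterministic policy this matches algo.predict(x), as noted above
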
save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters:fname (str) – destination file path.
save_policy(fname, as_onnx=False)

Save the greedy-policy computational graph as TorchScript or ONNX.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx', as_onnx=True)

The artifacts saved with this method will work without d3rlpy. This method is especially useful to deploy the learned policy to production environments or embedded systems.

Parameters:
  • fname (str) – destination file path.
  • as_onnx (bool) – flag to save as ONNX format.
set_params(**params)

Sets the given arguments to the attributes if they exist.

This method sets the given values to the attributes, including ones defined in subclasses. Values that don’t exist as attributes are ignored. Some scikit-learn utilities use this method.

algo.set_params(batch_size=100)
Parameters:**params – arbitrary inputs to set as attributes.
Returns:itself.
Return type:d3rlpy.algos.base.AlgoBase
update(epoch, total_step, batch)[source]

Updates parameters with a mini-batch of data.

Parameters:
  • epoch (int) – the current number of epochs.
  • total_step (int) – the current total number of iterations.
  • batch (d3rlpy.dataset.TransitionMiniBatch) – mini-batch data.
Returns:

loss values.

Return type:

list
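
A sketch of calling update directly (fit does this internally); it assumes d3rlpy.dataset.TransitionMiniBatch can be constructed from a list of transitions and that the networks have been built beforehand:

from d3rlpy.dataset import TransitionMiniBatch

# build networks from the dataset shapes before updating manually
algo.build_with_dataset(dataset)

# take a handful of transitions from the first episode as a mini-batch
transitions = dataset.episodes[0].transitions
batch = TransitionMiniBatch(transitions[:32])

loss = algo.update(epoch=0, total_step=0, batch=batch)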