d3rlpy.ope.FQE

class d3rlpy.ope.FQE(algo, config, device=False, impl=None)[source]

Fitted Q Evaluation.

FQE is an off-policy evaluation method that approximates the Q-function \(Q_\theta (s, a)\) of a trained policy \(\pi_\phi(s)\) from a static dataset.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(Q_\theta(s_t, a_t) - r_{t+1} - \gamma Q_{\theta'}(s_{t+1}, \pi_\phi(s_{t+1})))^2]\]

The Q-function trained by FQE estimates evaluation metrics more accurately than the Q-function learned during policy training. A usage sketch is shown after the parameter list below.

References
  • Le et al., Batch Policy Learning under Constraints.

Parameters:
  • algo (d3rlpy.algos.base.AlgoBase) – Algorithm to evaluate.

  • config (d3rlpy.ope.FQEConfig) – FQE config.

  • device (bool, int or str) – Flag to use GPU, device ID or PyTorch device identifier.

  • impl (d3rlpy.metrics.ope.torch.FQEImpl) – Algorithm implementation.
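
A minimal end-to-end sketch (the pendulum dataset, the choice of CQL as the evaluated policy, the config-style constructors, the evaluator selection from d3rlpy.metrics, and all step counts and thresholds are illustrative assumptions, not requirements of FQE):

import d3rlpy

# toy continuous-control dataset bundled with d3rlpy
dataset, env = d3rlpy.datasets.get_pendulum()

# policy to be evaluated (any continuous-control algorithm works)
cql = d3rlpy.algos.CQLConfig().create(device="cpu")
cql.fit(dataset, n_steps=10000)

# fit FQE's Q-function against the trained policy on the same dataset
fqe = d3rlpy.ope.FQE(algo=cql, config=d3rlpy.ope.FQEConfig(), device="cpu")
fqe.fit(
    dataset,
    n_steps=10000,
    evaluators={
        "init_value": d3rlpy.metrics.InitialStateValueEstimationEvaluator(),
        "soft_opc": d3rlpy.metrics.SoftOPCEvaluator(return_threshold=-300),
    },
)

The estimated initial state values reported by the evaluator can then be compared across candidate policies without running online rollouts.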

Methods

build_with_dataset(dataset)

Instantiate implementation object with ReplayBuffer object.

Parameters:

dataset (ReplayBuffer) – dataset.

Return type:

None

build_with_env(env)

Instantiate implementation object with OpenAI Gym object.

Parameters:

env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.

Return type:

None

collect(env, buffer=None, explorer=None, deterministic=False, n_steps=1000000, show_progress=True)

Collects data via interaction with the environment.

If buffer is not given, a ReplayBuffer will be created internally.
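
A minimal sketch (env is assumed to be a Gym-like environment created elsewhere; the step count is a placeholder):

# collect 10k steps with the greedy policy; a ReplayBuffer is created
# internally because no buffer is passed in
buffer = algo.collect(env, deterministic=True, n_steps=10000)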

Parameters:
  • env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.

  • buffer (Optional[ReplayBufferBase]) – Replay buffer.

  • explorer (Optional[Explorer]) – Action explorer.

  • deterministic (bool) – Flag to collect data with the greedy policy.

  • n_steps (int) – Number of total steps to collect.

  • show_progress (bool) – Flag to show progress bar for iterations.

Returns:

Replay buffer with the collected data.

Return type:

ReplayBufferBase

copy_policy_from(algo)

Copies policy parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

copy_policy_optim_from(algo)

Copies policy optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_optim_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

copy_q_function_from(algo)

Copies Q-function parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

copy_q_function_optim_from(algo)

Copies Q-function optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_optim_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

create_impl(observation_shape, action_size)

Instantiate implementation objects with the dataset shapes.

This method will be used internally when fit method is called.

Parameters:
  • observation_shape – Observation shape.

  • action_size – Dimension of action-space.

Return type:

None

fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None, enable_ddp=False)

Trains with given dataset.

algo.fit(dataset, n_steps=1000000)
Parameters:
  • dataset (ReplayBufferBase) – ReplayBuffer object.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.

  • logging_strategy (LoggingStrategy) – Logging strategy to use.

  • logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, EvaluatorProtocol]]) – Dictionary of evaluators.

  • callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

  • epoch_callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.

  • enable_ddp (bool) – Flag to wrap models with DistributedDataParallel.

Returns:

List of result tuples (epoch, metrics) per epoch.

Return type:

List[Tuple[int, Dict[str, float]]]

fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, n_updates=1, update_start_step=0, random_steps=0, eval_env=None, eval_epsilon=0.0, save_interval=1, experiment_name=None, with_timestamp=True, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, callback=None)

Starts the training loop of online deep reinforcement learning.
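
A minimal sketch of the call (the environment name and step counts are placeholders; gym is assumed to be installed):

import gym

env = gym.make("Pendulum-v1")
eval_env = gym.make("Pendulum-v1")

# interact with env and update the model, evaluating on eval_env every epoch
algo.fit_online(
    env,
    eval_env=eval_env,
    n_steps=100000,
    n_steps_per_epoch=1000,
)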

Parameters:
  • env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.

  • buffer (Optional[ReplayBufferBase]) – Replay buffer.

  • explorer (Optional[Explorer]) – Action explorer.

  • n_steps (int) – Number of total steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch.

  • update_interval (int) – Number of steps per update.

  • n_updates (int) – Number of gradient steps at a time. The combination of update_interval and n_updates controls Update-To-Data (UTD) ratio.

  • update_start_step (int) – Steps before starting updates.

  • random_steps (int) – Steps for the initial random exploration.

  • eval_env (Optional[Union[Env[Any, Any], Env[Any, Any]]]) – Gym-like environment. If None, evaluation is skipped.

  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.

  • save_interval (int) – Number of epochs before saving models.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.

  • logging_strategy (LoggingStrategy) – Logging strategy to use.

  • logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of epochs.

Return type:

None

fitter(dataset, n_steps, n_steps_per_epoch=10000, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None, enable_ddp=False)

Iterate over epochs to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.

for epoch, metrics in algo.fitter(dataset, n_steps=100000):
    my_plot(metrics)
    algo.save_model(my_path)
Parameters:
  • dataset (ReplayBufferBase) – Offline dataset to train.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.

  • logging_strategy (LoggingStrategy) – Logging strategy to use.

  • logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, EvaluatorProtocol]]) – Dictionary of evaluators.

  • callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

  • epoch_callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.

  • enable_ddp (bool) – Flag to wrap models with DistributedDataParallel.

Returns:

Iterator yielding current epoch and metrics dict.

Return type:

Generator[Tuple[int, Dict[str, float]], None, None]

classmethod from_json(fname, device=False)

Construct algorithm from params.json file.

from d3rlpy.algos import CQL

cql = CQL.from_json("<path-to-json>", device='cuda:0')
Parameters:
  • fname (str) – Path to params.json.

  • device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns:

algorithm object.

Return type:

Self

get_action_type()[source]

Returns action type (continuous or discrete).

Returns:

action type.

Return type:

ActionSpace

inner_create_impl(observation_shape, action_size)[source]
Parameters:
  • observation_shape – Observation shape.

  • action_size – Dimension of action-space.

Return type:

None

load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')
Parameters:

fname (str) – source file path.

Return type:

None

predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters:

x (Union[ndarray[Any, dtype[Any]], Sequence[ndarray[Any, dtype[Any]]]]) – Observations.

Returns:

Greedy actions

Return type:

ndarray[Any, dtype[Any]]

predict_value(x, action)

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
Parameters:
  • x – Observations.

  • action – Actions.

Returns:

Predicted action-values

Return type:

ndarray[Any, dtype[Any]]

reset_optimizer_states()

Resets optimizer states.

This is especially useful when fine-tuning policies, since the optimizers are restored to their initial states.
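
A short sketch of the fine-tuning pattern (both dataset names are placeholders):

# pretrain, then fine-tune on new data with freshly initialized optimizers
algo.fit(pretrain_dataset, n_steps=100000)
algo.reset_optimizer_states()
algo.fit(finetune_dataset, n_steps=100000)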

Return type:

None

sample_action(x)

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.
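
A small example mirroring predict (shapes follow the same conventions):

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.sample_action(x)
# same shape conventions as predict()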

Parameters:

x (Union[ndarray[Any, dtype[Any]], Sequence[ndarray[Any, dtype[Any]]]]) – Observations.

Returns:

Sampled actions.

Return type:

ndarray[Any, dtype[Any]]

save(fname)

Saves paired data of neural network parameters and serialized config.

algo.save('model.d3')

# reconstruct everything
algo2 = d3rlpy.load_learnable("model.d3", device="cuda:0")
Parameters:

fname (str) – destination file path.

Return type:

None

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters:

fname (str) – destination file path.

Return type:

None

save_policy(fname)

Save the greedy-policy computational graph as TorchScript or ONNX.

The format will be automatically detected by the file name.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx')

The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.

Parameters:

fname (str) – Destination file path.

Return type:

None

set_grad_step(grad_step)

Set total gradient step counter.

This method can be used to resume training from the middle with an arbitrary gradient step counter, which affects periodic functions such as the target update.
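
A short sketch of resuming training from a checkpoint (the file name and step count are placeholders):

# restore weights, then realign the step counter so that periodic functions
# such as target updates stay on schedule
algo.load_model("model.pt")
algo.set_grad_step(500000)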

Parameters:

grad_step (int) – total gradient step counter.

Return type:

None

update(batch)

Update parameters with a mini-batch of data.
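
A rough sketch of a hand-rolled training loop (dataset is assumed to be a d3rlpy ReplayBuffer, and sample_transition_batch is assumed to be available on it):

# build networks from the dataset shapes before calling update manually
algo.build_with_dataset(dataset)

for step in range(1000):
    batch = dataset.sample_transition_batch(algo.batch_size)
    metrics = algo.update(batch)  # e.g. {"loss": ...}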

Parameters:

batch (TransitionMiniBatch) – Mini-batch data.

Returns:

Dictionary of metrics.

Return type:

Dict[str, float]

Attributes

action_scaler

Preprocessing action scaler.

Returns:

preprocessing action scaler.

Return type:

Optional[ActionScaler]

action_size

Action size.

Returns:

action size.

Return type:

Optional[int]

algo

Algorithm to evaluate.

batch_size

Batch size to train.

Returns:

batch size.

Return type:

int

config

Config.

Returns:

config.

Return type:

LearnableConfig

gamma

Discount factor.

Returns:

discount factor.

Return type:

float

grad_step

Total gradient step counter.

This value will keep counting after fit and fit_online methods finish.

Returns:

total gradient step counter.

Return type:

int

impl

Implementation object.

Returns:

implementation object.

Return type:

Optional[ImplBase]

need_returns_to_go
observation_scaler

Preprocessing observation scaler.

Returns:

preprocessing observation scaler.

Return type:

Optional[ObservationScaler]

observation_shape

Observation shape.

Returns:

observation shape.

Return type:

Optional[Sequence[int]]

reward_scaler

Preprocessing reward scaler.

Returns:

preprocessing reward scaler.

Return type:

Optional[RewardScaler]