d3rlpy.ope.FQE¶
- class d3rlpy.ope.FQE(algo, config, device=False, impl=None)[source]¶
Fitted Q Evaluation.
FQE is an off-policy evaluation method that approximates a Q function \(Q_\theta (s, a)\) with the trained policy \(\pi_\phi(s)\).
\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(Q_\theta(s_t, a_t) - r_{t+1} - \gamma Q_{\theta'}(s_{t+1}, \pi_\phi(s_{t+1})))^2]\]
The trained Q function in FQE will estimate evaluation metrics more accurately than the Q function learned during training.
References
Le et al., Batch Policy Learning under Constraints.
- Parameters:
algo (QLearningAlgoBase) – Algorithm to evaluate.
config (FQEConfig) – FQE config.
device (bool, int, or str) – Device option.
impl – FQE implementation object.
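A minimal usage sketch, assuming a trained d3rlpy algorithm cql and an offline dataset already exist; the evaluator classes below are part of d3rlpy.metrics, but the step count and threshold are illustrative:
import d3rlpy

# wrap the policy to evaluate with FQE (cql and dataset are assumed to exist)
fqe = d3rlpy.ope.FQE(algo=cql, config=d3rlpy.ope.FQEConfig())

# train the FQE Q-function and report OPE metrics per epoch
fqe.fit(
    dataset,
    n_steps=100000,
    evaluators={
        "init_value": d3rlpy.metrics.InitialStateValueEstimationEvaluator(),
        "soft_opc": d3rlpy.metrics.SoftOPCEvaluator(return_threshold=180),
    },
)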
Methods
- build_with_dataset(dataset)¶
Instantiate implementation object with ReplayBuffer object.
- Parameters:
dataset (ReplayBuffer) – dataset.
- Return type:
None
- build_with_env(env)¶
Instantiate implementation object with OpenAI Gym object.
- Parameters:
env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.
- Return type:
None
- collect(env, buffer=None, explorer=None, deterministic=False, n_steps=1000000, show_progress=True)¶
Collects data via interaction with environment.
If buffer is not given, a ReplayBuffer will be created internally.
- Parameters:
env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.
buffer (Optional[ReplayBufferBase]) – Replay buffer.
explorer (Optional[Explorer]) – Action explorer.
deterministic (bool) – Flag to collect data with the greedy policy.
n_steps (int) – Number of total steps to collect data.
show_progress (bool) – Flag to show progress bar for iterations.
- Returns:
Replay buffer with the collected data.
- Return type:
ReplayBufferBase
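A short sketch of collecting experience with the current greedy policy; env is assumed to be a Gym-like environment and the step count is illustrative:
# collect 10000 steps with the greedy policy; the buffer is created internally
buffer = algo.collect(env, n_steps=10000, deterministic=True)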
- copy_policy_from(algo)¶
Copies policy parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
- Parameters:
algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.
- Return type:
None
- copy_policy_optim_from(algo)¶
Copies policy optimizer states from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_optim_from(cql)
- Parameters:
algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.
- Return type:
None
- copy_q_function_from(algo)¶
Copies Q-function parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
- Parameters:
algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.
- Return type:
None
- copy_q_function_optim_from(algo)¶
Copies Q-function optimizer states from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_optim_from(cql)
- Parameters:
algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.
- Return type:
None
- create_impl(observation_shape, action_size)¶
Instantiate implementation objects with the dataset shapes.
This method will be used internally when the fit method is called.
- Parameters:
observation_shape (Shape) – Observation shape.
action_size (int) – Dimension of action-space.
- Return type:
None
- fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None, enable_ddp=False)¶
Trains with given dataset.
algo.fit(episodes, n_steps=1000000)
- Parameters:
dataset (ReplayBufferBase) – ReplayBuffer object.
n_steps (int) – Number of steps to train.
n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.
experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – Flag to add a timestamp string to the end of the directory name.
logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.
logging_strategy (LoggingStrategy) – Logging strategy to use.
logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.
show_progress (bool) – Flag to show progress bar for iterations.
save_interval (int) – Interval to save parameters.
evaluators (Optional[Dict[str, EvaluatorProtocol]]) – Dictionary of evaluators keyed by metric name.
callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.
epoch_callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.
enable_ddp (bool) – Flag to wrap models with DistributedDataParallel.
- Returns:
List of result tuples (epoch, metrics) per epoch.
- Return type:
List[Tuple[int, Dict[str, float]]]
- fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, n_updates=1, update_start_step=0, random_steps=0, eval_env=None, eval_epsilon=0.0, save_interval=1, experiment_name=None, with_timestamp=True, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, callback=None)¶
Start training loop of online deep reinforcement learning.
- Parameters:
env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.
buffer (Optional[ReplayBufferBase]) – Replay buffer.
explorer (Optional[Explorer]) – Action explorer.
n_steps (int) – Number of total steps to train.
n_steps_per_epoch (int) – Number of steps per epoch.
update_interval (int) – Number of steps per update.
n_updates (int) – Number of gradient steps at a time. The combination of update_interval and n_updates controls the Update-To-Data (UTD) ratio.
update_start_step (int) – Steps before starting updates.
random_steps (int) – Steps for the initial random exploration.
eval_env (Optional[Union[Env[Any, Any], Env[Any, Any]]]) – Gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_interval (int) – Number of epochs before saving models.
experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – Flag to add a timestamp string to the end of the directory name.
logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.
logging_strategy (LoggingStrategy) – Logging strategy to use.
logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.
show_progress (bool) – Flag to show progress bar for iterations.
callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of epochs.
- Return type:
None
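A hedged sketch of online training; create_fifo_replay_buffer and ConstantEpsilonGreedy exist in d3rlpy, but the buffer size, epsilon, and step counts below are illustrative:
import d3rlpy

# FIFO replay buffer and epsilon-greedy exploration (env is a Gym-like environment)
buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)
explorer = d3rlpy.algos.ConstantEpsilonGreedy(0.1)

# interact with the environment and update the model while collecting data
algo.fit_online(
    env,
    buffer=buffer,
    explorer=explorer,
    n_steps=100000,
    n_steps_per_epoch=1000,
)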
- fitter(dataset, n_steps, n_steps_per_epoch=10000, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None, enable_ddp=False)¶
Iterate over epochs to train with the given dataset. At each iteration, algorithm methods and properties can be changed or queried.
for epoch, metrics in algo.fitter(episodes):
    my_plot(metrics)
    algo.save_model(my_path)
- Parameters:
dataset (ReplayBufferBase) – Offline dataset to train.
n_steps (int) – Number of steps to train.
n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.
experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – Flag to add a timestamp string to the end of the directory name.
logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.
logging_strategy (LoggingStrategy) – Logging strategy to use.
logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.
show_progress (bool) – Flag to show progress bar for iterations.
save_interval (int) – Interval to save parameters.
evaluators (Optional[Dict[str, EvaluatorProtocol]]) – Dictionary of evaluators keyed by metric name.
callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.
epoch_callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.
enable_ddp (bool) – Flag to wrap models with DistributedDataParallel.
- Returns:
Iterator yielding current epoch and metrics dict.
- Return type:
Generator[Tuple[int, Dict[str, float]], None, None]
- classmethod from_json(fname, device=False)¶
Construct algorithm from params.json file.
from d3rlpy.algos import CQL

cql = CQL.from_json("<path-to-json>", device='cuda:0')
- Parameters:
fname (str) – Path to params.json file.
device (bool, int, or str) – Device option.
- Returns:
algorithm object.
- Return type:
Self
- get_action_type()[source]¶
Returns action type (continuous or discrete).
- Returns:
action type.
- Return type:
ActionSpace
- load_model(fname)¶
Load neural network parameters.
algo.load_model('model.pt')
- Parameters:
fname (str) – source file path.
- Return type:
None
- predict(x)¶
Returns greedy actions.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))
actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
- predict_value(x, action)¶
Returns predicted action-values.
# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
- reset_optimizer_states()¶
Resets optimizer states.
This is especially useful when fine-tuning policies with freshly initialized optimizer states.
- Return type:
None
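A brief sketch of the fine-tuning pattern this supports; new_dataset is a placeholder name and the step count is illustrative:
# continue training from pretrained weights, but with fresh optimizer states
algo.reset_optimizer_states()
algo.fit(new_dataset, n_steps=10000)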
- sample_action(x)¶
Returns sampled actions.
The sampled actions are identical to the output of predict method if the policy is deterministic.
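A small sketch mirroring the predict example above:
# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# stochastic actions; identical to predict for deterministic policies
actions = algo.sample_action(x)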
- save(fname)¶
Saves paired data of neural network parameters and serialized config.
algo.save('model.d3')

# reconstruct everything
algo2 = d3rlpy.load_learnable("model.d3", device="cuda:0")
- Parameters:
fname (str) – destination file path.
- Return type:
None
- save_model(fname)¶
Saves neural network parameters.
algo.save_model('model.pt')
- Parameters:
fname (str) – destination file path.
- Return type:
None
- save_policy(fname)¶
Save the greedy-policy computational graph as TorchScript or ONNX.
The format will be automatically detected by the file name.
# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx')
The artifacts saved with this method will work without d3rlpy. This method is especially useful to deploy the learned policy to production environments or embedded systems.
See also
https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html (for Python).
https://pytorch.org/tutorials/advanced/cpp_export.html (for C++).
https://onnx.ai (for ONNX)
- Parameters:
fname (str) – Destination file path.
- Return type:
None
- set_grad_step(grad_step)¶
Set total gradient step counter.
This method can be used to restart training from the middle with an arbitrary gradient step counter, which affects periodic functions such as the target update.
- Parameters:
grad_step (int) – total gradient step counter.
- Return type:
None
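A short sketch of resuming training with a restored step counter; the checkpoint path and step count are illustrative:
# restore weights and tell the algorithm how many gradient steps already happened
algo.load_model('checkpoint.pt')
algo.set_grad_step(500000)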
- update(batch)¶
Update parameters with mini-batch of data.
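A hedged sketch of a manual training loop; sample_transition_batch is part of d3rlpy's ReplayBuffer API, though the batch size here is arbitrary:
# one gradient update on a sampled mini-batch
batch = buffer.sample_transition_batch(batch_size=256)
metrics = algo.update(batch)  # dictionary of metric values, e.g. losses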
Attributes
- action_scaler¶
Preprocessing action scaler.
- Returns:
preprocessing action scaler.
- Return type:
Optional[ActionScaler]
- algo¶
Algorithm to evaluate.
- config¶
Config.
- Returns:
config.
- Return type:
LearnableConfig
- grad_step¶
Total gradient step counter.
This value will keep counting after fit and fit_online methods finish.
- Returns:
total gradient step counter.
- Return type:
int
- impl¶
Implementation object.
- Returns:
implementation object.
- Return type:
Optional[ImplBase]
- need_returns_to_go¶
- observation_scaler¶
Preprocessing observation scaler.
- Returns:
preprocessing observation scaler.
- Return type:
Optional[ObservationScaler]
- observation_shape¶
Observation shape.
- Returns:
observation shape.
- Return type:
Optional[Sequence[int]]
- reward_scaler¶
Preprocessing reward scaler.
- Returns:
preprocessing reward scaler.
- Return type:
Optional[RewardScaler]