d3rlpy.ope.FQE¶
- class d3rlpy.ope.FQE(algo, config, device=False, impl=None)[source]¶
Fitted Q Evaluation.
FQE is an off-policy evaluation method that approximates a Q function \(Q_\theta (s, a)\) with the trained policy \(\pi_\phi(s)\).
\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(Q_\theta(s_t, a_t) - r_{t+1} - \gamma Q_{\theta'}(s_{t+1}, \pi_\phi(s_{t+1})))^2]\]
The trained Q function in FQE will estimate evaluation metrics more accurately than the Q function learned during training.
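As a rough illustration of the loss above, the following sketch evaluates the mean squared TD error on a tiny tabular problem. The dictionary-based Q tables, the `policy` mapping, the `fqe_loss` helper, and the sample transitions are all made up for illustration; none of them are part of d3rlpy.

```python
# Toy, tabular illustration of the FQE objective (not d3rlpy code).
GAMMA = 0.9

# Hypothetical fixed policy pi_phi: state -> action.
policy = {0: 1, 1: 0}

# Hypothetical Q tables: q plays the role of Q_theta, q_target of Q_theta'.
q = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}
q_target = dict(q)

# Transitions (s_t, a_t, r_{t+1}, s_{t+1}) sampled from the dataset D.
batch = [(0, 1, 1.0, 1), (1, 0, 0.5, 0)]


def fqe_loss(q, q_target, policy, batch, gamma):
    """Mean squared TD error with next actions chosen by the evaluated policy."""
    total = 0.0
    for s, a, r, s_next in batch:
        # The next action comes from the policy under evaluation, not from a max.
        target = r + gamma * q_target[(s_next, policy[s_next])]
        total += (q[(s, a)] - target) ** 2
    return total / len(batch)


loss = fqe_loss(q, q_target, policy, batch, GAMMA)
```

The point to notice is that the bootstrap target uses the action proposed by the evaluated policy, which is what distinguishes this objective from a greedy Q-learning target.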
References
- Parameters
Methods
- build_with_dataset(dataset)¶
Instantiate implementation object with ReplayBuffer object.
- Parameters
dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – dataset.
- Return type
- build_with_env(env)¶
Instantiate implementation object with OpenAI Gym object.
- Parameters
env (Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]) – gym-like environment.
- Return type
- collect(env, buffer=None, explorer=None, deterministic=False, n_steps=1000000, show_progress=True)¶
Collects data via interaction with environment.
If buffer is not given, ReplayBuffer will be internally created.
- Parameters
env (Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]) – Gym-like environment.
buffer (Optional[d3rlpy.dataset.replay_buffer.ReplayBuffer]) – Replay buffer.
explorer (Optional[d3rlpy.algos.qlearning.explorers.Explorer]) – Action explorer.
deterministic (bool) – Flag to collect data with the greedy policy.
n_steps (int) – Number of total steps to collect.
show_progress (bool) – Flag to show progress bar for iterations.
- Returns
Replay buffer with the collected data.
- Return type
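The collection loop described above can be pictured with the following sketch. It is not the d3rlpy implementation: `ToyEnv`, the list used as a buffer, and the standalone `collect` function are hypothetical stand-ins, and the environment follows a Gymnasium-style `step` signature by assumption.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment with a Gymnasium-style step signature."""

    def reset(self):
        self.pos = 0
        return self.pos, {}

    def step(self, action):
        self.pos += action
        terminated = abs(self.pos) >= 3
        return self.pos, float(self.pos), terminated, False, {}


def collect(env, policy, n_steps, buffer=None):
    """Run the policy for n_steps and append each transition to the buffer."""
    buffer = [] if buffer is None else buffer  # created internally if omitted
    obs, _ = env.reset()
    for _ in range(n_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs, terminated))
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()  # start a new episode and keep collecting
    return buffer


random.seed(0)
transitions = collect(ToyEnv(), lambda obs: random.choice([-1, 1]), n_steps=20)
```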
- copy_policy_from(algo)¶
Copies policy parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
- Parameters
algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.
- Return type
- copy_policy_optim_from(algo)¶
Copies policy optimizer states from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_optim_from(cql)
- Parameters
algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.
- Return type
- copy_q_function_from(algo)¶
Copies Q-function parameters from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
- Parameters
algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.
- Return type
- copy_q_function_optim_from(algo)¶
Copies Q-function optimizer states from the given algorithm.
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_optim_from(cql)
- Parameters
algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.
- Return type
- create_impl(observation_shape, action_size)¶
Instantiate implementation objects with the dataset shapes.
This method is used internally when the fit method is called.
- fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None)¶
Trains with given dataset.
algo.fit(episodes, n_steps=1000000)
- Parameters
dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – ReplayBuffer object.
n_steps (int) – Number of steps to train.
n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.
experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.
logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.
show_progress (bool) – Flag to show progress bar for iterations.
save_interval (int) – Interval to save parameters.
evaluators (Optional[Dict[str, d3rlpy.metrics.evaluators.EvaluatorProtocol]]) – Dictionary mapping names to evaluators.
callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.
epoch_callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.
- Returns
List of result tuples (epoch, metrics) per epoch.
- Return type
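The relationship between `n_steps`, `n_steps_per_epoch`, and the two callbacks can be sketched as follows. This is a bookkeeping-only mock, not d3rlpy code: `run_schedule` is a hypothetical helper, no learning takes place, `None` is passed where the real callbacks would receive the algorithm object, and the integer-division rule for the number of epochs is an assumption.

```python
def run_schedule(n_steps, n_steps_per_epoch, callback=None, epoch_callback=None):
    """Mimic the step/epoch bookkeeping; no learning happens here."""
    n_epochs = n_steps // n_steps_per_epoch  # assumed rounding rule
    total_step = 0
    results = []
    for epoch in range(1, n_epochs + 1):
        for _ in range(n_steps_per_epoch):
            total_step += 1
            if callback:
                callback(None, epoch, total_step)  # called every step
        if epoch_callback:
            epoch_callback(None, epoch, total_step)  # called once per epoch
        results.append((epoch, {"total_step": total_step}))
    return results


step_calls, epoch_calls = [], []
history = run_schedule(
    n_steps=30,
    n_steps_per_epoch=10,
    callback=lambda algo, epoch, step: step_calls.append(step),
    epoch_callback=lambda algo, epoch, step: epoch_calls.append(epoch),
)
```

In this mock, `history` has the same shape as the documented return value: one `(epoch, metrics)` tuple per epoch.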
- fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, random_steps=0, eval_env=None, eval_epsilon=0.0, save_interval=1, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, callback=None)¶
Start training loop of online deep reinforcement learning.
- Parameters
env (Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]) – Gym-like environment.
buffer (Optional[d3rlpy.dataset.replay_buffer.ReplayBuffer]) – Replay buffer.
explorer (Optional[d3rlpy.algos.qlearning.explorers.Explorer]) – Action explorer.
n_steps (int) – Number of total steps to train.
n_steps_per_epoch (int) – Number of steps per epoch.
update_interval (int) – Number of steps per update.
update_start_step (int) – Steps before starting updates.
random_steps (int) – Steps for the initial random exploration.
eval_env (Optional[Union[gym.core.Env[Any, Any], gymnasium.core.Env[Any, Any]]]) – Gym-like environment. If None, evaluation is skipped.
eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.
save_interval (int) – Number of epochs before saving models.
experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.
with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.
logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.
show_progress (bool) – Flag to show progress bar for iterations.
callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of epochs.
- Return type
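One way to read `random_steps`, `update_start_step`, and `update_interval` together is sketched below. The `online_schedule` helper is hypothetical, and the exact boundary conditions (inclusive or exclusive comparisons) are assumptions rather than a statement of what the library does.

```python
def online_schedule(n_steps, update_interval, update_start_step, random_steps):
    """Label each environment step under an assumed reading of the parameters."""
    log = []
    for step in range(1, n_steps + 1):
        # Early steps use random actions; later steps query the policy.
        action_source = "random" if step <= random_steps else "policy"
        # Updates begin after the warm-up and then fire every update_interval steps.
        do_update = step > update_start_step and step % update_interval == 0
        log.append((step, action_source, do_update))
    return log


log = online_schedule(n_steps=10, update_interval=2, update_start_step=4, random_steps=3)
update_steps = [step for step, _, updated in log if updated]
random_action_steps = [step for step, source, _ in log if source == "random"]
```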
- fitter(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None)¶
Iterates over epochs to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.
for epoch, metrics in algo.fitter(episodes, n_steps=1000000):
    my_plot(metrics)
    algo.save_model(my_path)
- Parameters
dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – Offline dataset to train.
n_steps (int) – Number of steps to train.
n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.
experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.
with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.
logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.
show_progress (bool) – Flag to show progress bar for iterations.
save_interval (int) – Interval to save parameters.
evaluators (Optional[Dict[str, d3rlpy.metrics.evaluators.EvaluatorProtocol]]) – Dictionary mapping names to evaluators.
callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.
epoch_callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.
- Returns
Iterator yielding current epoch and metrics dict.
- Return type
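Because the iterator yields once per epoch, the caller keeps control between epochs and can, for example, stop early. The sketch below uses a hypothetical `toy_fitter` generator with a fake, geometrically decaying loss in place of real training; it only illustrates the iteration pattern.

```python
def toy_fitter(n_steps, n_steps_per_epoch):
    """Hypothetical generator yielding (epoch, metrics) after each epoch."""
    loss = 1.0
    for epoch in range(1, n_steps // n_steps_per_epoch + 1):
        for _ in range(n_steps_per_epoch):
            loss *= 0.9  # stand-in for one gradient update
        yield epoch, {"loss": loss}


seen = []
for epoch, metrics in toy_fitter(n_steps=50, n_steps_per_epoch=10):
    seen.append(epoch)
    if metrics["loss"] < 0.1:  # caller-side early stopping
        break
```

Here the loop exits before all five epochs run, which a plain blocking call would not allow.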
- classmethod from_json(fname, device=False)¶
Construct algorithm from params.json file.
from d3rlpy.algos import CQL

cql = CQL.from_json("<path-to-json>", device='cuda:0')
- Parameters
- Returns
algorithm object.
- Return type
typing_extensions.Self
- get_action_type()[source]¶
Returns action type (continuous or discrete).
- Returns
action type.
- Return type
d3rlpy.constants.ActionSpace
- load_model(fname)¶
Load neural network parameters.
algo.load_model('model.pt')
- predict(x)¶
Returns greedy actions.
import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
- Parameters
x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations
- Returns
Greedy actions
- Return type
numpy.ndarray[Any, numpy.dtype[Any]]
- predict_value(x, action)¶
Returns predicted action-values.
import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
- Parameters
x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations
action (numpy.ndarray[Any, numpy.dtype[Any]]) – Actions
- Returns
Predicted action-values
- Return type
numpy.ndarray[Any, numpy.dtype[Any]]
- reset_optimizer_states()¶
Resets optimizer states.
This is especially useful when fine-tuning policies with freshly initialized optimizer states.
- Return type
- sample_action(x)¶
Returns sampled actions.
The sampled actions are identical to the output of the predict method if the policy is deterministic.
- Parameters
x (Union[numpy.ndarray[Any, numpy.dtype[Any]], Sequence[numpy.ndarray[Any, numpy.dtype[Any]]]]) – Observations.
- Returns
Sampled actions.
- Return type
numpy.ndarray[Any, numpy.dtype[Any]]
- save(fname)¶
Saves paired data of neural network parameters and serialized config.
algo.save('model.d3')

# reconstruct everything
algo2 = d3rlpy.load_learnable("model.d3", device="cuda:0")
- save_model(fname)¶
Saves neural network parameters.
algo.save_model('model.pt')
- save_policy(fname)¶
Save the greedy-policy computational graph as TorchScript or ONNX.
The format will be automatically detected by the file name.
# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx')
The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.
See also
https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html (for Python).
https://pytorch.org/tutorials/advanced/cpp_export.html (for C++).
https://onnx.ai (for ONNX)
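Detecting the export format from the file name can be pictured as a simple extension check. The `detect_format` helper below is hypothetical and only mirrors the two extensions shown in the example above; how the library handles other extensions is not covered here, so raising an error is an assumption.

```python
from pathlib import Path


def detect_format(fname):
    """Hypothetical helper mapping a file name to an export format."""
    suffix = Path(fname).suffix.lower()
    if suffix == ".pt":
        return "torchscript"
    if suffix == ".onnx":
        return "onnx"
    # Assumed behavior for anything else: refuse rather than guess.
    raise ValueError(f"unsupported policy file extension: {suffix!r}")


formats = [detect_format("policy.pt"), detect_format("policy.onnx")]
```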
- set_grad_step(grad_step)¶
Set total gradient step counter.
This method can be used to resume training midway with an arbitrary gradient step counter, which affects periodic functions such as the target update.
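To see why the counter matters for periodic functions, consider a target update that fires whenever the counter is a multiple of some interval. The sketch below is illustrative only: `target_sync_steps` and `target_update_interval` are hypothetical names, and the modulo rule is an assumption about how such a periodic update might be scheduled.

```python
def target_sync_steps(start_grad_step, n_updates, target_update_interval):
    """Return the gradient steps at which a periodic target sync would fire."""
    fired = []
    grad_step = start_grad_step
    for _ in range(n_updates):
        grad_step += 1
        if grad_step % target_update_interval == 0:
            fired.append(grad_step)
    return fired


# Starting from zero versus resuming with the counter set to 6.
fresh = target_sync_steps(start_grad_step=0, n_updates=10, target_update_interval=4)
resumed = target_sync_steps(start_grad_step=6, n_updates=10, target_update_interval=4)
```

With the counter restored, the first sync after resuming lands where it would have in an uninterrupted run, instead of restarting the period from zero.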
- update(batch)¶
Update parameters with mini-batch of data.
Attributes
- action_scaler¶
Preprocessing action scaler.
- Returns
preprocessing action scaler.
- Return type
Optional[ActionScaler]
- algo¶
- config¶
Config.
- Returns
config.
- Return type
LearnableConfig
- grad_step¶
Total gradient step counter.
This value will keep counting after the fit and fit_online methods finish.
- Returns
total gradient step counter.
- impl¶
Implementation object.
- Returns
implementation object.
- Return type
Optional[ImplBase]
- observation_scaler¶
Preprocessing observation scaler.
- Returns
preprocessing observation scaler.
- Return type
Optional[ObservationScaler]
- observation_shape¶
Observation shape.
- Returns
observation shape.
- Return type
Optional[Sequence[int]]
- reward_scaler¶
Preprocessing reward scaler.
- Returns
preprocessing reward scaler.
- Return type
Optional[RewardScaler]