d3rlpy.ope.FQE

class d3rlpy.ope.FQE(algo, config, device=False, impl=None)[source]

Fitted Q Evaluation.

FQE is an off-policy evaluation method that approximates the Q-function \(Q_\theta (s, a)\) of a trained policy \(\pi_\phi(s)\) from a static dataset.

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(Q_\theta(s_t, a_t) - r_{t+1} - \gamma Q_{\theta'}(s_{t+1}, \pi_\phi(s_{t+1})))^2]\]

The Q-function trained by FQE estimates evaluation metrics more accurately than the Q-function learned during policy training. A usage sketch is shown after the parameter list below.

References
  • Le et al., Batch Policy Learning under Constraints.

Parameters:
  • algo (d3rlpy.algos.base.AlgoBase) – Algorithm to evaluate.

  • config (d3rlpy.ope.FQEConfig) – FQE config.

  • device (bool, int or str) – Flag to use GPU, device ID or PyTorch device identifier.

  • impl (d3rlpy.metrics.ope.torch.FQEImpl) – Algorithm implementation.
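
A minimal end-to-end sketch (the pendulum dataset, the choice of CQL as the evaluated policy, the config-style constructors, the evaluator selection from d3rlpy.metrics, and all step counts and thresholds are illustrative assumptions, not requirements of FQE):

import d3rlpy

# toy continuous-control dataset bundled with d3rlpy
dataset, env = d3rlpy.datasets.get_pendulum()

# policy to be evaluated (any continuous-control algorithm works)
cql = d3rlpy.algos.CQLConfig().create(device="cpu")
cql.fit(dataset, n_steps=10000)

# fit FQE's Q-function against the trained policy on the same dataset
fqe = d3rlpy.ope.FQE(algo=cql, config=d3rlpy.ope.FQEConfig(), device="cpu")
fqe.fit(
    dataset,
    n_steps=10000,
    evaluators={
        "init_value": d3rlpy.metrics.InitialStateValueEstimationEvaluator(),
        "soft_opc": d3rlpy.metrics.SoftOPCEvaluator(return_threshold=-300),
    },
)

The estimated initial state values reported by the evaluator can then be compared across candidate policies without running online rollouts.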

Methods

build_with_dataset(dataset)

Instantiate implementation object with ReplayBuffer object.

Parameters:

dataset (ReplayBuffer) – dataset.

Return type:

None

build_with_env(env)

Instantiate implementation object with OpenAI Gym object.

Parameters:

env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.

Return type:

None

collect(env, buffer=None, explorer=None, deterministic=False, n_steps=1000000, show_progress=True)

Collects data via interaction with the environment.

If buffer is not given, a ReplayBuffer will be created internally.
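
A minimal sketch (env is assumed to be a Gym-like environment created elsewhere; the step count is a placeholder):

# collect 10k steps with the greedy policy; a ReplayBuffer is created
# internally because no buffer is passed in
buffer = algo.collect(env, deterministic=True, n_steps=10000)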

Parameters:
  • env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.

  • buffer (Optional[ReplayBufferBase]) – Replay buffer.

  • explorer (Optional[Explorer]) – Action explorer.

  • deterministic (bool) – Flag to collect data with the greedy policy.

  • n_steps (int) – Number of total steps to collect.

  • show_progress (bool) – Flag to show progress bar for iterations.

Returns:

Replay buffer with the collected data.

Return type:

ReplayBufferBase

copy_policy_from(algo)

Copies policy parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

copy_policy_optim_from(algo)

Copies policy optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_optim_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

copy_q_function_from(algo)

Copies Q-function parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

copy_q_function_optim_from(algo)

Copies Q-function optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_optim_from(cql)
Parameters:

algo (QLearningAlgoBase[QLearningAlgoImplBase, LearnableConfig]) – Algorithm object.

Return type:

None

create_impl(observation_shape, action_size)

Instantiate implementation objects with the dataset shapes.

This method will be used internally when fit method is called.

Parameters:
  • observation_shape – Observation shape.

  • action_size – Dimension of action-space.

Return type:

None

fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None, enable_ddp=False)

Trains with given dataset.

algo.fit(dataset, n_steps=1000000)
Parameters:
  • dataset (ReplayBufferBase) – ReplayBuffer object.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.

  • logging_strategy (LoggingStrategy) – Logging strategy to use.

  • logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, EvaluatorProtocol]]) – Dictionary of evaluators.

  • callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

  • epoch_callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.

  • enable_ddp (bool) – Flag to wrap models with DistributedDataParallel.

Returns:

List of result tuples (epoch, metrics) per epoch.

Return type:

List[Tuple[int, Dict[str, float]]]

fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, n_updates=1, update_start_step=0, random_steps=0, eval_env=None, eval_epsilon=0.0, save_interval=1, experiment_name=None, with_timestamp=True, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, callback=None)

Starts the training loop of online deep reinforcement learning.
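
A minimal sketch of the call (the environment name and step counts are placeholders; gym is assumed to be installed):

import gym

env = gym.make("Pendulum-v1")
eval_env = gym.make("Pendulum-v1")

# interact with env and update the model, evaluating on eval_env every epoch
algo.fit_online(
    env,
    eval_env=eval_env,
    n_steps=100000,
    n_steps_per_epoch=1000,
)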

Parameters:
  • env (Union[Env[Any, Any], Env[Any, Any]]) – Gym-like environment.

  • buffer (Optional[ReplayBufferBase]) – Replay buffer.

  • explorer (Optional[Explorer]) – Action explorer.

  • n_steps (int) – Number of total steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch.

  • update_interval (int) – Number of steps per update.

  • n_updates (int) – Number of gradient steps at a time. The combination of update_interval and n_updates controls Update-To-Data (UTD) ratio.

  • update_start_step (int) – Steps before starting updates.

  • random_steps (int) – Steps for the initial random exploration.

  • eval_env (Optional[Union[Env[Any, Any], Env[Any, Any]]]) – Gym-like environment. If None, evaluation is skipped.

  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.

  • save_interval (int) – Number of epochs before saving models.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.

  • logging_strategy (LoggingStrategy) – Logging strategy to use.

  • logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of epochs.

Return type:

None

fitter(dataset, n_steps, n_steps_per_epoch=10000, logging_steps=500, logging_strategy=LoggingStrategy.EPOCH, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None, enable_ddp=False)

Iterate over epochs to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.

for epoch, metrics in algo.fitter(dataset, n_steps=100000):
    my_plot(metrics)
    algo.save_model(my_path)
Parameters:
  • dataset (ReplayBufferBase) – Offline dataset to train.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to add timestamp string to the last of directory name.

  • logging_steps (int) – Number of steps to log metrics. This will be ignored if logging_strategy is EPOCH.

  • logging_strategy (LoggingStrategy) – Logging strategy to use.

  • logger_adapter (LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, EvaluatorProtocol]]) – Dictionary of evaluators.

  • callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

  • epoch_callback (Optional[Callable[[Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.

  • enable_ddp (bool) – Flag to wrap models with DistributedDataParallel.

Returns:

Iterator yielding current epoch and metrics dict.

Return type:

Generator[Tuple[int, Dict[str, float]], None, None]

classmethod from_json(fname, device=False)

Construct algorithm from params.json file.

from d3rlpy.algos import CQL

cql = CQL.from_json("<path-to-json>", device='cuda:0')
Parameters:
  • fname (str) – Path to params.json.

  • device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns:

algorithm object.

Return type:

Self

get_action_type()[source]

Returns action type (continuous or discrete).

Returns:

action type.

Return type:

ActionSpace

inner_create_impl(observation_shape, action_size)[source]
Parameters:
  • observation_shape – Observation shape.

  • action_size – Dimension of action-space.

Return type:

None

load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')
Parameters:

fname (str) – source file path.

Return type:

None

predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters:

x (Union[ndarray[Any, dtype[Any]], Sequence[ndarray[Any, dtype[Any]]]]) – Observations.

Returns:

Greedy actions

Return type:

ndarray[Any, dtype[Any]]

predict_value(x, action)

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
Parameters:
  • x – Observations.

  • action – Actions.

Returns:

Predicted action-values

Return type:

ndarray[Any, dtype[Any]]

reset_optimizer_states()

Resets optimizer states.

This is especially useful when fine-tuning policies, since the optimizers are restored to their initial states.
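
A short sketch of the fine-tuning pattern (both dataset names are placeholders):

# pretrain, then fine-tune on new data with freshly initialized optimizers
algo.fit(pretrain_dataset, n_steps=100000)
algo.reset_optimizer_states()
algo.fit(finetune_dataset, n_steps=100000)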

Return type:

None

sample_action(x)

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.
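
A small example mirroring predict (shapes follow the same conventions):

import numpy as np

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.sample_action(x)
# same shape conventions as predict()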

Parameters:

x (Union[ndarray[Any, dtype[Any]], Sequence[ndarray[Any, dtype[Any]]]]) – Observations.

Returns:

Sampled actions.

Return type:

ndarray[Any, dtype[Any]]

save(fname)

Saves paired data of neural network parameters and serialized config.

algo.save('model.d3')

# reconstruct everything
algo2 = d3rlpy.load_learnable("model.d3", device="cuda:0")
Parameters:

fname (str) – destination file path.

Return type:

None

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters:

fname (str) – destination file path.

Return type:

None

save_policy(fname)

Save the greedy-policy computational graph as TorchScript or ONNX.

The format will be automatically detected by the file name.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx')

The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.

Parameters:

fname (str) – Destination file path.

Return type:

None

set_grad_step(grad_step)

Set total gradient step counter.

This method can be used to resume training from the middle with an arbitrary gradient step counter, which affects periodic functions such as the target update.
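
A short sketch of resuming training from a checkpoint (the file name and step count are placeholders):

# restore weights, then realign the step counter so that periodic functions
# such as target updates stay on schedule
algo.load_model("model.pt")
algo.set_grad_step(500000)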

Parameters:

grad_step (int) – total gradient step counter.

Return type:

None

update(batch)

Update parameters with a mini-batch of data.
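
A rough sketch of a hand-rolled training loop (dataset is assumed to be a d3rlpy ReplayBuffer, and sample_transition_batch is assumed to be available on it):

# build networks from the dataset shapes before calling update manually
algo.build_with_dataset(dataset)

for step in range(1000):
    batch = dataset.sample_transition_batch(algo.batch_size)
    metrics = algo.update(batch)  # e.g. {"loss": ...}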

Parameters:

batch (TransitionMiniBatch) – Mini-batch data.

Returns:

Dictionary of metrics.

Return type:

Dict[str, float]

Attributes

action_scaler

Preprocessing action scaler.

Returns:

preprocessing action scaler.

Return type:

Optional[ActionScaler]

action_size

Action size.

Returns:

action size.

Return type:

Optional[int]

algo

Algorithm to evaluate.

batch_size

Batch size to train.

Returns:

batch size.

Return type:

int

config

Config.

Returns:

config.

Return type:

LearnableConfig

gamma

Discount factor.

Returns:

discount factor.

Return type:

float

grad_step

Total gradient step counter.

This value will keep counting after fit and fit_online methods finish.

Returns:

total gradient step counter.

Return type:

int

impl

Implementation object.

Returns:

implementation object.

Return type:

Optional[ImplBase]

need_returns_to_go
observation_scaler

Preprocessing observation scaler.

Returns:

preprocessing observation scaler.

Return type:

Optional[ObservationScaler]

observation_shape

Observation shape.

Returns:

observation shape.

Return type:

Optional[Sequence[int]]

reward_scaler

Preprocessing reward scaler.

Returns:

preprocessing reward scaler.

Return type:

Optional[RewardScaler]