d3rlpy.ope.FQE

class d3rlpy.ope.FQE(algo, config, device=False, impl=None)[source]

Fitted Q Evaluation.

FQE is an off-policy evaluation method that approximates the Q function \(Q_\theta (s, a)\) of the trained policy \(\pi_\phi(s)\).

\[L(\theta) = \mathbb{E}_{s_t, a_t, r_{t+1}, s_{t+1} \sim D} [(Q_\theta(s_t, a_t) - r_{t+1} - \gamma Q_{\theta'}(s_{t+1}, \pi_\phi(s_{t+1})))^2]\]

The Q function trained by FQE estimates evaluation metrics for the policy more accurately than the Q function learned during policy training.
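
As a hedged end-to-end sketch (assuming d3rlpy v2's config-based constructors, the bundled get_pendulum dataset helper, and the InitialStateValueEstimationEvaluator / SoftOPCEvaluator metrics; the return threshold is illustrative), evaluating a trained policy with FQE looks roughly like this:

import d3rlpy

# offline dataset and environment (continuous control example)
dataset, env = d3rlpy.datasets.get_pendulum()

# train the policy to be evaluated
cql = d3rlpy.algos.CQLConfig().create(device="cpu")
cql.fit(dataset, n_steps=10000)

# fit FQE on the same dataset to estimate the policy's value
fqe = d3rlpy.ope.FQE(algo=cql, config=d3rlpy.ope.FQEConfig())
fqe.fit(
    dataset,
    n_steps=10000,
    evaluators={
        "init_value": d3rlpy.metrics.InitialStateValueEstimationEvaluator(),
        # return_threshold below is an arbitrary illustrative value
        "soft_opc": d3rlpy.metrics.SoftOPCEvaluator(return_threshold=-300),
    },
)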

References
  • Le et al., Batch Policy Learning under Constraints.

Parameters
  • algo (d3rlpy.algos.base.AlgoBase) – Algorithm to evaluate.

  • config (d3rlpy.ope.FQEConfig) – FQE config.

  • device (bool, int or str) – Flag to use GPU, device ID or PyTorch device identifier.

  • impl (d3rlpy.metrics.ope.torch.FQEImpl) – Algorithm implementation.

Methods

build_with_dataset(dataset)

Instantiate implementation object with ReplayBuffer object.

Parameters

dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – dataset.

Return type

None

build_with_env(env)

Instantiate implementation object with OpenAI Gym object.

Parameters

env (gym.core.Env[Any, Any]) – gym-like environment.

Return type

None
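
Both builders are handy when inference methods such as predict or predict_value are needed before any call to fit; a minimal sketch, assuming the dataset and env objects from the example above:

# infer network shapes from the dataset without training
fqe.build_with_dataset(dataset)

# or, equivalently, from a Gym-like environment
fqe.build_with_env(env)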

collect(env, buffer=None, explorer=None, deterministic=False, n_steps=1000000, show_progress=True)

Collects data via interaction with the environment.

If buffer is not given, a ReplayBuffer will be created internally.
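
A minimal collection sketch, assuming a Gym environment created elsewhere (the environment name is illustrative):

import gym

env = gym.make("Pendulum-v1")

# roll out the greedy policy and store transitions in a new buffer
buffer = fqe.collect(env, deterministic=True, n_steps=10000)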

Parameters
  • env (gym.core.Env[Any, Any]) – Gym-like environment.

  • buffer (Optional[d3rlpy.dataset.replay_buffer.ReplayBuffer]) – Replay buffer.

  • explorer (Optional[d3rlpy.algos.qlearning.explorers.Explorer]) – Action explorer.

  • deterministic (bool) – Flag to collect data with the greedy policy.

  • n_steps (int) – Number of total steps to collect.

  • show_progress (bool) – Flag to show progress bar for iterations.

Returns

Replay buffer with the collected data.

Return type

d3rlpy.dataset.replay_buffer.ReplayBuffer

copy_policy_from(algo)

Copies policy parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

copy_policy_optim_from(algo)

Copies policy optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_policy_optim_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

copy_q_function_from(algo)

Copies Q-function parameters from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

copy_q_function_optim_from(algo)

Copies Q-function optimizer states from the given algorithm.

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(dataset, n_steps=100000)

# transfer to online algorithm
sac = d3rlpy.algos.SAC()
sac.create_impl(cql.observation_shape, cql.action_size)
sac.copy_q_function_optim_from(cql)
Parameters

algo (d3rlpy.algos.qlearning.base.QLearningAlgoBase[d3rlpy.algos.qlearning.base.QLearningAlgoImplBase, d3rlpy.base.LearnableConfig]) – Algorithm object.

Return type

None

create_impl(observation_shape, action_size)

Instantiate implementation objects with the dataset shapes.

This method will be used internally when the fit method is called.

Parameters
  • observation_shape (Union[Sequence[int], Sequence[Sequence[int]]]) – observation shape.

  • action_size (int) – dimension of action-space.

Return type

None

fit(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None)

Trains with given dataset.

algo.fit(dataset, n_steps=1000000)
Parameters
  • dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – ReplayBuffer object.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, d3rlpy.metrics.evaluators.EvaluatorProtocol]]) – Dictionary mapping metric names to evaluators.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

  • epoch_callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.

Returns

List of result tuples (epoch, metrics) per epoch.

Return type

List[Tuple[int, Dict[str, float]]]
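
A hedged sketch of fit's optional hooks and of consuming the returned metrics, continuing the example above (the callback name and experiment name are illustrative):

def on_epoch_end(algo, epoch, total_step):
    # called at the end of every epoch with the FQE instance
    print(f"epoch={epoch}, total_step={total_step}")

results = fqe.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    experiment_name="fqe_cql_pendulum",
    epoch_callback=on_epoch_end,
)

# results is a list of (epoch, metrics) tuples
for epoch, metrics in results:
    print(epoch, metrics)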

fit_online(env, buffer=None, explorer=None, n_steps=1000000, n_steps_per_epoch=10000, update_interval=1, update_start_step=0, random_steps=0, eval_env=None, eval_epsilon=0.0, save_interval=1, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, callback=None)

Start training loop of online deep reinforcement learning.
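
A minimal online-training sketch, assuming Gym environments created elsewhere (the environment name is illustrative):

import gym

env = gym.make("Pendulum-v1")
eval_env = gym.make("Pendulum-v1")

# buffer and explorer are optional; defaults are used when omitted
fqe.fit_online(
    env,
    eval_env=eval_env,
    n_steps=100000,
    n_steps_per_epoch=1000,
    random_steps=1000,
)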

Parameters
  • env (gym.core.Env[Any, Any]) – Gym-like environment.

  • buffer (Optional[d3rlpy.dataset.replay_buffer.ReplayBuffer]) – Replay buffer.

  • explorer (Optional[d3rlpy.algos.qlearning.explorers.Explorer]) – Action explorer.

  • n_steps (int) – Number of total steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch.

  • update_interval (int) – Number of steps per update.

  • update_start_step (int) – Steps before starting updates.

  • random_steps (int) – Steps for the initial random exploration.

  • eval_env (Optional[gym.core.Env[Any, Any]]) – Gym-like environment. If None, evaluation is skipped.

  • eval_epsilon (float) – \(\epsilon\)-greedy factor during evaluation.

  • save_interval (int) – Number of epochs before saving models.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_online_{timestamp}.

  • with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of each epoch.

Return type

None

fitter(dataset, n_steps, n_steps_per_epoch=10000, experiment_name=None, with_timestamp=True, logger_adapter=<d3rlpy.logging.file_adapter.FileAdapterFactory object>, show_progress=True, save_interval=1, evaluators=None, callback=None, epoch_callback=None)

Iterates over epochs to train with the given dataset. At each iteration, algo methods and properties can be changed or queried.

for epoch, metrics in algo.fitter(dataset, n_steps=1000000):
    my_plot(metrics)
    algo.save_model(my_path)
Parameters
  • dataset (d3rlpy.dataset.replay_buffer.ReplayBuffer) – Offline dataset to train.

  • n_steps (int) – Number of steps to train.

  • n_steps_per_epoch (int) – Number of steps per epoch. This value will be ignored when n_steps is None.

  • experiment_name (Optional[str]) – Experiment name for logging. If not passed, the directory name will be {class name}_{timestamp}.

  • with_timestamp (bool) – Flag to append a timestamp string to the end of the directory name.

  • logger_adapter (d3rlpy.logging.logger.LoggerAdapterFactory) – LoggerAdapterFactory object.

  • show_progress (bool) – Flag to show progress bar for iterations.

  • save_interval (int) – Interval to save parameters.

  • evaluators (Optional[Dict[str, d3rlpy.metrics.evaluators.EvaluatorProtocol]]) – Dictionary mapping metric names to evaluators.

  • callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called every step.

  • epoch_callback (Optional[Callable[[typing_extensions.Self, int, int], None]]) – Callable function that takes (algo, epoch, total_step), which is called at the end of every epoch.

Returns

Iterator yielding current epoch and metrics dict.

Return type

Generator[Tuple[int, Dict[str, float]], None, None]

classmethod from_json(fname, device=False)

Construct algorithm from params.json file.

from d3rlpy.algos import CQL

cql = CQL.from_json("<path-to-json>", device='cuda:0')
Parameters
  • fname (str) – path to params.json

  • device (Union[int, str, bool]) – device option. If the value is boolean and True, cuda:0 will be used. If the value is integer, cuda:<device> will be used. If the value is string in torch device style, the specified device will be used.

Returns

algorithm object.

Return type

typing_extensions.Self

get_action_type()[source]

Returns action type (continuous or discrete).
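
A small branching sketch:

from d3rlpy.constants import ActionSpace

if fqe.get_action_type() == ActionSpace.CONTINUOUS:
    # continuous control: predict returns arrays of shape (N, action_size)
    print("continuous action space")
else:
    print("discrete action space")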

Returns

action type.

Return type

d3rlpy.constants.ActionSpace

inner_create_impl(observation_shape, action_size)[source]
Parameters
  • observation_shape (Union[Sequence[int], Sequence[Sequence[int]]]) –

  • action_size (int) –

Return type

None

inner_update(batch)

Update parameters with PyTorch mini-batch.

Parameters

batch (d3rlpy.torch_utility.TorchMiniBatch) – PyTorch mini-batch data.

Returns

Dictionary of metrics.

Return type

Dict[str, float]

load_model(fname)

Load neural network parameters.

algo.load_model('model.pt')
Parameters

fname (str) – source file path.

Return type

None

predict(x)

Returns greedy actions.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

actions = algo.predict(x)
# actions.shape == (100, action size) for continuous control
# actions.shape == (100,) for discrete control
Parameters

x (Union[numpy.ndarray, Sequence[numpy.ndarray]]) – Observations.

Returns

Greedy actions.

Return type

numpy.ndarray

predict_value(x, action)

Returns predicted action-values.

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# for continuous control
# 100 actions with shape of (2,)
actions = np.random.random((100, 2))

# for discrete control
# 100 actions in integer values
actions = np.random.randint(2, size=100)

values = algo.predict_value(x, actions)
# values.shape == (100,)
Parameters
  • x (Union[numpy.ndarray, Sequence[numpy.ndarray]]) – Observations.

  • action (numpy.ndarray) – Actions.

Returns

Predicted action-values

Return type

numpy.ndarray

reset_optimizer_states()

Resets optimizer states.

This is especially useful when fine-tuning a policy from freshly initialized optimizer states.
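
A hedged fine-tuning sketch (the checkpoint path is illustrative):

# reload pretrained Q-function weights
fqe.load_model("fqe_pretrained.pt")

# continue training from freshly initialized optimizer states
fqe.reset_optimizer_states()
fqe.fit(dataset, n_steps=10000)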

Return type

None

sample_action(x)

Returns sampled actions.

The sampled actions are identical to the output of the predict method if the policy is deterministic.
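
Usage mirrors predict; a minimal sketch:

# 100 observations with shape of (10,)
x = np.random.random((100, 10))

# stochastic actions; identical to predict() for deterministic policies
actions = algo.sample_action(x)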

Parameters

x (Union[numpy.ndarray, Sequence[numpy.ndarray]]) – Observations.

Returns

Sampled actions.

Return type

numpy.ndarray

save(fname)

Saves paired data of neural network parameters and serialized config.

algo.save('model.d3')

# reconstruct everything
algo2 = d3rlpy.load_learnable("model.d3", device="cuda:0")
Parameters

fname (str) – destination file path.

Return type

None

save_model(fname)

Saves neural network parameters.

algo.save_model('model.pt')
Parameters

fname (str) – destination file path.

Return type

None

save_policy(fname)

Save the greedy-policy computational graph as TorchScript or ONNX.

The format will be automatically detected by the file name.

# save as TorchScript
algo.save_policy('policy.pt')

# save as ONNX
algo.save_policy('policy.onnx')

The artifacts saved with this method will work without d3rlpy. This method is especially useful for deploying the learned policy to production environments or embedded systems.
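
As a hedged deployment sketch, the TorchScript artifact can be loaded with plain PyTorch (the observation shape is illustrative):

import torch

policy = torch.jit.load("policy.pt")

# a single observation with shape (10,); the output is the greedy action
observation = torch.rand(1, 10)
action = policy(observation)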

Parameters

fname (str) – Destination file path.

Return type

None

set_grad_step(grad_step)

Set total gradient step counter.

This method can be used to restart training from the middle with an arbitrary gradient step counter, which affects periodic behaviors such as target network updates.
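
A hedged resume sketch (the checkpoint path and step count are illustrative):

# restore a checkpoint and continue counting from step 500000 so that
# periodic events such as target updates stay aligned
fqe.load_model("fqe_model_500000.pt")
fqe.set_grad_step(500000)
fqe.fit(dataset, n_steps=500000)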

Parameters

grad_step (int) – total gradient step counter.

Return type

None

update(batch)

Update parameters with mini-batch of data.

Parameters

batch (d3rlpy.dataset.mini_batch.TransitionMiniBatch) – Mini-batch data.

Returns

Dictionary of metrics.

Return type

Dict[str, float]

Attributes

action_scaler

Preprocessing action scaler.

Returns

preprocessing action scaler.

Return type

Optional[ActionScaler]

action_size

Action size.

Returns

action size.

Return type

Optional[int]

algo

Algorithm to evaluate.

batch_size

Batch size to train.

Returns

batch size.

Return type

int

config

Config.

Returns

config.

Return type

LearnableConfig

gamma

Discount factor.

Returns

discount factor.

Return type

float

grad_step

Total gradient step counter.

This value will keep counting after fit and fit_online methods finish.

Returns

total gradient step counter.

Return type

int

impl

Implementation object.

Returns

implementation object.

Return type

Optional[ImplBase]

observation_scaler

Preprocessing observation scaler.

Returns

preprocessing observation scaler.

Return type

Optional[ObservationScaler]

observation_shape

Observation shape.

Returns

observation shape.

Return type

Optional[Sequence[int]]

reward_scaler

Preprocessing reward scaler.

Returns

preprocessing reward scaler.

Return type

Optional[RewardScaler]