d3rlpy.metrics.SoftOPCEvaluator

class d3rlpy.metrics.SoftOPCEvaluator(return_threshold, episodes=None)[source]

Returns Soft Off-Policy Classification metrics.

This scorer function evaluates the gap in action-value estimates between success episodes and all episodes. If the learned Q-function is optimal, action-values in success episodes are expected to be higher than those in the rest. A success episode is defined as an episode whose return is above the given threshold.

\[\mathbb{E}_{s, a \sim D_{success}} [Q(s, a)] - \mathbb{E}_{s, a \sim D} [Q(s, a)]\]
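Conceptually, the computation looks like the following sketch. This is illustrative only, not d3rlpy's internal implementation; the q_values helper and the compute_return() method on episodes are hypothetical stand-ins.

    import numpy as np

    def soft_opc(episodes, q_values, return_threshold):
        # `q_values(ep)` is a hypothetical helper returning the Q(s, a)
        # estimates for every transition in an episode.
        all_qs = [q for ep in episodes for q in q_values(ep)]
        success_qs = [
            q
            for ep in episodes
            if ep.compute_return() >= return_threshold
            for q in q_values(ep)
        ]
        # Gap between the success-episode mean and the overall mean.
        return float(np.mean(success_qs) - np.mean(all_qs))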

References

  • Irpan et al., Off-Policy Evaluation via Off-Policy Classification. https://arxiv.org/abs/1906.01624

Parameters:
  • return_threshold – Return threshold of success episodes.

  • episodes – Optional evaluation episodes. If not given, the dataset used during training is used.
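A typical way to use the evaluator is to pass it to fit() via the evaluators argument. The following is a minimal sketch assuming the d3rlpy v2 API and the built-in CartPole dataset; the threshold of 180 is an arbitrary choice for illustration.

    import d3rlpy
    from d3rlpy.metrics import SoftOPCEvaluator

    # Built-in CartPole dataset for illustration.
    dataset, env = d3rlpy.datasets.get_cartpole()

    cql = d3rlpy.algos.DiscreteCQLConfig().create(device=False)  # CPU

    # Episodes with a return above 180 count as successes (arbitrary).
    cql.fit(
        dataset,
        n_steps=10000,
        evaluators={"soft_opc": SoftOPCEvaluator(return_threshold=180)},
    )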

Methods

__call__(algo, dataset)[source]

Computes metrics.

Parameters:
  • algo (QLearningAlgoProtocol) – Q-learning algorithm.

  • dataset (ReplayBufferBase) – Replay buffer used for evaluation.

Returns:
  Computed metric value.

Return type:
  float
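
The evaluator can also be called directly, which is what fit() does at the end of each epoch. A short sketch, reusing cql and dataset from the example above:

    # Assumes `cql` and `dataset` from the sketch above.
    evaluator = SoftOPCEvaluator(return_threshold=180)
    score = evaluator(cql, dataset)  # float: mean Q gap (success - all)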