d3rlpy.metrics.scorer.soft_opc_scorer

d3rlpy.metrics.scorer.soft_opc_scorer(return_threshold)[source]

Returns Soft Off-Policy Classification metrics.

This function returns a scorer function that follows the standard scikit-learn scorer interface. The metric evaluates the gap in action-value estimates between success episodes and all episodes. If the learned Q-function is optimal, action-values in success episodes are expected to be higher than the others. A success episode is defined as an episode whose return is above the given threshold.

\[\mathbb{E}_{s, a \sim D_{success}} [Q(s, a)] - \mathbb{E}_{s, a \sim D} [Q(s, a)]\]

from d3rlpy.datasets import get_cartpole
from d3rlpy.algos import DQN
from d3rlpy.metrics.scorer import soft_opc_scorer
from sklearn.model_selection import train_test_split

# prepare the CartPole dataset and hold out episodes for evaluation
dataset, _ = get_cartpole()
train_episodes, test_episodes = train_test_split(dataset, test_size=0.2)

# episodes with a return above 180 are treated as successes
scorer = soft_opc_scorer(return_threshold=180)

dqn = DQN()
dqn.fit(train_episodes,
        eval_episodes=test_episodes,
        scorers={'soft_opc': scorer})
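
For intuition, here is a minimal sketch of what the returned scorer computes, assuming the algorithm exposes predict_value(observations, actions) and episodes expose compute_return(), as d3rlpy's do. This is an illustration of the metric above, not the library's exact implementation.

import numpy as np

def soft_opc_sketch(algo, episodes, return_threshold):
    # NOTE: illustrative sketch, not d3rlpy's actual implementation
    success_values = []
    all_values = []
    for episode in episodes:
        # estimated Q(s, a) for every transition in the episode
        values = algo.predict_value(episode.observations, episode.actions)
        all_values.extend(values)
        # episodes with a return above the threshold count as successes
        if episode.compute_return() >= return_threshold:
            success_values.extend(values)
    # gap between success episodes and all episodes
    return float(np.mean(success_values) - np.mean(all_values))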

References

Irpan et al., Off-Policy Evaluation via Off-Policy Classification. https://arxiv.org/abs/1906.01624

Parameters

return_threshold (float) – return threshold that defines success episodes.

Returns

scorer function.

Return type

Callable[[d3rlpy.metrics.scorer.AlgoProtocol, List[d3rlpy.dataset.Episode]], float]
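
Since the returned scorer follows this (algo, episodes) -> float signature, it can also be called directly on a trained algorithm outside of fit():

# evaluate a trained algorithm on held-out episodes directly
score = scorer(dqn, test_episodes)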