d3rlpy.metrics.scorer.discounted_sum_of_advantage_scorer

d3rlpy.metrics.scorer.discounted_sum_of_advantage_scorer(algo, episodes, window_size=1024)[source]

Returns average of discounted sum of advantage (in negative scale).

This metric suggests how differently the greedy policy selects actions in action-value space. If the sum of advantage is small, the policy selects actions with larger estimated action-values.

\[\mathbb{E}_{s_t, a_t \sim D} [\sum_{t' = t} \gamma^{t' - t} A(s_{t'}, a_{t'})]\]

where \(A(s_t, a_t) = Q_\theta (s_t, a_t) - \max_a Q_\theta (s_t, a)\).
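
For illustration, the quantity above can be reproduced for a single episode as in the following sketch. It assumes the AlgoBase methods predict() and predict_value(), a gamma attribute, and the Episode attributes observations and actions; the actual scorer additionally processes each episode in mini-batches of window_size before averaging over all episodes and timesteps.

    import numpy as np

    def per_timestep_discounted_advantage(algo, episode):
        # Q_theta(s_t, a_t) for the actions stored in the dataset
        dataset_values = np.asarray(
            algo.predict_value(episode.observations, episode.actions))
        # max_a Q_theta(s_t, a), approximated by the value of the greedy action
        greedy_actions = algo.predict(episode.observations)
        greedy_values = np.asarray(
            algo.predict_value(episode.observations, greedy_actions))
        # advantages are non-positive by construction
        advantages = dataset_values - greedy_values
        # backward recursion: S_t = A(s_t, a_t) + gamma * S_{t+1}
        sums = np.zeros_like(advantages)
        running = 0.0
        for t in reversed(range(len(advantages))):
            running = advantages[t] + algo.gamma * running
            sums[t] = running
        # the scorer averages these per-timestep sums across episodes
        return sums

For discrete-action algorithms, predict() returns the argmax action, so the greedy value matches \(\max_a Q_\theta (s_t, a)\); for continuous control, the policy's action serves as an approximation of the maximizer.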

Parameters:
  • algo (d3rlpy.algos.base.AlgoBase) – algorithm.
  • episodes (list(d3rlpy.dataset.Episode)) – list of episodes.
  • window_size (int) – mini-batch size used for the computation.
Returns:

negative average of discounted sum of advantage.

Return type:

float
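
A usage sketch, assuming the d3rlpy 1.x training API in which fit() accepts eval_episodes and a scorers dictionary; the dataset and algorithm choices here (get_cartpole, DQN) are illustrative only.

    from d3rlpy.algos import DQN
    from d3rlpy.datasets import get_cartpole
    from d3rlpy.metrics.scorer import discounted_sum_of_advantage_scorer

    dataset, _ = get_cartpole()
    train_episodes = dataset.episodes[:-10]
    test_episodes = dataset.episodes[-10:]

    dqn = DQN()

    # evaluated on test_episodes at the end of every epoch during training
    dqn.fit(
        train_episodes,
        eval_episodes=test_episodes,
        n_epochs=1,
        scorers={"advantage": discounted_sum_of_advantage_scorer},
    )

    # or called directly on held-out episodes after training
    score = discounted_sum_of_advantage_scorer(dqn, test_episodes)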