d3rlpy.metrics.DiscountedSumOfAdvantageEvaluator

class d3rlpy.metrics.DiscountedSumOfAdvantageEvaluator(episodes=None)[source]

Returns average of discounted sum of advantage.

This metric suggests how far the greedy policy deviates from the dataset actions in action-value space. If the sum of advantage is small (i.e., largely negative), the policy selects actions with larger estimated action-values than the actions observed in the dataset.

\[\mathbb{E}_{s_t, a_t \sim D} [\sum_{t' = t} \gamma^{t' - t} A(s_{t'}, a_{t'})]\]

where \(A(s_t, a_t) = Q_\theta (s_t, a_t) - \mathbb{E}_{a \sim \pi} [Q_\theta (s_t, a)]\).
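
The following is a small illustrative sketch of this computation for a single episode, using toy numbers and hypothetical array names (q_taken stands for \(Q_\theta(s_t, a_t)\) of the dataset action, q_policy for \(\mathbb{E}_{a \sim \pi}[Q_\theta(s_t, a)]\)); it is not the library's implementation.

import numpy as np

# Toy per-step action-values for a single episode (illustrative values only).
q_taken = np.array([1.0, 0.8, 0.5])   # Q_theta(s_t, a_t) for the dataset actions
q_policy = np.array([1.2, 0.9, 0.5])  # E_{a ~ pi}[Q_theta(s_t, a)]
gamma = 0.99

# Advantage of each dataset action relative to the policy's expected value.
advantages = q_taken - q_policy

# Discounted sum of advantages from each step t to the end of the episode,
# computed with a reverse pass: G_t = A_t + gamma * G_{t+1}.
discounted_sums = np.zeros_like(advantages)
running = 0.0
for t in reversed(range(len(advantages))):
    running = advantages[t] + gamma * running
    discounted_sums[t] = running

# The metric is the average of these discounted sums over all sampled (s_t, a_t).
print(discounted_sums.mean())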

References

  • Murphy, A generalization error for Q-Learning. http://www.jmlr.org/papers/volume6/murphy05a/murphy05a.pdf

Parameters:

episodes – Optional evaluation episodes. If not given, the dataset used during training is used instead.

Methods

__call__(algo, dataset)[source]

Computes metrics.

Parameters:
  • algo (QLearningAlgoProtocol) – Q-learning algorithm.

  • dataset (ReplayBufferBase) – ReplayBuffer.

Returns:

Computed metrics.

Return type:

float
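
A minimal usage sketch, assuming the d3rlpy 2.x API with DQN and the bundled CartPole dataset helper; the evaluator can be registered with fit() to track the metric during training, or called directly as documented above.

import d3rlpy
from d3rlpy.metrics import DiscountedSumOfAdvantageEvaluator

# Load an offline dataset and build a Q-learning algorithm (CartPole + DQN here).
dataset, env = d3rlpy.datasets.get_cartpole()
dqn = d3rlpy.algos.DQNConfig().create()

# Track the metric during training by registering the evaluator with fit().
dqn.fit(
    dataset,
    n_steps=10000,
    evaluators={"advantage": DiscountedSumOfAdvantageEvaluator(episodes=dataset.episodes)},
)

# Or compute it directly after training, matching the __call__ signature above.
evaluator = DiscountedSumOfAdvantageEvaluator()
print(evaluator(dqn, dataset))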