Online Training

Standard Training

d3rlpy provides not only offline training but also online training utilities. Although d3rlpy is designed for offline RL algorithms, they are flexible enough to be trained in an online manner with a few extra utilities.

import gym

from d3rlpy.algos import DQN
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import LinearDecayEpsilonGreedy

# setup environment
env = gym.make('CartPole-v0')
eval_env = gym.make('CartPole-v0')

# setup algorithm
dqn = DQN(batch_size=32,
          learning_rate=2.5e-4,
          target_update_interval=100,
          use_gpu=True)

# setup replay buffer
buffer = ReplayBuffer(maxlen=1000000, env=env)

# setup explorer
explorer = LinearDecayEpsilonGreedy(start_epsilon=1.0,
                                    end_epsilon=0.1,
                                    duration=10000)

# start training
dqn.fit_online(env,
               buffer,
               explorer=explorer, # you don't need this with probabilistic policy algorithms
               eval_env=eval_env,
               n_epochs=30,
               n_steps_per_epoch=1000,
               n_updates_per_epoch=100)
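
The n_epochs, n_steps_per_epoch and n_updates_per_epoch arguments control how environment interaction and gradient updates alternate. The toy loop below illustrates that cadence with a random policy and a plain Python list standing in for the replay buffer. This is a conceptual sketch only, not d3rlpy's implementation; the real loop may interleave collection and updates differently.

import random

import gym

env = gym.make('CartPole-v0')
buffer = []  # stand-in for ReplayBuffer

observation = env.reset()
for epoch in range(30):  # n_epochs
    # interaction phase: n_steps_per_epoch environment steps
    for _ in range(1000):
        action = env.action_space.sample()  # stand-in for explorer/policy
        next_observation, reward, done, _ = env.step(action)
        buffer.append((observation, action, reward, next_observation, done))
        observation = env.reset() if done else next_observation
    # update phase: n_updates_per_epoch gradient steps
    for _ in range(100):
        batch = random.sample(buffer, min(32, len(buffer)))
        # the algorithm would fit its Q-function on `batch` here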

Replay Buffer

d3rlpy.online.buffers.ReplayBuffer

Standard Replay Buffer.
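
The buffer is filled automatically by fit_online during training, but it can also be used directly. The sketch below collects a short random rollout and draws a mini-batch; the append and sample signatures here are assumptions based on this interface, so check the API reference for the exact arguments.

import gym

from d3rlpy.online.buffers import ReplayBuffer

env = gym.make('CartPole-v0')
buffer = ReplayBuffer(maxlen=100000, env=env)

# collect a short random rollout
observation = env.reset()
for _ in range(100):
    action = env.action_space.sample()
    next_observation, reward, done, _ = env.step(action)
    # append signature assumed: (observation, action, reward, terminal)
    buffer.append(observation, action, reward, done)
    observation = env.reset() if done else next_observation

# draw a mini-batch of transitions (signature assumed)
batch = buffer.sample(batch_size=32)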

Explorers

d3rlpy.online.explorers.ConstantEpsilonGreedy

\(\epsilon\)-greedy explorer with constant \(\epsilon\).

d3rlpy.online.explorers.LinearDecayEpsilonGreedy

\(\epsilon\)-greedy explorer with linear decay schedule.

d3rlpy.online.explorers.NormalNoise

Normal noise explorer.
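
As a concrete illustration, LinearDecayEpsilonGreedy anneals \(\epsilon\) linearly from start_epsilon to end_epsilon over duration steps and then holds it constant. The plain-Python function below sketches that schedule (the explorer computes this internally; the exact clamping behavior is an assumption):

def linear_decay_epsilon(step, start_epsilon=1.0, end_epsilon=0.1, duration=10000):
    # linear interpolation between start and end, clamped after `duration`
    if step >= duration:
        return end_epsilon
    return start_epsilon + (end_epsilon - start_epsilon) * step / duration

assert linear_decay_epsilon(0) == 1.0
assert abs(linear_decay_epsilon(5000) - 0.55) < 1e-9  # halfway through the decay
assert linear_decay_epsilon(20000) == 0.1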

Batch Concurrent Training

d3rlpy supports computationally efficient batch concurrent training, in which multiple environments are stepped in parallel to collect transitions faster.

import gym

from d3rlpy.algos import DQN
from d3rlpy.envs import AsyncBatchEnv
from d3rlpy.online.buffers import BatchReplayBuffer
from d3rlpy.online.explorers import LinearDecayEpsilonGreedy

# this guard is necessary because AsyncBatchEnv spawns subprocesses
if __name__ == '__main__':
    env = AsyncBatchEnv([lambda: gym.make('CartPole-v0') for _ in range(10)])

    eval_env = gym.make('CartPole-v0')

    # setup algorithm
    dqn = DQN(batch_size=32,
              learning_rate=2.5e-4,
              target_update_interval=100,
              use_gpu=True)

    # setup replay buffer
    buffer = BatchReplayBuffer(maxlen=1000000, env=env)

    # setup explorer
    explorer = LinearDecayEpsilonGreedy(start_epsilon=1.0,
                                        end_epsilon=0.1,
                                        duration=10000)

    # start training
    dqn.fit_batch_online(env,
                         buffer,
                         explorer=explorer, # you don't need this with probabilistic policy algorithms
                         eval_env=eval_env,
                         n_epochs=30,
                         n_steps_per_epoch=1000,
                         n_updates_per_epoch=100)

For the environment wrappers, see d3rlpy.envs.AsyncBatchEnv and d3rlpy.envs.SyncBatchEnv.
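
SyncBatchEnv is the simpler of the two: it steps its environments sequentially in the current process, so the __main__ guard is not needed. The constructor is assumed here to take a list of already-built environments, in contrast to the factory functions AsyncBatchEnv requires for spawning subprocesses; check the API reference to confirm.

import gym

from d3rlpy.envs import SyncBatchEnv

# no __main__ guard needed: everything runs in the current process
env = SyncBatchEnv([gym.make('CartPole-v0') for _ in range(10)])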

Replay Buffer

d3rlpy.online.buffers.BatchReplayBuffer

Standard Replay Buffer for batch training.
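
Construction mirrors ReplayBuffer, except that the env argument is the batch environment, so transitions from all parallel environments are stored together. A minimal sketch, assuming the sampling interface mirrors ReplayBuffer's:

import gym

from d3rlpy.envs import SyncBatchEnv
from d3rlpy.online.buffers import BatchReplayBuffer

env = SyncBatchEnv([gym.make('CartPole-v0') for _ in range(10)])
buffer = BatchReplayBuffer(maxlen=1000000, env=env)
# mini-batches are then drawn during fit_batch_online,
# e.g. buffer.sample(batch_size=32) (method name assumed)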