Online RL
Prepare Environment
d3rlpy supports environments that implement the OpenAI Gym interface. This tutorial uses the simple CartPole environment.
import gym
# for training
env = gym.make("CartPole-v0")
# for evaluation
eval_env = gym.make("CartPole-v0")
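For readers new to the interface, the sketch below shows the call pattern an agent relies on, assuming the classic Gym API in which `reset()` returns an observation and `step()` returns a 4-tuple of observation, reward, done flag, and info dict. `CoinFlipEnv` and `run_episode` are hypothetical names written for illustration; they are not part of gym or d3rlpy.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment following the classic Gym call pattern."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        # start a new episode and return the first observation
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # reward 1.0 when the action matches the current coin, else 0.0
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


def run_episode(env, policy):
    """Roll out one episode and return the total reward and step count."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


# a policy that copies the observed coin earns the reward at every step
total_reward, steps = run_episode(CoinFlipEnv(), policy=lambda obs: obs)
```

A real Gym environment also exposes `observation_space` and `action_space`, which this toy class omits for brevity.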
Setup Algorithm
As in offline RL training, start by setting up an algorithm object.
import d3rlpy
# if you don't use GPU, set use_gpu=False instead.
dqn = d3rlpy.algos.DQN(
    batch_size=32,
    learning_rate=2.5e-4,
    target_update_interval=100,
    use_gpu=True,
)
# initialize neural networks with the given environment object.
# this step is not necessary if you call fit or fit_online directly.
dqn.build_with_env(env)
Setup Online RL Utilities
Unlike offline RL training, you’ll need to set up an experience replay buffer and an exploration strategy.
# experience replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# exploration strategy
# in this tutorial, epsilon-greedy policy with static epsilon=0.3
explorer = d3rlpy.online.explorers.ConstantEpsilonGreedy(0.3)
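To make the roles of these two utilities concrete, here is a minimal standard-library sketch of a bounded FIFO replay buffer and constant epsilon-greedy action selection. `TinyReplayBuffer` and `epsilon_greedy` are hypothetical helpers written for this illustration; they are not how d3rlpy implements `ReplayBuffer` or `ConstantEpsilonGreedy`.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Hypothetical FIFO buffer: the oldest transition is dropped when full."""

    def __init__(self, maxlen):
        self.transitions = deque(maxlen=maxlen)

    def append(self, observation, action, reward, next_observation, done):
        self.transitions.append(
            (observation, action, reward, next_observation, done)
        )

    def sample(self, batch_size, rng):
        # uniform sampling without replacement
        return rng.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)

tiny_buffer = TinyReplayBuffer(maxlen=3)
for t in range(5):
    tiny_buffer.append(t, 0, 1.0, t + 1, False)
# only the three most recent transitions (t = 2, 3, 4) remain
batch = tiny_buffer.sample(2, rng)

# epsilon=0.0 is always greedy; epsilon=0.3 mixes in random actions
greedy_action = epsilon_greedy([0.1, 0.9], epsilon=0.0, rng=rng)
actions = [epsilon_greedy([0.1, 0.9], epsilon=0.3, rng=rng) for _ in range(1000)]
greedy_fraction = actions.count(1) / len(actions)
```

With two actions and epsilon=0.3, the greedy action is chosen about 85% of the time: 70% from the greedy branch plus half of the 30% random draws.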
Start Training
Now, you have everything you need to start online RL training. Let’s put them together!
dqn.fit_online(
    env,
    buffer,
    explorer,
    n_steps=100000,  # train for 100K steps
    eval_env=eval_env,
    n_steps_per_epoch=1000,  # evaluation is performed every 1K steps
    update_start_step=1000,  # parameter update starts after 1K steps
)
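The step-related arguments interact as sketched below. This is a simplified stand-in for the schedule, not d3rlpy's actual training loop: it assumes one parameter update per environment step once the warm-up period is over, and the exact boundary condition (`>` rather than `>=`) is also an assumption.

```python
# values taken from the fit_online call in this tutorial
n_steps = 100000
n_steps_per_epoch = 1000
update_start_step = 1000

n_epochs = 0
n_updates = 0
for total_step in range(1, n_steps + 1):
    # assumption: transitions are only collected until the counter
    # exceeds update_start_step; after that, each step also updates
    if total_step > update_start_step:
        n_updates += 1
    # an epoch ends (and evaluation would run) every n_steps_per_epoch steps
    if total_step % n_steps_per_epoch == 0:
        n_epochs += 1

print(n_epochs, n_updates)
```

Under these assumptions the run spans 100 epochs, so `eval_env` would be used for 100 evaluations, and the first 1,000 steps only fill the replay buffer.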
Train with Stochastic Policy
If the algorithm uses a stochastic policy (e.g. SAC), you can train it without an exploration strategy, because sampling actions from the policy already provides exploration.
sac = d3rlpy.algos.DiscreteSAC()
sac.fit_online(
    env,
    buffer,
    n_steps=100000,
    eval_env=eval_env,
    n_steps_per_epoch=1000,
    update_start_step=1000,
)
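The reason no explorer is needed can be seen in a small sketch of sampling from a categorical policy. The `softmax` and `sample_action` helpers and the logit values are hypothetical and only illustrate the idea; this is not the DiscreteSAC implementation.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into action probabilities."""
    shifted = [x - max(logits) for x in logits]  # for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw an action index in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0]  # hypothetical policy outputs for two actions
probs = softmax(logits)
samples = [sample_action(logits, rng) for _ in range(1000)]
```

Every action keeps a nonzero probability, so the less preferred action is still tried from time to time, which is the role the epsilon-greedy explorer plays for DQN.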