I have been working on solving the Gym Taxi-v3 problem with reinforcement learning. Initially I applied tabular Q-learning; after 10,000 training iterations the algorithm reached an average reward of 8.x with a 100% success rate, which was satisfactory.
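For reference, the tabular version uses the standard Q-learning update Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). Here is a minimal sketch of that update on a toy two-state chain (purely illustrative, not Taxi-v3 itself):

```python
import random

def q_learning_toy(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy 2-state chain (illustrative, not Taxi-v3).

    States 0 and 1; state 2 is terminal. Action 1 moves right (+1.0 reward
    on reaching the terminal state), action 0 stays in place (0 reward).
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s != 2:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.choice((0, 1))
            else:
                a = max((0, 1), key=lambda act: q[(s, act)])
            s_next = s + 1 if a == 1 else s
            r = 1.0 if s_next == 2 else 0.0
            # Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if s_next == 2 else max(q[(s_next, act)] for act in (0, 1))
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q
```

After a few hundred episodes the greedy policy (take the argmax action in each state) is optimal for this chain; on Taxi-v3 the same update runs over its 500 discrete states.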
However, when I tried to solve the same problem with a DQN (Deep Q-Network), the results were much worse. After roughly 100 training iterations, the evaluation episode_reward_mean appears to converge around -210 and episode_len_mean around 200.
From what I learned from ChatGPT, DQN should be applicable to Taxi-v3. I am not sure why my model is performing so poorly.
I would appreciate any insight into what might be going wrong and how to solve Taxi-v3 effectively with DQN. I am particularly interested in DQN because I believe it is better suited than tabular Q-learning to complex real-world problems.
My DQN training and evaluation code:
from ray.rllib.algorithms.dqn.dqn import DQN, DQNConfig
import ray
import csv
import datetime
import os
ray.init(local_mode=True)
# ray.init(address='auto') # connect to Ray cluster
num_rollout_workers = 62
max_train_iter_times = 20000
config = DQNConfig()
config = config.environment("Taxi-v3")
config = config.rollouts(num_rollout_workers=num_rollout_workers)
config = config.framework("torch")
# Update exploration_config
exploration_config = {
    "type": "EpsilonGreedy",
    "initial_epsilon": 1.0,
    "final_epsilon": 0.02,
    # NOTE: epsilon_timesteps is measured in environment timesteps, not training iterations
    "epsilon_timesteps": max_train_iter_times,
}
config = config.exploration(exploration_config=exploration_config)
# Configure evaluation through the .evaluation() API; assigning
# config.evaluation_config directly leaves evaluation_interval unset.
config = config.evaluation(
    evaluation_interval=10,
    evaluation_duration=10,  # episodes per evaluation run
)
# Update replay_buffer_config
replay_buffer_config = {
    "_enable_replay_buffer_api": True,
    "type": "MultiAgentPrioritizedReplayBuffer",
    "capacity": 1000,
    "prioritized_replay_alpha": 0.5,
    "prioritized_replay_beta": 0.5,
    "prioritized_replay_eps": 3e-6,
}
config = config.training(
    model={"fcnet_hiddens": [50, 50, 50]},
    lr=0.001,
    gamma=0.99,
    replay_buffer_config=replay_buffer_config,
    target_network_update_freq=500,
    double_q=True,
    dueling=True,
    num_atoms=1,
    noisy=False,
    n_step=3,
)
algo = DQN(config=config)
# algo = config.build() # 2. build the algorithm,
no_improvement_counter = 0
prev_reward = None
# Get the current date
current_date = datetime.datetime.now().strftime('%Y%m%d')
# Open the csv file in write mode
with open(f'train_{current_date}.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Write the header row
    writer.writerow(["Iteration", "Reward_Mean", "Episode_Length_Mean"])
    for i in range(max_train_iter_times):
        print(f'#{i}: {algo.train()}\n')  # 3. train it,
        # Save a model checkpoint every 10 iterations
        if (i + 1) % 10 == 0:
            checkpoint = algo.save()
            print("Model checkpoint saved at", checkpoint)
        eval_result = algo.evaluate()
        print(f'to evaluate model: {eval_result}')  # 4. and evaluate it.
        cur_reward = eval_result['evaluation']['sampler_results']['episode_reward_mean']
        cur_episode_len_mean = eval_result['evaluation']['sampler_results']['episode_len_mean']
        # Write the iteration, reward, and episode length to the csv
        writer.writerow([i + 1, cur_reward, cur_episode_len_mean])
        # Force the file to be written to disk immediately
        file.flush()
        os.fsync(file.fileno())
        if prev_reward is not None and cur_reward <= prev_reward:
            no_improvement_counter += 1
        else:
            no_improvement_counter = 0
        print(f'evaluated episode_reward_mean: {cur_reward}, no improvement counter: {no_improvement_counter}\n')
        if no_improvement_counter >= 20:
            print(f"Training stopped: episode_reward_mean did not improve for 20 consecutive evaluations. totalIterNum: {i + 1}")
            break
        prev_reward = cur_reward
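One detail worth flagging in the config above: RLlib's EpsilonGreedy exploration anneals epsilon over environment timesteps, not training iterations, so reusing max_train_iter_times (20,000) as epsilon_timesteps means epsilon hits its floor after only 20,000 sampled steps — with 62 rollout workers collecting up-to-200-step Taxi episodes, that is likely exhausted within the first few training iterations. A sketch of the schedule (assuming RLlib's default linear interpolation; illustrative only):

```python
def epsilon_at(timestep, initial=1.0, final=0.02, epsilon_timesteps=20000):
    """Linear epsilon decay from `initial` to `final` over `epsilon_timesteps`
    environment steps, then held constant (mirrors a linear EpsilonGreedy schedule)."""
    frac = min(timestep / epsilon_timesteps, 1.0)
    return initial + frac * (final - initial)
```

So by the time training has run for a while, almost all sampled experience is near-greedy, which matters if the Q-function has not yet found any rewarding trajectories.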
I tried changing the DQN replay_buffer_config capacity to 10000 and n_step to 20, but that did not help; the results were the same.
Do you see any progress during training?
Assuming you do, my first intuition from looking at your config is that you are applying exploration at test time. You have to make sure the explore flag is set to False in the evaluation config, i.e. DQNConfig.evaluation(evaluation_config={"explore": False}).
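In your script, that change would look roughly like the following (a sketch against the RLlib 2.x fluent API; parameter names such as evaluation_duration may differ slightly between RLlib releases):

```python
from ray.rllib.algorithms.dqn.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("Taxi-v3")
    .framework("torch")
    .evaluation(
        evaluation_interval=10,                # evaluate every 10 training iterations
        evaluation_duration=10,                # run 10 evaluation episodes each time
        evaluation_config={"explore": False},  # act greedily at evaluation time
    )
)
```

With exploration left on during evaluation, the reported episode_reward_mean measures the epsilon-greedy behavior policy rather than the learned greedy policy, which can mask any real progress.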