Trouble implementing DQN for Gym's Taxi-v3 problem


I have been working on solving Gym's Taxi-v3 problem with reinforcement learning algorithms. Initially I applied tabular Q-learning; after 10,000 training iterations it reached an average reward of about 8.x with a 100% success rate, which was satisfactory.
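
For reference, my tabular baseline looked roughly like the following minimal sketch (assuming the gym>=0.26 reset/step API; the hyperparameters are illustrative, not the exact values I used):

import gym
import numpy as np

env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # one-step Q-learning backup
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state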

However, when I tried to solve the same problem with DQN (Deep Q-Network), the results were much worse. After roughly 100 training iterations, the evaluation episode_reward_mean seems to converge around -210 and episode_len_mean around 200. Since Taxi-v3 truncates episodes at 200 steps, this suggests the agent almost never completes a successful dropoff.

From what I have learned from ChatGPT, DQN should be applicable to the Taxi-v3 problem. I am not sure why my model performs so poorly.
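
As far as I understand, Taxi-v3 returns a single integer observation in [0, 500), and a Q-network needs a vector input, so the state has to be one-hot encoded (RLlib's preprocessor should do this automatically for Discrete spaces). A minimal plain-PyTorch sketch of what I believe the network effectively sees (sizes match my RLlib model config below; the snippet is illustrative, not my training code):

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_STATES = 500   # 25 taxi positions x 5 passenger locations x 4 destinations
NUM_ACTIONS = 6    # south, north, east, west, pickup, dropoff

q_net = nn.Sequential(
    nn.Linear(NUM_STATES, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, NUM_ACTIONS),
)

obs = 42  # an example raw observation from env.step()
x = F.one_hot(torch.tensor(obs), NUM_STATES).float()  # shape: (500,)
q_values = q_net(x)                                   # shape: (6,)
greedy_action = int(q_values.argmax())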

I would appreciate any insight into what might be going wrong and how to solve Taxi-v3 effectively with DQN. I am particularly interested in DQN because I believe it scales better than tabular Q-learning to complex real-world problems.

Screenshot of the evaluation results:

My DQN training and evaluation code:

from ray.rllib.algorithms.dqn.dqn import DQN, DQNConfig
import ray
import csv
import datetime
import os

ray.init(local_mode=True)
# ray.init(address='auto')  # connect to Ray cluster

num_rollout_workers = 62       # number of parallel sampling workers
max_train_iter_times = 20000   # maximum number of training iterations

config = DQNConfig()
config = config.environment("Taxi-v3")
config = config.rollouts(num_rollout_workers=num_rollout_workers)
config = config.framework("torch")

# Update exploration_config.
# Note: epsilon_timesteps is measured in environment timesteps,
# not training iterations.
exploration_config = {
    "type": "EpsilonGreedy",
    "initial_epsilon": 1.0,
    "final_epsilon": 0.02,
    "epsilon_timesteps": max_train_iter_times,
}
config = config.exploration(exploration_config=exploration_config)
# Configure evaluation through the builder API; assigning a plain dict to
# config.evaluation_config bypasses it, and "evaluation_num_episodes" is a
# deprecated key.
config = config.evaluation(
    evaluation_interval=10,
    evaluation_duration=10,
    evaluation_duration_unit="episodes",
)
# Update replay_buffer_config
replay_buffer_config = {
    "_enable_replay_buffer_api": True,
    "type": "MultiAgentPrioritizedReplayBuffer",
    "capacity": 1000,
    "prioritized_replay_alpha": 0.5,
    "prioritized_replay_beta": 0.5,
    "prioritized_replay_eps": 3e-6,
}
config = config.training(
    model={"fcnet_hiddens": [50, 50, 50]},
    lr=0.001,
    gamma=0.99,
    replay_buffer_config=replay_buffer_config,
    target_network_update_freq=500,
    double_q=True,
    dueling=True,
    num_atoms=1,
    noisy=False,
    n_step=3,
)

algo = DQN(config=config)
# algo = config.build()  # 2. build the algorithm,
no_improvement_counter = 0
prev_reward = None

# Get the current date
current_date = datetime.datetime.now().strftime('%Y%m%d')

# Open the csv file in write mode
with open(f'train_{current_date}.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Write the header row
    writer.writerow(["Iteration", "Reward_Mean", "Episode_Length_Mean"])

    for i in range(max_train_iter_times):
        print(f'#{i}: {algo.train()}\n')  # 3. train it,
        # Save a checkpoint and evaluate every 10 iterations
        if (i + 1) % 10 == 0:
            checkpoint = algo.save()
            print("Model checkpoint saved at", checkpoint)

            eval_result = algo.evaluate()
            print(f'to evaluate model: {eval_result}')  # 4. and evaluate it.

            cur_reward = eval_result['evaluation']['sampler_results']['episode_reward_mean']
            cur_episode_len_mean = eval_result['evaluation']['sampler_results']['episode_len_mean']

            # Write the iteration, reward and episode length to csv
            writer.writerow([i + 1, cur_reward, cur_episode_len_mean])
            # Force the file to be written to disk immediately
            file.flush()
            os.fsync(file.fileno())

            if prev_reward is not None and cur_reward <= prev_reward:
                no_improvement_counter += 1
            else:
                no_improvement_counter = 0
            print(f'evaluated episode_reward_mean: {cur_reward}, no improvement counter: {no_improvement_counter}\n')
            if no_improvement_counter >= 20:
                print(f"Training stopped as the episode_reward_mean did not improve for 20 consecutive evaluations. totalIterNum: {i + 1}")
                break

            prev_reward = cur_reward

I tried increasing the DQN replay_buffer_config capacity to 10,000 and n_step to 20, but that did not help; the results were the same.

reinforcement-learning q-learning dqn rllib
1 Answer

Do you see any progress at all during training?

Assuming you do, my first instinct from looking at your config is that you are applying exploration at test time. You have to make sure the explore flag in the config is set to False, i.e. DQNConfig.evaluation(evaluation_config={"explore": False}).
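
A minimal sketch of how that flag fits into a builder-style config (assuming an RLlib 2.x API; all other settings elided):

from ray.rllib.algorithms.dqn.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("Taxi-v3")
    .framework("torch")
    .evaluation(
        evaluation_interval=10,                # evaluate every 10 training iterations
        evaluation_duration=10,                # ...for 10 episodes each time
        evaluation_duration_unit="episodes",
        evaluation_config={"explore": False},  # act greedily at evaluation time
    )
)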
