How can I improve the performance of a Deep Q-Learning Network on the Mountain Car problem?


I'm working through some of the OpenAI Gym problems and seem to be stumped by Mountain Car. I know my Deep Q-Learning agent is working, because it can reliably learn to score 200+ on Lunar Lander. But it seems to really struggle with Mountain Car:

I have tried a range of different hyperparameters, including the network architecture (number of layers), the learning rate, and the epsilon decay (for epsilon-greedy action selection). Unfortunately, none of them seem to make a significant difference to performance. What are good hyperparameters for solving Mountain Car? Or is there perhaps a problem with my implementation?

All of my code is below. I have organised it so that it should (hopefully) run for you without any tweaking.


##IMPORTS##
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from statistics import mean
import numpy as np
import pandas as pd
import random
import sys
from collections import deque, defaultdict, namedtuple
import copy

import gym

##HYPERPARAMS##
SEED = 0
if SEED is not None:
    torch.manual_seed(SEED)
    np.random.seed(SEED)
    random.seed(SEED)
    print(f"Random Seed: {SEED}")

#Set where computations happen
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set Environments
env = gym.make("MountainCar-v0")
eval_env = gym.make("MountainCar-v0", render_mode="human")
    
NUM_ACTIONS = env.action_space.n
STATE_DIM = env.observation_space.shape[0]

# Sizes of the hidden layers in the neural network
HL1_SIZE = STATE_DIM * 8  
HL2_SIZE = STATE_DIM * 8  

# Epsilon-related params
EPSILON = 1.0  #Initial Epsilon Value (action choice is random)
EPSILON_DECAY = 0.99999  #Epsilon Decay Rate
EPSILON_MIN = 0.01  #Minimum Epsilon Value (1% chance of sampling random action)

# Experience Replay related params
REPLAY_SIZE = 100_000 # Max Size of buffer
MIN_MEMORY_SIZE = 1000 #Minimum amount of experiences needed before sampling/updates begin

MAX_EPISODES = 1500
MAX_EPISODE_LENGTH = 1500
BATCH_SIZE = 512
GAMMA = 0.99
LR = 5e-4 
UPDATE_RATE = 50 #How often (in update steps) the target network is overwritten with the main network
MAIN_UPDATE_PERIOD = 4  #Number of actions chosen using main network before it is updated (reduce instability) 
EVAL_PERIOD = 100  #How often to evaluate 


##Set Up Classes##
class DQNet(nn.Module):
    
    def __init__(self, stateDim, actionDim):
        super().__init__()
        
        #Hidden Layer 1
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=stateDim, out_features=HL1_SIZE),
            nn.ReLU(True))
        
        #Hidden Layer 2
        self.fc2 = nn.Sequential(
            nn.Linear(in_features=HL1_SIZE, out_features=HL2_SIZE),
            nn.ReLU(True))

        #Output Layer
        #No activation function because we are estimating Q values
        self.fcOutput = nn.Linear(in_features = HL2_SIZE, out_features = actionDim)  
        
    def forward(self, x):
        
        out = self.fc1(x)
        out = self.fc2(out)
        out = self.fcOutput(out)
        
        return out 

    
class ReplayMemory(object):
    
    def __init__(self, replaySize, batchSize):

        self.batchSize = batchSize
        self.memory = deque(maxlen = replaySize)
        self.experience = namedtuple("Experience", 
                                     field_names=["State", "Action", "NextState", "Reward"])

    def addExperience(self, state, action, nextState, reward):
        
        experience = self.experience(state, action, nextState, reward)
        self.memory.append(experience)
                
    def sample(self):
        
        batchSize = min(self.batchSize, len(self))
        return random.sample(self.memory, batchSize)
                
    def __len__(self):
        return len(self.memory)



class Agent:
    
    def __init__(self, stateSize, numActions):

        
        self.stateSize = stateSize
        self.numActions = numActions
    
        self.netPolicy = DQNet(stateSize, numActions).to(device)
        
        self.netTarget = DQNet(stateSize, numActions).to(device)
        self.netTarget.load_state_dict(self.netPolicy.state_dict()) 
        
        self.optimizer = optim.Adam(self.netPolicy.parameters(), lr = LR)
        self.loss = nn.MSELoss()
        
        self.memory = ReplayMemory(REPLAY_SIZE, BATCH_SIZE)    

        self.numUpdates = 0
        
    

    def update(self):
        
        batch = self.memory.sample()

        states = [experience.State for experience in batch]
        states = torch.tensor(np.array(states), dtype=torch.float32, device=device)
        
        actions = [experience.Action for experience in batch]
        actions = torch.tensor(actions, dtype=torch.int64, device=device)
        
        actions = actions.unsqueeze(1)
        

        rewards = [experience.Reward for experience in batch]
        rewards = torch.tensor(rewards, dtype=torch.float32, device=device)
        
        # If the nextState in ReplayMemory is None, then replace it with a vector of zeros
        nextStates = [[0] * STATE_DIM if experience.NextState is None else experience.NextState
                      for experience in batch]
        nextStates = torch.tensor(np.array(nextStates, dtype=np.float32), device=device)
        # Mask: 1.0 for non-terminal transitions, 0.0 where NextState is None (terminal)
        noNextStateFilter = torch.tensor([experience.NextState is not None for experience in batch],
                                         dtype=torch.float32, device=device)

        
        with torch.no_grad():
            
            self.netTarget.eval()
            allTargetQVals = self.netTarget(nextStates)
            # maxTargetQVals should be 0 if terminal state
            maxTargetQVals = allTargetQVals.max(dim=1)[0] * noNextStateFilter
            
        trueQVals = rewards + (GAMMA * maxTargetQVals )
        trueQVals = trueQVals.unsqueeze(1) #make it a 2-d tensor of shape: (numrow = BATCH_SIZE, numcol = 1)
        
      
        self.netPolicy.train()
        allQVals = self.netPolicy(states)
        # Generate all Q values (Qvalue of every action for a given state)
        
        # Only keep the qvalues that correspond to the action which was actually experienced
        predictedQVals = torch.gather(input = allQVals, dim = 1, index = actions)

        loss = self.loss(predictedQVals, trueQVals)

        self.optimizer.zero_grad()
        loss.backward()
        
        ## Gradient clipping for training stability (clip the total gradient norm to 3)
        # nn.utils.clip_grad_norm_(self.netPolicy.parameters(), 3)
        
        self.optimizer.step()
            
        self.numUpdates += 1
        
        if self.numUpdates % UPDATE_RATE == 0:
            #print(f"Update: {self.numUpdates} - Overriding Target Network...") 
            self.netTarget.load_state_dict(self.netPolicy.state_dict()) 
            
    
    # While evaluating a policy (as opposed to while training) act greedily
    def act_epsilon_greedy(self, state, epsilon, evaluate=False):
        if epsilon > 1 or epsilon < 0:
            raise Exception('Value of epsilon must be between 0 and 1')

        with torch.no_grad():
            self.netPolicy.eval()
            state = torch.tensor(state, dtype=torch.float32, device=device)
            out = self.netPolicy(state)

        maxAction = int(out.argmax())

        if not evaluate and random.random() < epsilon:
            # While exploring sample uniformly from all actions, including the max-action
            action = random.choice(range(NUM_ACTIONS))

        else:
            # While exploiting or during policy evaluation, choose the greedy action
            action = maxAction

        return action
    

    def learn(self, state, action, nextState, reward):
        
        self.memory.addExperience(state, action, nextState, reward)

##Training##
currentEpisode = 1
scoresList = []
epi_len_list = list()
agent = Agent(stateSize = STATE_DIM, numActions = NUM_ACTIONS)

agentsList = []

state, _ = env.reset()


MAX_EPISODES = 400

while currentEpisode < MAX_EPISODES + 1:
    
    score = 0
    episode_length = 0
    terminated = False
    truncated = False

    while not terminated and not truncated and episode_length < MAX_EPISODE_LENGTH:
        
        action = agent.act_epsilon_greedy(state, EPSILON)
        nextState, reward, terminated, truncated , _ = env.step(action)
        
        if terminated and not truncated:
            # On termination, env.step() still returns a vector of floats for nextState,
            # so mark the terminal transition explicitly by setting nextState to None
            
            nextState = None
            
        agent.learn(state, action, nextState, reward)
        state = nextState       
        score += reward
        episode_length += 1
       
        if len(agent.memory) > MIN_MEMORY_SIZE:
            # Decay exploration parameter Ɛ to a min of EPSILON_MIN:
            # Ɛₜ = Ɛ*(decay)ᵗ
            if EPSILON > EPSILON_MIN:
                EPSILON *= EPSILON_DECAY
            
            if episode_length % MAIN_UPDATE_PERIOD == 0:    
                agent.update()
                
    # Evaluate current policy by number of episodes (vs by number of updates)
    if currentEpisode % EVAL_PERIOD == 0:
            
        tempAgent = copy.deepcopy(agent)
        agentsList.append(tempAgent)
    
    #if len(agent.memory) > MIN_MEMORY_SIZE:
    scoresList.append(score)
    epi_len_list.append(episode_length)
    
    if score > -200:

        print(f"Episode: {currentEpisode} - Score: {score} - Length: {episode_length} - " +
              f"Epsilon: {EPSILON}")
        
    currentEpisode += 1 
        
    state, _ = env.reset()
1 Answer

Your question is more open-ended than a precise coding problem with an exact solution.

So I can offer some general advice, based on my experience, to help you improve performance. My suggestions are (a combined settings sketch follows the list):

1- Increase the number of hidden layers and neurons in the neural network to improve its representational capacity. For example, you could try increasing HL1_SIZE and HL2_SIZE to 64 or 128.

2- Lower the learning rate (LR) to a value between 1e-4 and 1e-5 to help the agent converge more smoothly and avoid oscillations.

3- Increase the batch size (BATCH_SIZE) to 1024 or more for more stable updates.

4- Increase the replay memory buffer size (REPLAY_SIZE) to 1 million or more to store more experiences and improve the quality of the data used for training.

5- Increase the target network update interval (UPDATE_RATE) to 100 or more to improve the stability of the learning process.

6- Try different values of the discount factor (GAMMA) between 0.95 and 0.999 and see how that affects performance.

7- Finally, you could try different exploration strategies, such as lowering the epsilon decay rate (EPSILON_DECAY) or using a different exploration method such as Boltzmann exploration.
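
For concreteness, here is a minimal sketch of what those adjusted settings could look like, reusing the constant names from your script. These values are illustrative starting points to experiment with, not hyperparameters known to solve MountainCar-v0.

# Illustrative settings combining suggestions 1-7 above (assumed, not tuned):
HL1_SIZE = 128            # wider hidden layers (suggestion 1)
HL2_SIZE = 128
LR = 1e-4                 # smaller learning rate (suggestion 2)
BATCH_SIZE = 1024         # larger batches for more stable updates (suggestion 3)
REPLAY_SIZE = 1_000_000   # larger replay buffer (suggestion 4)
UPDATE_RATE = 100         # overwrite the target network less often (suggestion 5)
GAMMA = 0.99              # also try values in the 0.95-0.999 range (suggestion 6)
EPSILON_DECAY = 0.999     # adjust to control how quickly exploration decays (suggestion 7)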

If we consider how to improve the performance of the code you already have, I would suggest:

1- Replace the deque with a more efficient data structure: a deque supports fast appends and pops at both ends, which makes it handy for queues and stacks, but it is not the most efficient structure for a replay buffer, because random access into the middle (which is what minibatch sampling does) is slow. Instead, you can use a NumPy array or a fixed-size circular buffer. These structures are efficient for random access and do not need to reallocate memory as elements are added or removed (see the replay-buffer sketch after this list).

2- Compute in batches: instead of computing the Q-value of each state-action pair individually, you can use matrix operations to compute the Q-values for a whole batch of state-action pairs. This is more efficient because GPUs are optimized for parallel matrix operations. To compute in batches, you can stack the state and action tensors with torch.cat() and then compute the Q-values with a single call to the neural network.

3- Use a target network: you can improve the stability of the learning algorithm by using a target network. The target network is a copy of the policy network that is used to compute the target Q-values. Its weights are updated periodically (e.g. every 1000 steps) to match the weights of the policy network. This prevents the target Q-values from oscillating during learning.

4- Reduce the number of with torch.no_grad() blocks: a torch.no_grad() block disables gradient computation and reduces memory usage during the forward pass, but entering and leaving many separate blocks adds a little overhead. You can reduce the number of torch.no_grad() blocks by computing the target-network Q-values and any gradient-free policy-network passes inside the same block.

5- Use a learning rate schedule: instead of using a fixed learning rate, you can use a schedule that decreases the learning rate over time. This can help the learning algorithm converge and avoid oscillations (see the scheduler sketch after this list).

6- Use a more sophisticated exploration strategy: the current strategy is a simple epsilon-greedy policy, where the agent picks a random action with probability epsilon and the greedy action with probability 1-epsilon. There are more sophisticated exploration strategies that can improve learning efficiency, such as Boltzmann exploration, UCB exploration, or noisy networks (see the Boltzmann sketch after this list).

7- Try different network architectures: the current architecture has two hidden layers with ReLU activations. You could try different architectures, such as deeper networks, wider networks, or networks with different activation functions. If the state space were an image, you could also try a convolutional neural network.
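
For point 1, a minimal sketch of a fixed-size circular replay buffer built on preallocated NumPy arrays is shown below. It mirrors the (state, action, nextState, reward) convention of your ReplayMemory and reuses the existing import numpy as np; it assumes flat float state vectors, and the class and method names are illustrative, not from any library.

class CircularReplayBuffer:
    """Fixed-size circular replay buffer backed by preallocated NumPy arrays."""

    def __init__(self, capacity, stateDim, batchSize):
        self.capacity = capacity
        self.batchSize = batchSize
        self.states = np.zeros((capacity, stateDim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.nextStates = np.zeros((capacity, stateDim), dtype=np.float32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.terminal = np.zeros(capacity, dtype=np.float32)  # 1.0 where nextState was None
        self.index = 0
        self.size = 0

    def addExperience(self, state, action, nextState, reward):
        i = self.index
        self.states[i] = state
        self.actions[i] = action
        if nextState is None:                 # terminal transition, same convention as before
            self.nextStates[i] = 0.0
            self.terminal[i] = 1.0
        else:
            self.nextStates[i] = nextState
            self.terminal[i] = 0.0
        self.rewards[i] = reward
        self.index = (self.index + 1) % self.capacity   # overwrite the oldest entry when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self):
        # Uniform random minibatch; indexing NumPy arrays is O(batch) rather than O(n)
        idx = np.random.randint(0, self.size, size=min(self.batchSize, self.size))
        return (self.states[idx], self.actions[idx], self.nextStates[idx],
                self.rewards[idx], self.terminal[idx])

    def __len__(self):
        return self.size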
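
For point 5, PyTorch provides ready-made schedulers in torch.optim.lr_scheduler. A minimal sketch, reusing the script's existing optim import and assuming you keep the Adam optimizer inside Agent and step the scheduler once per episode (an assumed placement, not the only option):

# Inside Agent.__init__, after creating the optimizer:
self.scheduler = optim.lr_scheduler.ExponentialLR(self.optimizer, gamma=0.99)
# gamma here multiplies the learning rate by 0.99 on every call to step();
# it is unrelated to the discount factor GAMMA used for returns.

# Then, in the training loop, e.g. once at the end of each episode:
# agent.scheduler.step()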
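
For point 6, Boltzmann (softmax) exploration samples actions with probability proportional to exp(Q(s, a) / temperature) instead of mixing uniform-random and greedy choices. A minimal sketch of an alternative action-selection method for your Agent class is below; it reuses the script's torch, F, and device names, and the temperature parameter and method name are illustrative assumptions.

# Inside the Agent class:
def act_boltzmann(self, state, temperature=1.0):
    # Higher temperature -> more exploration; as temperature -> 0 this approaches greedy.
    with torch.no_grad():
        self.netPolicy.eval()
        state = torch.tensor(state, dtype=torch.float32, device=device)
        qValues = self.netPolicy(state)
        probs = F.softmax(qValues / temperature, dim=-1)
    # Sample one action index according to the softmax probabilities
    action = int(torch.multinomial(probs, num_samples=1).item())
    return action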
