I'm trying to implement a deep Q-network reinforcement learning agent for the game 2048. The problem I'm running into is a data-type mismatch during matrix multiplication: one matrix contains Byte data and the other contains Float data.
I'm using this gymnasium environment - https://github.com/Quentin18/gymnasium-2048
and I set up the DQN agent following the PyTorch DQN tutorial (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html).
The two lines of code causing the problem are:
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values
x = F.relu(self.layer1(x))
I assume target_values is mat1 and x is mat2, but I'm not sure about that.
I tried appending .float() to both lines to turn them into float objects, but I get the same error.
I'm also struggling with debugging and printing. I set the run to a single episode for debugging, but because of the number of transitions I can't step through it to find the problem. I also tried printing the dtypes of the variables, but since they live inside a method I couldn't work out how to get them printed. I'm using PyCharm.
I'm not sure how useful the code is, since the matrix multiplication involves several values, but I'll include the two methods I mentioned.
In particular, any guidance on the right way to debug this would be much appreciated. Thank you.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):  # deep Q-network class
    def __init__(self, n_observations, n_actions):  # takes the state-space and action-space sizes as parameters
        super(DQN, self).__init__()  # call the nn.Module constructor to properly initialise the DQN
        # define three fully connected layers
        self.layer1 = nn.Linear(16, 256)         # state (16 values) as input, outputs 256 features
        self.layer2 = nn.Linear(256, 256)        # input = 256 features, output = 256 features
        self.layer3 = nn.Linear(256, n_actions)  # 256 features in, n_actions (4) Q-values out - one per action in the env

    # defines how data flows through the network layers
    def forward(self, x):  # x = input state / batch of states
        x = F.relu(self.layer1(x)).float()  # apply ReLU activation to the layer 1 output
        x = F.relu(self.layer2(x))          # apply ReLU activation to the layer 2 output
        return self.layer3(x)               # Q-values for each action in the given state
def optimize_model():
    if len(memory) < BATCH_SIZE:  # check whether enough transitions are stored to form a batch
        return
    transitions = memory.sample(BATCH_SIZE)  # sample a batch of transitions from replay memory
    batch = Transition(*zip(*transitions))   # convert a batch-array of transitions to a transition of batch-arrays
    # organises the batch so each component (state, action, reward, n_state) is separated for easy access

    # prepare the batch components to feed into the network
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.n_state)),
                                  dtype=torch.bool)  # boolean mask of which n_states are not final states
    non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None])  # stack the non-final n_states
    state_batch = torch.stack([torch.tensor(s) for s in batch.state])                  # convert to tensors & stack the states
    action_batch = torch.stack([torch.tensor(s) for s in batch.action]).unsqueeze(1)   # convert to tensors, stack the actions & add a dimension
    reward_batch = torch.stack([torch.tensor(s) for s in batch.reward])                # convert to tensors & stack the rewards

    # Q-values for the state-action pairs in the batch, computed with the policy network
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # expected Q-values for the next states using the target network: max Q-value for each non-final n_state
    next_state_values = torch.zeros(BATCH_SIZE)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values.float()

    # compute the expected Q-values with the Bellman equation
    expected_state_action_values = (next_state_values * DISCOUNT_FACTOR) + reward_batch

    # Huber loss between predicted and expected Q-values; .unsqueeze(1) adds the extra dimension
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    optimizer.zero_grad()  # clear the gradients of the model parameters
    loss.backward()        # compute gradients of the loss w.r.t. the parameters
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)  # clip gradients so they don't grow too large during training
    optimizer.step()       # update the model parameters
What Karl said should do it.
Or change this line:
non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None])
to
non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None], dtype=torch.float)
The root cause is most likely that your transitions are stored with a Byte dtype. Casting back and forth has a performance cost, so make sure you save the transitions as float in the first place. Both my suggestion and Karl's can then stay in as a failsafe in case torch decides to mess with your dtypes.
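For illustration, here is a minimal, self-contained sketch of that conversion. It assumes the observation is the raw 4x4 board returned as a uint8 array (16 values once flattened, which matches your Linear(16, 256) input); the commented-out memory.push(...) line is only a placeholder for wherever your training loop actually stores the transition:

import numpy as np
import torch

# a 4x4 board the way the environment might return it: uint8 / Byte data
obs = np.zeros((4, 4), dtype=np.uint8)

# torch.tensor(obs) inherits uint8, which is what later crashes the matmul inside layer1
byte_state = torch.tensor(obs).flatten()
print(byte_state.dtype)   # torch.uint8

# convert once, at the moment the transition is stored, so everything downstream is float32
float_state = torch.as_tensor(obs, dtype=torch.float32).flatten()
print(float_state.dtype)  # torch.float32

# memory.push(float_state, action, float_next_state, reward)  # placeholder for your replay-memory call

As for the debugging part of the question: a print(state_batch.dtype, non_final_next_states.dtype) just before the line that fails (or a PyCharm breakpoint on that line plus Evaluate Expression) will tell you immediately which tensor is still uint8.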