mat1 and mat2 must have the same dtype, but got Byte and Float

Problem description

I'm trying to implement a deep Q-network (DQN) reinforcement learning agent for the game 2048. The problem I've run into is a data type mismatch during matrix multiplication: one matrix contains Byte data and the other contains Float data.

I'm using this Gymnasium environment - https://github.com/Quentin18/gymnasium-2048

and following the PyTorch DQN tutorial (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html) to set up the DQN agent.

The two lines of code that cause the problem are:

next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values

x = F.relu(self.layer1(x))

I assume target_values is mat1 and x is mat2, but I'm not sure about that.
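
To try to narrow it down, I also put together a tiny standalone snippet (not from my project, just my guess at what triggers the error) that produces the same message when a Byte tensor is passed through an nn.Linear layer, whose weights default to float:

import torch
import torch.nn as nn

layer = nn.Linear(16, 256)                 # weights and bias default to float32
x = torch.zeros(1, 16, dtype=torch.uint8)  # Byte input, like a raw board of tile values
layer(x)  # on my setup this raises: mat1 and mat2 must have the same dtype, but got Byte and Float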

I tried appending .float() to the end of both lines to make them float objects, but I get the same error.

I'm also having trouble with both debugging and printing. I set the training to run for just 1 episode for debugging, but because of the sheer number of transitions I can't work through the problem in the debugger. I also tried printing the dtype of the variables, but since they live inside a method I couldn't figure out how to output them. I'm using PyCharm.
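
For reference, this is roughly the dtype check I was trying to add inside optimize_model(), just before the failing line (the names are the locals from the code below):

# temporary debugging prints, placed just before the target_net(...) call
print("state_batch dtype:", state_batch.dtype)
print("non_final_next_states dtype:", non_final_next_states.dtype)
print("target_net weights dtype:", next(target_net.parameters()).dtype)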

I'm not sure how much of the code will help, since the matrix multiplication involves several values, but I've included the two methods I mentioned.

In particular, any guidance on the right way to debug this would be greatly appreciated. Thank you.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):  # declare a deep q-network class
    def __init__(self, n_observations,
                 n_actions):  # constructor to initialise the DQN, taking the state space and action count as parameters
        super(DQN, self).__init__()  # calls the nn.Module constructor to properly initialise the DQN
        # define three fully connected layers of an NN
        self.layer1 = nn.Linear(16, 256)  # 16 state features as input, outputs 256 features
        self.layer2 = nn.Linear(256, 256)  # input = 256 feat, output = 256 feat
        self.layer3 = nn.Linear(256, n_actions)  # 256 feat as input, outputs one Q-value per action in the env

    # defines how data flows through the network layer
    def forward(self, x):  # x = input state/batch of states
        x = F.relu(self.layer1(x)).float()  # apply ReLU to layer 1 output (.float() added here while trying to fix the dtype error)
        x = F.relu(self.layer2(x))  # apply ReLU activation function to layer 2 output
        return self.layer3(x)  # returns Q-values for each action in given state

def optimize_model():
    if len(memory) < BATCH_SIZE:  # checks if enough transitions are stored to form a batch
        return
    transitions = memory.sample(BATCH_SIZE)  # samples a batch of transitions from replay memory
    batch = Transition(*zip(*transitions))  # converts batch-array of transitions to transition of batch-arrays
    # organises the batch so each component (state, action, reward, n_state) is separated for easy access

    # prepare batch components to feed into an NN
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.n_state)),
                                  dtype=torch.bool)  # boolean mask to indicate which n_states are not final states
    non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None])  # builds a tensor of the non-final n_states
    state_batch = torch.stack(
        [torch.tensor(s) for s in batch.state])  # Convert to tensors & concatenate states & add dimension
    action_batch = torch.stack([torch.tensor(s) for s in batch.action]).unsqueeze(
        1)  # Convert to tensors & concatenate actions & add dimension
    reward_batch = torch.stack([torch.tensor(s) for s in batch.reward])  # Convert to tensors & concatenate rewards

    state_action_values = policy_net(state_batch).gather(1,
                                                         action_batch)  # computes Q-values for state-actions pairs in
                                                                        # batch using policy network

    # computes expected Q-values for next states using target network, maximum Q-value for each non-final n_state
    next_state_values = torch.zeros(BATCH_SIZE)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values.float()
    # Compute the expected Q values using Bellman equation
    expected_state_action_values = (next_state_values * DISCOUNT_FACTOR) + reward_batch

    # Create huber loss function
    criterion = nn.SmoothL1Loss()
    # predicted Q-values by model, expected Q-values for state-action pairs, .unsqueeze to add extra dimension
    loss = criterion(state_action_values,
                     expected_state_action_values.unsqueeze(1))  # loss computed by predicted vs expected Q-values
    optimizer.zero_grad()  # clears parameters of the model
    loss.backward()  # computes new parameters with respect to the loss
    torch.nn.utils.clip_grad_value_(policy_net.parameters(),
                                    100)  # prevents parameters growing too large during training
    optimizer.step()  # update model parameters
deep-learning pytorch reinforcement-learning dqn
1 Answer

What Karl said should do it.

Or change this line:

 non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None])

to:

 non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None], dtype=torch.float)

The root cause is most likely that your transitions are stored with a Byte dtype. Casting back and forth has a performance cost, so make sure you save the transitions with dtype float in the first place. If torch ever decides to mess with your dtypes, my suggestion and Karl's can both stay in as a failsafe.
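
For example, in a tutorial-style collection loop you can cast the observation to float once, at the point where you push the transition into replay memory. This is only a sketch: I'm assuming env, select_action and memory follow the linked PyTorch DQN tutorial, and that the gymnasium-2048 observation comes back as a uint8 array that you flatten to the 16 inputs of your network.

# Sketch only: env, select_action and memory are the tutorial-style objects; the
# observation shape and flattening are assumptions based on the Linear(16, 256) input layer.
obs, info = env.reset()
state = torch.tensor(obs, dtype=torch.float32).flatten()   # store as float, not Byte

action = select_action(state)                              # assumed to return a plain int action here
next_obs, reward, terminated, truncated, _ = env.step(action)
reward = torch.tensor([reward], dtype=torch.float32)
next_state = None if terminated else torch.tensor(next_obs, dtype=torch.float32).flatten()

memory.push(state, action, next_state, reward)             # replay memory now only holds float tensors

With the transitions saved as float, the torch.tensor(...) calls in optimize_model() will infer a float dtype and the extra .float() casts become unnecessary.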
