Problems padding and packing sequences in an LSTM network with PyTorch


I'm trying to build a simple LSTM neural network. I have time-series data, which I split into sequences and batches using PyTorch's Dataset and DataLoader. To account for the variable length of the sequences in the last batch (because the data runs out), I use padding and packing.

I use a collate_fn in the DataLoader, which looks like this:

from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def collate_data(batch):
    sequences, targets = zip(*batch)

    lens = [len(seq) for seq in sequences]
    print(f"Lens before padding: {lens}")

    # Pad every sequence in the batch to the length of the longest one
    padded_seq = pad_sequence(sequences=sequences, batch_first=True,
                              padding_value=float(9.99e10))

    print(f"Lens after padding: {[len(seq) for seq in padded_seq]}")

    padded_targets = pad_sequence(sequences=targets, batch_first=True,
                                  padding_value=float(9.99e10))

    # Pack the padded batch so the LSTM can skip the padded timesteps
    packed_batch = pack_padded_sequence(padded_seq, lengths=lens, batch_first=True,
                                        enforce_sorted=False)

    print(f"Packed batch lengths: {packed_batch.batch_sizes}")

    return packed_batch, padded_targets

My problem arises when I try to unpack the values in the network's forward method. My forward method looks like this:

def forward(self, x):
    lstm = self.lstm
    batch_size = self.batch_size

    # Initial hidden and cell states
    h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
    c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)

    # x is the PackedSequence produced by the collate function
    packed_lstm_out, (hn, cn) = lstm(x, (h0, c0))

    print(f"lstm_out size: {packed_lstm_out.data.size()}")
    # unpack_sequence is torch.nn.utils.rnn.unpack_sequence
    unpacked_lstm_out = unpack_sequence(packed_lstm_out)
    print(f"Unpacked lengths: {[len(seq) for seq in unpacked_lstm_out]}")

    unpacked_lstm_tensor = torch.stack(unpacked_lstm_out, dim=0).float() \
        .requires_grad_(True)

    print(unpacked_lstm_tensor.shape)

    output = self.fc1(unpacked_lstm_tensor[:, -1, :])

    return output

However, I get an error when I call torch.stack(unpacked_lstm_out, dim=0), because the sizes differ. This only happens on the last batch, the one that actually gets padded.

I added print statements, which output this for the last batch:

Lens before padding: [10, 10, 10, 10, 10, 10, 10, 9, 8, 7, 6, 5]
Lens after padding: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
Packed batch lengths: tensor([12, 12, 12, 12, 12, 11, 10,  9,  8,  7])
lstm_out size: torch.Size([105, 16])
Unpacked lengths: [10, 10, 10, 10, 10, 10, 10, 9, 8, 7, 6, 5]

My understanding is that the problem arises when I use pack_padded_sequence(), but I don't know why it happens or how to fix it.

Does anyone know how to resolve this so that all the tensors are the same size after unpacking in the forward function?

python pytorch recurrent-neural-network data-preprocessing
1 Answer

unpack_sequence() also removes the padding, so the sequences are no longer padded to the same length, as you observed. unpacked_lstm_out is a list of length batch_size, where each element has shape (sample's sequence length, hidden_size).
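
To illustrate (a minimal, self-contained sketch with toy tensors, not your model), packing padded sequences and then unpacking them gives back the original, unpadded lengths:

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, unpack_sequence

seqs = [torch.ones(5, 2), torch.ones(3, 2)]   # original lengths 5 and 3
packed = pack_padded_sequence(pad_sequence(seqs, batch_first=True),
                              lengths=[5, 3], batch_first=True,
                              enforce_sorted=False)
unpacked = unpack_sequence(packed)
print([len(s) for s in unpacked])             # [5, 3] -- the padding is gone
# torch.stack(unpacked) fails here for the same reason: shapes (5, 2) vs (3, 2)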

If you want to stack them into a single tensor, you can pad them again with pad_sequence(unpacked_lstm_out, batch_first=True). The final shape will be (batch_size, max_sequence_length, hidden_size).
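
In the forward method that could look something like the sketch below. One caveat (an addition here, not part of the question): after re-padding, indexing [:, -1, :] would read padding values for the shorter sequences, so it's safer to gather each sequence's true last timestep:

unpacked_lstm_out = unpack_sequence(packed_lstm_out)              # list of (seq_len, hidden_size)
lens = torch.tensor([len(seq) for seq in unpacked_lstm_out])
lstm_out = pad_sequence(unpacked_lstm_out, batch_first=True)      # (batch_size, max_seq_len, hidden_size)

# Take each sequence's last valid timestep instead of position -1
last_steps = lstm_out[torch.arange(lstm_out.size(0)), lens - 1]   # (batch_size, hidden_size)
output = self.fc1(last_steps)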

"To account for the variable length of the sequences in the last batch (because the data runs out)"

Perhaps you could just discard those shorter sequences? You lose a little data from the tail, but it means you avoid dealing with variable sequence lengths altogether. Alternatively, you could write a custom batch sampler that batches same-length sequences together (example below). In both cases you can work with regular tensors rather than packed sequences, which is simpler and works seamlessly with the other torch.nn layers.
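
If you take the discard route, one place to do it is the collate function. A sketch, where window_size is a name introduced here for the full sequence length your Dataset normally produces:

import torch

def collate_drop_short(batch, window_size=10):
    # Keep only samples whose sequence has the full window length
    kept = [(seq, tgt) for seq, tgt in batch if len(seq) == window_size]
    sequences, targets = zip(*kept)  # assumes at least one full-length sample per batch
    return torch.stack(sequences), torch.stack(targets)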


Below is some code I've used before to draw batches from a dataset such that every batch contains sequences of the same length. For example, the first batch might be (batch_size, sequences that all have length 5), and the next random batch might be (batch_size, sequences that all have length 13). SameLengthsBatchSampler yields the indices of the samples to use, rather than the samples themselves. It's supplied to the batch_sampler= argument of DataLoader().

import numpy as np
import torch
from torch.utils.data import Sampler

#Batch sampler: yields (B, sample indices where each sample has same seq_length).
class SameLengthsBatchSampler(Sampler):
    def __init__(self, sentences, batch_size, drop_last=False):
        lengths = np.array([len(sentence) for sentence in sentences])
        unique_lengths, counts = np.unique(lengths, return_counts=True)

        #Only consider sequence lengths where count >= batch_size
        unique_lengths = unique_lengths[counts >= batch_size]
        counts = counts[counts >= batch_size]

        same_lens_dict = {}
        for length in unique_lengths:
            same_lens_dict[length] = np.argwhere(lengths == length).ravel()

        self.same_lens_dict = same_lens_dict #samples organised by sequence len
        self.unique_lengths = unique_lengths
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __len__(self):
        #Count the batches one pass of __iter__ would yield
        return sum(1 for _ in self.__iter__())

    def __iter__(self):
        #Visit the unique sequence lengths in random order
        for seq_len in np.random.permutation(self.unique_lengths):
            #All samples with this length, shuffled
            sample_indices = torch.tensor(self.same_lens_dict[seq_len])
            shuffled_ixs = sample_indices[torch.randperm(len(sample_indices))]

            #Split tensor into batch-sized tensors
            indices_per_batch = torch.split(shuffled_ixs, self.batch_size)

            if self.drop_last and len(indices_per_batch[-1]) < self.batch_size:
                indices_per_batch = indices_per_batch[:-1]

            #Yield each batch as a plain list of dataset indices
            for batch_indices in indices_per_batch:
                yield batch_indices.tolist()

#
# Batch data
#
batch_size = 32

train_loader = DataLoader(
    train_dataset,
    batch_sampler=SameLengthsBatchSampler(trn_sentences, batch_size)
)

val_loader = DataLoader(
    val_dataset,
    batch_sampler=SameLengthsBatchSampler(val_sentences, batch_size)
)
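
As a quick sanity check (assuming the dataset returns (sequence, target) pairs), every batch drawn this way contains a single sequence length, so the default collate_fn can stack it into a regular tensor:

sequences, targets = next(iter(train_loader))
print(sequences.shape)  # e.g. (32, 5, ...) for one batch, (32, 13, ...) for another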

There is some discussion of this kind of functionality here.
