I'm trying to build a simple LSTM neural network. I have time-series data, which I split into sequences and batches using PyTorch's Dataset and DataLoader. To handle the variable sequence lengths in the last batch (where the data runs out), I use padding and packing. My DataLoader uses a collate_fn that looks like this:
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def collate_data(batch):
    sequences, targets = zip(*batch)
    lens = [len(seq) for seq in sequences]
    print(f"Lens before padding: {lens}")
    padded_seq = pad_sequence(sequences=sequences, batch_first=True,
                              padding_value=float(9.99e10))
    print(f"Lens after padding: {[len(seq) for seq in padded_seq]}")
    padded_targets = pad_sequence(sequences=targets, batch_first=True,
                                  padding_value=float(9.99e10))
    packed_batch = pack_padded_sequence(padded_seq, lengths=lens, batch_first=True,
                                        enforce_sorted=False)
    print(f"Packed batch lengths: {packed_batch.batch_sizes}")
    return packed_batch, padded_targets
My problem arises when I try to unpack the values in the network's forward method. My forward method looks like this:
def forward(self, x):
    lstm = self.lstm
    batch_size = self.batch_size
    h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
    c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
    packed_lstm_out, (hn, cn) = lstm(x, (h0, c0))
    print(f"lstm_out size: {packed_lstm_out.data.size()}")
    unpacked_lstm_out = unpack_sequence(packed_sequences=packed_lstm_out)
    print(f"Unpacked lengths: {[len(seq) for seq in unpacked_lstm_out]}")
    unpacked_lstm_tensor = torch.stack(unpacked_lstm_out, dim=0).float().requires_grad_(True)
    print(unpacked_lstm_tensor.shape)
    output = self.fc1(unpacked_lstm_tensor[:, -1, :])
    return output
However, when I call torch.stack(unpacked_lstm_out, dim=0) I get an error because the sizes differ. This only happens for the last batch, which is the one that should be padded. I added print statements, which output the following for the last batch:
Lens before padding: [10, 10, 10, 10, 10, 10, 10, 9, 8, 7, 6, 5]
Lens after padding: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
Packed batch lengths: tensor([12, 12, 12, 12, 12, 11, 10, 9, 8, 7])
lstm_out size: torch.Size([105, 16])
Unpacked lengths: [10, 10, 10, 10, 10, 10, 10, 9, 8, 7, 6, 5]
My understanding is that the problem comes from my use of pack_padded_sequence(), but I don't know why it happens or how to fix it. Does anyone know how to solve this so that all tensors are the same size after unpacking in the forward function?
unpack_sequence() also removes the padding, so the sequences are no longer padded to the same length, just as you observed. unpacked_lstm_out is a list of length batch_size, where each element has shape (sample's sequence length, hidden_size).
If you want to stack them into a single tensor, you can pad them again with pad_sequence(unpacked_lstm_out, batch_first=True). The final shape will be (batch_size, max_sequence_length, hidden_size).
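For instance, the tail of your forward() method could look roughly like this (a minimal sketch reusing the variable names from your code; the LSTM call itself is unchanged):

from torch.nn.utils.rnn import unpack_sequence, pad_sequence

# Inside forward(), after the LSTM call:
unpacked_lstm_out = unpack_sequence(packed_lstm_out)  # list of (seq_len_i, hidden_size)
# Re-pad so every sequence has length max_sequence_length and stacking works:
padded_lstm_out = pad_sequence(unpacked_lstm_out, batch_first=True)
# padded_lstm_out has shape (batch_size, max_sequence_length, hidden_size).
# Note: for the shorter sequences, position -1 now contains padding (zeros by default).
output = self.fc1(padded_lstm_out[:, -1, :])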
Regarding "to handle the variable sequence lengths in the last batch (where the data runs out)": maybe you can simply drop those shorter sequences? You'd lose a little data at the tail end, but it means you avoid dealing with variable sequence lengths altogether (see the sketch right after this paragraph). Alternatively, you can write a custom batch sampler that batches same-length sequences together (full example further below). In both cases you work with regular tensors rather than packed sequences, which is simpler and works seamlessly with the other torch.nn layers.
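A minimal sketch of the first option, assuming the dataset is built from a list of (sequence, target) windows; all_windows and window_size are hypothetical names:

# Keep only full-length windows; the short tail windows are discarded,
# so every batch can be stacked into a regular tensor without padding.
window_size = 10  # hypothetical fixed sequence length
full_windows = [(seq, tgt) for seq, tgt in all_windows if len(seq) == window_size]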
Here is some code I've used before to draw batches from a dataset where each batch contains sequences of the same length. For example, the first batch might be (batch_size, sequences that all have length 5), and the next random batch might be (batch_size, sequences that all have length 13). SameLengthsBatchSampler yields the indices of the samples to use, rather than the samples themselves. It is supplied to the batch_sampler= argument of DataLoader().
import numpy as np
import torch
from torch.utils.data import DataLoader, Sampler

# Batch sampler: yields (B, sample indices) where each sample has the same seq_length.
class SameLengthsBatchSampler(Sampler):
    def __init__(self, sentences, batch_size, drop_last=False):
        lengths = np.array([len(sentence) for sentence in sentences])
        unique_lengths, counts = np.unique(lengths, return_counts=True)

        # Only consider sequence lengths where count >= batch_size
        unique_lengths = unique_lengths[counts >= batch_size]
        counts = counts[counts >= batch_size]

        same_lens_dict = {}
        for length in unique_lengths:
            same_lens_dict[length] = np.argwhere(lengths == length).ravel()

        self.same_lens_dict = same_lens_dict  # samples organised by sequence len
        self.unique_lengths = unique_lengths
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __len__(self):
        return sum(1 for _ in self.__iter__())

    def __iter__(self):
        for seq_len in self.unique_lengths[torch.randperm(len(self.unique_lengths))]:
            # All samples with this length
            sample_indices = torch.tensor(self.same_lens_dict[seq_len])
            shuffled_ixs = sample_indices[torch.randperm(len(sample_indices))]

            # Split tensor into batch-sized tensors
            indices_per_batch = torch.split(shuffled_ixs, self.batch_size)
            if self.drop_last and len(indices_per_batch[-1]) < self.batch_size:
                indices_per_batch = indices_per_batch[:-1]

            if False:  # print batch details
                print('sequence_length={} | yielding {} samples over {} batches'.format(
                    seq_len, len(sample_indices), len(indices_per_batch)
                ))

            # yield over the batch indices
            yield from indices_per_batch
#
# Batch data
#
batch_size = 32

train_loader = DataLoader(
    train_dataset,
    batch_sampler=SameLengthsBatchSampler(trn_sentences, batch_size)
)
val_loader = DataLoader(
    val_dataset,
    batch_sampler=SameLengthsBatchSampler(val_sentences, batch_size)
)
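With this sampler in place, every batch is already a regular tensor whose sequences share one length, so the training loop needs no packing at all (a sketch; model stands in for your LSTM network):

# Each batch stacks cleanly because all its sequences have the same length.
for sequences, targets in train_loader:
    # sequences: (batch_size, seq_len, num_features); seq_len varies per batch
    output = model(sequences)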