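I am working on a multi-head regression problem: for each text I want to predict 5 scores. You can do this by setting problem_type = 'regression' (as given in the transformers code).
The problem is that when I run the model with Trainer, it fails with the following error: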
raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
It works with num_labels = 1, but as soon as I use 5 it throws this error. Below is minimal code for my model and data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=5,
                                                           problem_type="regression")

class MultiRegressionDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.labels = labels
        self.texts = texts

    def __getitem__(self, idx, sanity_check=False):
        output = tokenizer(self.texts[idx], truncation=True,
                           padding="max_length",
                           max_length=128)  # This returns a dict
        output['labels'] = torch.tensor(self.labels[idx])
        return output

data = MultiRegressionDataset(["text1", "text2"], [[1,2,3,4,5], [5,4,3,2,1]])
data.__getitem__(0)  # Gives a value
I also tried
output['labels'] = torch.tensor(self.labels[idx]).unsqueeze(-1)
and return_tensors = "pt", as well as combinations of the two, but none of them made any difference. What am I doing wrong here?
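The problem seems likely to be related to the format of the labels and how they are handled during tokenization and batching. The error indicates that the tensors for different samples in a batch may not all have the same length.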
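One quick way to test that hypothesis is to inspect the raw labels before any tensors are built. The helper below is a hypothetical sketch, not part of transformers: it assumes the labels are a plain Python list with one row of scores per text, and it reports the rows whose length differs from the expected number of scores or whose entries are themselves nested lists.

```python
from numbers import Real


def find_bad_label_rows(labels, num_labels):
    """Return indices of rows that are not flat sequences of num_labels numbers.

    Hypothetical debugging helper; `labels` is assumed to be a list with
    one row of scores per text.
    """
    bad = []
    for idx, row in enumerate(labels):
        # A valid row is a list or tuple whose entries are all plain numbers.
        is_flat = isinstance(row, (list, tuple)) and all(
            isinstance(value, Real) for value in row
        )
        if not is_flat or len(row) != num_labels:
            bad.append(idx)
    return bad


labels = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
print(find_bad_label_rows(labels, num_labels=5))  # prints []
```

An empty result means the labels are rectangular, in which case the batching step is the next place to look.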
Here is a revised version of your MultiRegressionDataset class that may help resolve the issue:
import torch
from transformers import AutoTokenizer

class MultiRegressionDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.labels = labels
        self.texts = texts
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(self.texts[idx], truncation=True, padding="max_length", max_length=128, return_tensors="pt")
        # return_tensors="pt" adds a leading batch dimension of size 1; drop it
        # so each item is 1-D and the items can be stacked into a batch later
        item = {key: value.squeeze(0) for key, value in encoding.items()}
        # Ensure labels are converted to float, which a regression loss needs
        labels = torch.tensor(self.labels[idx], dtype=torch.float32)
        # Flatten the labels if needed
        # labels = labels.view(-1)
        # Remove nested structure in the labels
        labels = labels.squeeze() if labels.dim() > 1 else labels
        # Add the labels to the item dictionary
        item['labels'] = labels
        return item
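Make sure to replace the tokenizer and model loading with the tokenizer and model you actually use. After making these changes, try training your model again and see whether the problem persists.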
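Before going back to Trainer, it can help to batch a couple of items by hand and check the shapes. The sketch below uses only PyTorch and a hypothetical stand-in dataset (ToyDataset, with made-up token ids) so that it runs without downloading a tokenizer. It assumes that a regression head with 5 outputs compares logits of shape (batch, 5) against float labels of the same shape using mean squared error; the zero logits are placeholders for real model output.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in: fixed-length fake token ids plus 5 float scores."""

    def __init__(self, labels, seq_len=8):
        self.labels = labels
        self.seq_len = seq_len

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": torch.zeros(self.seq_len, dtype=torch.long),
            "attention_mask": torch.ones(self.seq_len, dtype=torch.long),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float32),
        }


dataset = ToyDataset([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])
batch = next(iter(DataLoader(dataset, batch_size=2)))

# Each field should be stacked along a new leading batch dimension.
print(batch["input_ids"].shape)  # torch.Size([2, 8])
print(batch["labels"].shape, batch["labels"].dtype)  # torch.Size([2, 5]) torch.float32

# Placeholder logits standing in for the model's 5 regression outputs.
logits = torch.zeros(2, 5)
loss = torch.nn.functional.mse_loss(logits, batch["labels"])
print(loss.item())
```

If the real dataset yields input_ids of shape (2, 1, 128) in this check instead of (2, 128), an extra batch dimension is being added inside __getitem__, which is what return_tensors="pt" does for a single text.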