How to make a "multi-head" regression data loader that works with HuggingFace Transformers & Trainer?


I am working on a multi-head regression problem: for each text I want to predict 5 scores. You can do this by setting problem_type = 'regression' (as given in the transformers code).

The problem is that when I run the model with Trainer, it raises the following error:

Error:

raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

It works with num_labels = 1, but as soon as I use 5 it throws this error. Below is the minimal code for my model and data.

Model:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=5,
                                                           problem_type="regression")

Custom data loader:

class MultiRegressionDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.labels = labels
        self.texts = texts
        

    def __getitem__(self, idx, sanity_check = False):
        output = tokenizer(self.texts[idx], truncation=True,
                              padding="max_length",
                              max_length = 128) # This returns a dict

        output['labels'] = torch.tensor(self.labels[idx])
        
        return output

data = MultiRegressionDataset(["text1", "text2"], [[1,2,3,4,5], [5,4,3,2,1]])

data.__getitem__(0) # Gives a value

I tried:
  1. output['labels'] = torch.tensor(self.labels[idx]).unsqueeze(-1)
  2. return_tensors = "pt"
and combinations of the above.

Nothing helped. What am I doing wrong here?

pytorch nlp regression huggingface-transformers transformer-model
1 Answer

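The problem appears to be related to the format of the labels and how they are handled when samples are combined into a batch. The error message suggests that the per-sample values could not be turned into a single batch tensor, which typically happens when their shapes differ across samples or when a field (here labels) is nested more deeply than expected.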
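One plausible way to end up with this error, assuming the batch is assembled by calling torch.tensor on a list of per-sample label tensors (an assumption about the collation path, not something confirmed in the question), is that this conversion only succeeds when every label tensor holds exactly one element. That would match the observation that one label works and five do not. A minimal sketch with plain PyTorch:

```python
import torch

# Two samples with a single label each: every tensor can be read as a scalar,
# so building a batch with torch.tensor succeeds.
single = [torch.tensor([1.0]), torch.tensor([5.0])]
single_batch = torch.tensor(single)

# Two samples with five labels each: a multi-element tensor cannot be read
# as a scalar, so torch.tensor raises (ValueError or RuntimeError, depending
# on the PyTorch version).
multi = [torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]),
         torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])]
try:
    torch.tensor(multi)
    multi_failed = False
except (ValueError, RuntimeError):
    multi_failed = True

# torch.stack is the operation intended for combining equally shaped tensors.
multi_batch = torch.stack(multi)

print(multi_failed)
print(multi_batch.shape)
```

If this is indeed what happens, returning every field of a sample as a tensor (or every field as a plain Python list) instead of mixing the two should avoid the failing conversion.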

Here is a revised version of your MultiRegressionDataset class that may help resolve the issue:

import torch
from transformers import AutoTokenizer


class MultiRegressionDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.labels = labels
        self.texts = texts
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(self.texts[idx], truncation=True, padding="max_length",
                                  max_length=128, return_tensors="pt")

        # return_tensors="pt" adds a leading batch dimension of size 1;
        # drop it so each field is 1-D and samples stack to (batch, max_length)
        item = {key: value.squeeze(0) for key, value in encoding.items()}

        # Ensure labels are converted to float (required for regression)
        labels = torch.tensor(self.labels[idx], dtype=torch.float32)

        # Remove nested structure in the labels, leaving shape (num_labels,)
        labels = labels.squeeze() if labels.dim() > 1 else labels

        # Add the labels to the sample
        item['labels'] = labels

        return item

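Make sure to replace the tokenizer and model loading with the tokenizer and model you actually use. After making these changes, try training your model again and see whether the problem persists.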
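As a quick sanity check before handing the dataset to Trainer, the sketch below builds a batch by hand and inspects the shapes. It assumes each sample is a dict of 1-D tensors (128 token ids and 5 float labels). The names fake_sample and collate are hypothetical helpers introduced only for illustration, and random logits stand in for the model output so that nothing has to be downloaded.

```python
import torch

NUM_LABELS = 5
MAX_LENGTH = 128


def fake_sample(labels):
    # Hypothetical stand-in for one dataset item: 1-D tensors only.
    return {
        "input_ids": torch.zeros(MAX_LENGTH, dtype=torch.long),
        "attention_mask": torch.ones(MAX_LENGTH, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.float32),
    }


def collate(samples):
    # Stack each field along a new leading batch dimension.
    return {key: torch.stack([s[key] for s in samples]) for key in samples[0]}


batch = collate([fake_sample([1, 2, 3, 4, 5]), fake_sample([5, 4, 3, 2, 1])])

# Random logits stand in for a model output of shape (batch, num_labels).
logits = torch.randn(2, NUM_LABELS)

# Mean squared error between logits and float labels of the same shape.
loss = torch.nn.functional.mse_loss(logits, batch["labels"])

print(batch["input_ids"].shape)
print(batch["labels"].shape, batch["labels"].dtype)
print(loss.dim())
```

The labels should come out with shape (batch, num_labels) and a float dtype. A trailing extra dimension, as produced by unsqueeze(-1), would no longer match the shape of the logits.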
