Mapping embeddings to labels in PyTorch/Huggingface

Question · Votes: 0 · Answers: 2

I am currently working on a project where I use a pre-trained Transformer model to generate embeddings for DNA sequences (some labeled "1", some labeled "0"). I am trying to map these embeddings back to the corresponding labels in the dataset, but I run into an IndexError when I try to do so. I think it is related to the fact that I am batching, because otherwise I run out of memory.

Here is the code I am using:

from datasets import Dataset
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")

# Load the dataset
ds1 = Dataset.from_file('training.arrow') #this is already tokenized

# Convert tokenized sequences to tensor
inputs = torch.tensor(ds1['input_ids']).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

# Reduce batch size
batch_size = 4

# Pass tokenized sequences through the model with reduced batch size
with torch.no_grad():
    outputs = model(input_ids=inputs[:batch_size], output_hidden_states=True)

# Extract embeddings
hidden_states = outputs.hidden_states
embeddings1 = hidden_states[-1]

Here is some information about the size of the output embeddings and the original dataset:

embeddings1.shape
torch.Size([4, 86, 1280])


ds1
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 22535512
})

I am having trouble figuring out how to map the labels back onto the output embeddings, especially given the huge difference in dimensions. As you can see, I have 22 million sequences, and I would like one embedding per sequence.

My plan is to use these embeddings for downstream prediction with another model. I have already split the data into train, test, and validation sets, but would it make more sense to get the embeddings for the label-1 dataset and the label-0 dataset separately, then merge and re-split into train/test afterwards, so that I don't have to worry about mapping the labels?
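
For clarity, this is roughly what I mean by "one embedding per sequence": a minimal mean-pooling sketch over the token dimension, assuming the attention_mask column from the dataset is used to ignore padding (the mean_pool function name is just illustrative):

import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: [batch, seq_len, hidden], attention_mask: [batch, seq_len]
    mask = attention_mask.unsqueeze(-1).float()      # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over tokens -> [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sequence
    return summed / counts                           # one vector per sequence

# e.g. for the first batch shown above:
# pooled = mean_pool(embeddings1, torch.tensor(ds1['attention_mask'][:batch_size]))
# pooled.shape -> torch.Size([4, 1280])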

python tensorflow pytorch huggingface-transformers word-embedding
2 Answers
0 votes

You can use the map function to compute the embeddings and store them in the same dataset:

import torch
from transformers import DataCollatorWithPadding

# Pad each batch to a common length and return PyTorch tensors
collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors='pt')

def embed(batch):
    inputs = collator({'input_ids': batch['input_ids']})
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Last hidden layer: [batch, seq_len, hidden]
    hidden_states = outputs.hidden_states
    embeddings = hidden_states[-1]
    return {'embeddings': embeddings.detach().cpu()}

# Adds an 'embeddings' column alongside the existing 'labels' column
ds1 = ds1.map(embed, batched=True, batch_size=4)
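
Because map writes the embeddings back into the same dataset, each row keeps its original labels value, so no separate mapping step is needed. A small check, assuming the column names from the question:

ds1.set_format('torch', columns=['embeddings', 'labels'])
print(ds1[0]['labels'], ds1[0]['embeddings'].shape)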

0 votes

You can use the dataset's .map function to append the embeddings. I recommend running this on a GPU rather than a CPU, since the number of rows is very high.

Try running the code below.

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
model = AutoModel.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref", device_map = device)

# Load the dataset
ds = Dataset.from_file('training.arrow') #this is already tokenized

# Reduce batch size
batch_size = 4

def get_embeddings(data):

    # Convert tokenized sequences to tensor
    input_ids =  torch.tensor(data['input_ids']).to(device)

    # Pass tokenized sequences through the model with reduced batch size
    with torch.no_grad():
        outputs = model(input_ids, output_hidden_states=True)
    
    hidden_states = outputs.hidden_states
    embeddings = hidden_states[-1]

    return {'embeddings' : embeddings.detach().cpu()}

# Extract embeddings
ds = ds.map(get_embeddings, batched=True, batch_size=batch_size)
ds
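
The embedding pass over 22 million rows is expensive, so it is worth persisting the augmented dataset once it has been computed. A small sketch, assuming an output path of your choosing (the path name here is just an example):

# Save the dataset with its new 'embeddings' column to disk
ds.save_to_disk('training_with_embeddings')

# Later: reload and hand embeddings + labels to the downstream model
from datasets import load_from_disk
ds = load_from_disk('training_with_embeddings')
ds.set_format('torch', columns=['embeddings', 'labels'])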