Huggingface token classification pipeline gives different outputs than calling model() directly

Question · Votes: 0 · Answers: 1

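I am trying to mask named entities in text using a RoBERTa-based model. The suggested way to use the model is through the Huggingface pipeline, but I find it rather slow when used that way. Using the pipeline on text data also prevents me from using my GPU for the computation, because text cannot be put on the GPU.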

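So I decided to put the model on the GPU, tokenize the text myself (with the same tokenizer I pass to the pipeline), put the tokens on the GPU, and then pass them to the model. This works, but the output of the model used directly like this, rather than through the pipeline, is significantly different. I cannot find the reason, nor a way to fix it.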

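I tried reading through the token classification pipeline source code, but I could not find any difference between my usage and what the pipeline does.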

Code samples that produce different results:

Suggested usage from the model card:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

ner_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
classifier = pipeline("ner", model=model, tokenizer=ner_tokenizer, framework='pt')
out = classifier(dataset['text'])

'out' is now a list of lists of dict objects. It holds information about each named entity found in each string of the list dataset['text'].

My custom usage:

text_batch = dataset['text']
encodings_batch = ner_tokenizer(text_batch,padding="max_length", truncation=True, max_length=128, return_tensors="pt")
input_ids = encodings_batch['input_ids']
input_ids = input_ids.to(TORCH_DEVICE)
outputs = model(input_ids)[0]
outputs = outputs.to('cpu')
label_ner_ids = outputs.argmax(dim=2).to('cpu')

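'label_ner_ids' is now a 2-dimensional tensor whose elements are the labels of the tokens in each line of text, so label_ner_ids[i, j] is the label of the j-th token in the i-th string of the list 'text_batch'. The token labels here differ from those in the pipeline's output.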
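For a like-for-like comparison with the pipeline output, the label ids have to be mapped to label strings and the padded positions dropped. The following is only a minimal sketch on plain Python lists: decode_labels is a hypothetical helper, and the id2label mapping shown is made up for illustration (the real mapping would come from model.config.id2label).

```python
def decode_labels(label_ids, attention_mask, id2label):
    """Map per-token label ids to label strings, skipping padded positions."""
    decoded = []
    for ids_row, mask_row in zip(label_ids, attention_mask):
        # Keep only positions where the attention mask marks a real token.
        decoded.append([id2label[i] for i, m in zip(ids_row, mask_row) if m == 1])
    return decoded


# Hypothetical label mapping and toy batch, for illustration only.
id2label = {0: "O", 1: "B-PER", 2: "I-PER"}
label_ids = [[0, 1, 2, 0, 0], [1, 0, 0, 0, 0]]
attention_mask = [[1, 1, 1, 1, 0], [1, 1, 0, 0, 0]]

decoded = decode_labels(label_ids, attention_mask, id2label)
print(decoded)
```

With real tensors, label_ner_ids.tolist() and encodings_batch['attention_mask'].tolist() could be passed in the same way.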

pytorch huggingface-transformers named-entity-recognition huggingface-tokenizers huggingface

1 Answer

0 votes

pipeline supports processing on the GPU. All you need to do is pass device:

from transformers import pipeline

model_id = "xlm-roberta-large-finetuned-conll03-english"

classifier = pipeline("ner", model=model_id, device=TORCH_DEVICE, framework='pt')
out = classifier(dataset['text'])
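Separately from the device question, one difference between the two snippets in the question may be worth checking: the custom call pads to max_length but passes only input_ids, with no attention_mask, so the model also attends to the padding tokens. The sketch below illustrates that effect on a tiny, randomly initialised XLM-RoBERTa. It is not the real checkpoint: the model sizes, the token ids and the large initializer_range are all assumptions chosen so that the example runs without a download.

```python
import torch
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification

torch.manual_seed(0)

# Tiny untrained model with made-up sizes; nothing is downloaded.
config = XLMRobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    num_labels=3,
    initializer_range=0.5,  # large init so the effect is easy to see
)
# A model built from a config starts in training mode, so switch to eval.
model = XLMRobertaForTokenClassification(config).eval()

pad_id = config.pad_token_id
real = torch.tensor([[0, 5, 6, 7, 2]])  # arbitrary ids for <s> ... </s>
padded = torch.tensor([[0, 5, 6, 7, 2, pad_id, pad_id, pad_id]])
mask = (padded != pad_id).long()

with torch.no_grad():
    ref = model(input_ids=real).logits
    with_mask = model(input_ids=padded, attention_mask=mask).logits[:, :5]
    without_mask = model(input_ids=padded).logits[:, :5]

# Largest deviation from the unpadded reference on the real tokens.
err_with_mask = (with_mask - ref).abs().max().item()
err_without_mask = (without_mask - ref).abs().max().item()
print(err_with_mask, err_without_mask)
```

If this is what causes the mismatch, passing attention_mask=encodings_batch['attention_mask'].to(TORCH_DEVICE) alongside input_ids in the direct call should bring its labels closer to the pipeline's. The truncation to 128 tokens in the custom snippet is another difference worth ruling out.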
