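I am following the tutorial at https://huggingface.co/docs/transformers/pipeline_tutorial to run inference with transformers pipelines. For example, the following code snippet retrieves NER results from the ner pipeline.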
# KeyDataset is a util that will just output the item we're interested in.
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset
model = ...
tokenizer = ...
pipe = pipeline("ner", model=model, tokenizer=tokenizer)
dataset = load_dataset("my_ner_dataset", split="test")
for extracted_entities in pipe(KeyDataset(dataset, "text")):
    print(extracted_entities)
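In NER, as well as in many other applications, we would also like to get the input back, so that the results can be stored as (text, extracted_entities) pairs for later processing. Basically, I am looking for something like: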
# Desired interface (the pipeline does not yield the input text like this).
# KeyDataset is a util that will just output the item we're interested in.
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset
model = ...
tokenizer = ...
pipe = pipeline("ner", model=model, tokenizer=tokenizer)
dataset = load_dataset("my_ner_dataset", split="test")
for text, extracted_entities in pipe(KeyDataset(dataset, "text")):
    print(text, extracted_entities)
where text is the raw input text (possibly batched) that was fed into the pipeline.
Is this doable?
# Datasets 2.11.0
from datasets import load_dataset
# Transformers 4.27.4, Torch 2.0.0+cu118
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    pipeline,
)
from transformers.pipelines.pt_utils import KeyDataset
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
pipe = pipeline(task="ner", model=model, tokenizer=tokenizer)
dataset = load_dataset("argilla/gutenberg_spacy-ner", split="train")
results = pipe(KeyDataset(dataset, "text"))
# Results come back one per sample, so the enumerate index maps back to the dataset row.
for idx, extracted_entities in enumerate(results):
    print("Original text:\n{}".format(dataset[idx]["text"]))
    print("Extracted entities:")
    for entity in extracted_entities:
        print(entity)
Original text:
Would I wish to send up my name now ? Again I declined , to the polite astonishment of the concierge , who evidently considered me a queer sort of a friend . He was called to his desk by a guest , who wished to ask questions , of course , and I waited where I was . At a quarter to eleven Herbert Bayliss emerged from the elevator . His appearance almost shocked me . Out late the night before ! He looked as if he had been out all night for many nights .
Extracted entities:
{'entity': 'B-PER', 'score': 0.9996532, 'index': 68, 'word': 'Herbert', 'start': 289, 'end': 296}
{'entity': 'I-PER', 'score': 0.9996567, 'index': 69, 'word': 'Bay', 'start': 297, 'end': 300}
{'entity': 'I-PER', 'score': 0.9991698, 'index': 70, 'word': '##lis', 'start': 300, 'end': 303}
{'entity': 'I-PER', 'score': 0.96547437, 'index': 71, 'word': '##s', 'start': 303, 'end': 304}
...
Original text:
And you think our run will be better than five hundred and eighty ? '' `` It should be , unless there is a remarkable change . This ship makes over six hundred , day after day , in good weather . She should do at least six hundred by to-morrow noon , unless there is a sudden change , as I said . '' `` But six hundred would be -- it would be the high field , by Jove ! '' `` Anything over five hundred and ninety-four would be that . The numbers are very low to-night .
Extracted entities:
{'entity': 'B-MISC', 'score': 0.40225995, 'index': 90, 'word': 'Jo', 'start': 363, 'end': 365}
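Each sample in the dataset created by the load_dataset call can be accessed using an index and the associated dictionary key.
Calling the pipeline with a KeyDataset object as input returns an iterable PipelineIterator object. One can therefore enumerate the PipelineIterator object to get both each result and its index, and then use that index to retrieve the associated sample from the dataset.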
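The index lookup described above can also be expressed by zipping the texts with the results. The sketch below is only an illustration: fake_ner_pipe is a hypothetical stand-in for pipe(KeyDataset(dataset, "text")), and the list of dicts is a stand-in for the Huggingface dataset, so that the snippet runs without downloading a model. It assumes, as the enumerate approach also does, that results are yielded one per sample and in input order.

```python
from typing import Dict, Iterable, Iterator, List


def fake_ner_pipe(texts: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Hypothetical stand-in for pipe(KeyDataset(dataset, "text")).

    Yields one list of "entities" per input text, in input order.
    """
    for text in texts:
        # Toy rule instead of a model: every capitalized word is an entity.
        yield [
            {"word": word, "entity": "B-MISC"}
            for word in text.split()
            if word[:1].isupper()
        ]


# Stand-in for a datasets.Dataset with a "text" column.
dataset = [
    {"text": "Herbert Bayliss emerged from the elevator"},
    {"text": "the numbers are very low to-night"},
]

texts = [row["text"] for row in dataset]

# Pair every input text with the result produced for it.
pairs = list(zip(texts, fake_ner_pipe(texts)))

for text, extracted_entities in pairs:
    print(text, extracted_entities)
```

With the real objects, zip(dataset["text"], pipe(KeyDataset(dataset, "text"))) should give the same (text, extracted_entities) pairs under the same ordering assumption.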
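The Huggingface pipeline abstraction is a wrapper around all available pipelines. When a pipeline object is instantiated, it returns the appropriate pipeline based on the task parameter: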
pipe = pipeline(task="ner", model=model, tokenizer=tokenizer)
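Given that the NER task is specified, a TokenClassificationPipeline is returned (side note: "ner" is an alias for "token-classification"). This pipeline (like all other pipelines) inherits from the base class Pipeline.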
The Pipeline base class defines the __call__ function, which TokenClassificationPipeline relies on whenever the instantiated pipeline is called.
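To make the inheritance point concrete, here is a toy sketch of a base class that owns __call__ while a subclass supplies only the task-specific steps. All class names are hypothetical, the capitalization rule stands in for a model, and the three step names are illustrative; this is not the library's source code.

```python
class ToyPipeline:
    """Toy base class: defines __call__ once for every subclass."""

    def __call__(self, text):
        model_inputs = self.preprocess(text)
        model_outputs = self.forward(model_inputs)
        return self.postprocess(model_outputs)

    def preprocess(self, text):
        raise NotImplementedError

    def forward(self, model_inputs):
        raise NotImplementedError

    def postprocess(self, model_outputs):
        raise NotImplementedError


class ToyTokenClassificationPipeline(ToyPipeline):
    """Toy subclass: implements the steps, inherits __call__."""

    def preprocess(self, text):
        # Whitespace split instead of a real tokenizer.
        return text.split()

    def forward(self, tokens):
        # Toy rule instead of a model: title-cased words are persons.
        return [(tok, "B-PER" if tok.istitle() else "O") for tok in tokens]

    def postprocess(self, tagged):
        # Keep only tokens that were tagged as entities.
        return [{"word": tok, "entity": tag} for tok, tag in tagged if tag != "O"]


toy_pipe = ToyTokenClassificationPipeline()
result = toy_pipe("Herbert waited downstairs")
print(result)

# The subclass does not define __call__ itself; it comes from the base class.
print("__call__" in vars(ToyTokenClassificationPipeline))
```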
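Once the pipeline has been instantiated (see above), it is called with the data passed in as a single string, as a list or, when working with full datasets, as a Huggingface dataset wrapped in the KeyDataset class from transformers.pipelines.pt_utils.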
dataset = load_dataset("argilla/gutenberg_spacy-ner", split="train")
results = pipe(KeyDataset(dataset, "text")) # pipeline call
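When the pipeline is called, a check is made on whether the data passed in is iterable, and the appropriate function is then called. For a Huggingface Dataset object, the get_iterator function is called, which returns a PipelineIterator object. Given the known behavior of iterator objects, one can enumerate the object to get back tuples containing a count (from start, which defaults to 0) and the value obtained from iterating over the iterable. The values are the NER extractions for each sample in the dataset. Hence, the following produces the desired result: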
for idx, extracted_entities in enumerate(results):
    print("Original text:\n{}".format(dataset[idx]["text"]))
    print("Extracted entities:")
    for entity in extracted_entities:
        print(entity)
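The dispatch described above can be mimicked with a deliberately simplified toy. ToyDispatchPipeline and its methods are hypothetical and only borrow the get_iterator name for readability; the real checks inside transformers are more involved. The sketch shows a single string being processed directly, any other iterable being wrapped in a lazy iterator, and enumerate attaching a count to each yielded result.

```python
import types


class ToyDispatchPipeline:
    """Toy model of the call dispatch; not the transformers implementation."""

    def run_single(self, text):
        # Toy rule instead of a model: title-cased words are entities.
        return [word for word in text.split() if word.istitle()]

    def get_iterator(self, inputs):
        # Lazily yields one result per input, in input order.
        return (self.run_single(text) for text in inputs)

    def __call__(self, inputs):
        if isinstance(inputs, str):
            return self.run_single(inputs)
        return self.get_iterator(inputs)


# Stand-in for a datasets.Dataset with a "text" column.
dataset = [{"text": "Herbert Bayliss emerged"}, {"text": "no names here"}]
pipe = ToyDispatchPipeline()

single = pipe("Herbert waited")                 # a plain list of entities
results = pipe(row["text"] for row in dataset)  # a lazy iterator of results
print(isinstance(results, types.GeneratorType))

# enumerate yields (count, value) tuples; the count starts at 0 by default.
indexed = list(enumerate(results))
for idx, extracted_entities in indexed:
    print(dataset[idx]["text"], extracted_entities)
```

Because the count starts at 0 by default, it lines up with dataset[idx]; passing start=1 to enumerate would shift every index by one and break that alignment.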