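I am trying to use spaCy to extract specific information from text. To do that, I need to configure a custom tokenizer to recognize the terms, and a custom tagger to tag every word that appears in an external dictionary stored as JSON.
The tokenizer took several attempts, but the tagger runs into a problem even on simple text. The tag I want to add to those words is a custom POS tag, "UNM", and I would like to be able to read it from token.pos_ like all the other tags ("NOUN", "VERB", and so on).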
import requests
#keywords dictionary
dictionary = requests.get(
"https://github.com/dglopes/NBR15575/raw/main/unidades_medidas.json").json()
import spacy
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc
from spacy.language import Language
!python -m spacy download pt_core_news_md
#Custom Tokenizer
class NBRTokenizer(Tokenizer):
    def __init__(self, vocab):
        super().__init__(vocab)
        for unit in dicionario_unidades:
            self.add_special_case(unit, [{ORTH: unit}])
#Creating custom tagger
@Language.component("keyword_pos_tagger")
class KeywordPosTagger:
    def __init__(self, nlp, keywords, pos_tag):
        self.keywords = keywords
        self.pos_tag = pos_tag
        Doc.set_extension('pos_tag', default=None, force=True)

    def __call__(self, doc):
        for token in doc:
            if token.text in self.keywords:
                token._.pos_tag = self.pos_tag
        return doc
nlp = spacy.load('pt_core_news_md')
keywords = dictionary
pos_tag = 'UNM'
keyword_pos_tagger = KeywordPosTagger(nlp, keywords, pos_tag)
nlp.add_pipe('keyword_pos_tagger')
And using the custom tagger:
doc = nlp('A temperatura tem 159ºC ou 20 ºC. Também precisa ter 20m de largura e 14 m² de área, caso contrário terá 1 Kelvin (W/K)')
for token in doc:
    print(token.text, token._.pos_tag)
But it returns:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-3c241e1c89fd> in <cell line: 1>()
----> 1 doc = nlp('A temperatura tem 159ºC ou 20 ºC. Também precisa ter 20m de largura e 14 m² de área, caso contrário terá 1 Kelvin (W/K)')
2 for token in doc:
3 print(token.text, token._.pos_tag)
2 frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
1045 raise ValueError(Errors.E109.format(name=name)) from e
1046 except Exception as e:
-> 1047 error_handler(name, proc, [doc], e)
1048 if not isinstance(doc, Doc):
1049 raise ValueError(Errors.E005.format(name=name, returned_type=type(doc)))
/usr/local/lib/python3.10/dist-packages/spacy/util.py in raise_error(proc_name, proc, docs, e)
1722
1723 def raise_error(proc_name, proc, docs, e):
-> 1724 raise e
1725
1726
/usr/local/lib/python3.10/dist-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
1040 error_handler = proc.get_error_handler()
1041 try:
-> 1042 doc = proc(doc, **component_cfg.get(name, {})) # type: ignore[call-arg]
1043 except KeyError as e:
1044 # This typically happens if a component is not initialized
TypeError: KeywordPosTagger.__init__() missing 2 required positional arguments: 'keywords' and 'pos_tag'
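You need to provide the settings in the add_pipe method through its config dict. In your code, the keyword_pos_tagger variable is a stranded component that is never actually added to the nlp pipeline. It shares the same vocab and you could use it for unit tests, but a component created this way cannot be added to the pipeline.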
nlp.add_pipe("keyword_pos_tagger", config={"keywords": keywords, "pos_tag": pos_tag})
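Here is a minimal sketch of how the pieces might fit together, under a few assumptions that go beyond the answer. As far as I know, `@Language.component` is intended for stateless functions, so a class that needs settings would be registered with `@Language.factory` instead, and the factory function then receives the `config` values as arguments. The extension is registered on `Token` rather than `Doc` because the code reads `token._.pos_tag`. The blank `pt` pipeline, the factory function name `create_keyword_pos_tagger`, and the two-item `keywords` list are placeholders for illustration, not the asker's model or dictionary.

```python
from typing import List

import spacy
from spacy.language import Language
from spacy.tokens import Token

# Register the attribute on Token, since it is read as token._.pos_tag.
Token.set_extension("pos_tag", default=None, force=True)


class KeywordPosTagger:
    """Marks every token whose text is in `keywords` with `pos_tag`."""

    def __init__(self, keywords: List[str], pos_tag: str):
        self.keywords = set(keywords)
        self.pos_tag = pos_tag

    def __call__(self, doc):
        for token in doc:
            if token.text in self.keywords:
                token._.pos_tag = self.pos_tag
        return doc


# The factory receives nlp and name from spaCy, plus the values from `config`.
@Language.factory(
    "keyword_pos_tagger",
    default_config={"keywords": [], "pos_tag": "UNM"},
)
def create_keyword_pos_tagger(nlp, name, keywords: List[str], pos_tag: str):
    return KeywordPosTagger(keywords, pos_tag)


# Assumption: a blank Portuguese pipeline stands in for pt_core_news_md.
nlp = spacy.blank("pt")
keywords = ["m", "Kelvin"]  # hypothetical stand-in for the JSON dictionary
nlp.add_pipe(
    "keyword_pos_tagger",
    config={"keywords": keywords, "pos_tag": "UNM"},
)

doc = nlp("A sala tem 20 m de largura e 1 Kelvin")
tagged = [(token.text, token._.pos_tag) for token in doc]
for text, tag in tagged:
    print(text, tag)
```

Regarding the wish to read the tag from token.pos_: to my knowledge that attribute only accepts Universal POS values, so a custom label such as "UNM" would more likely belong in token.tag_ or in a custom extension like the one above.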