Custom spaCy tagger to tag all words that are in a dictionary

Question

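I'm trying to use spaCy to extract specific information from text. For that, I need to configure a custom tokenizer to identify the relevant terms and a custom tagger to tag all the words that appear in an external dictionary stored in JSON format.

The tokenizer works after several attempts, but the tagger runs into problems when processing a simple text. The tag I want to add to the words is a custom POS tag "UNM", and I would like to be able to assign it to token.pos_ just like all the other tags "NOUN", "VERB", etc.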

import requests

#keywords dictionary
dictionary = requests.get(
    "https://github.com/dglopes/NBR15575/raw/main/unidades_medidas.json").json()

    
import spacy
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc
from spacy.language import Language
!python -m spacy download pt_core_news_md

#Custom Tokenizer
class NBRTokenizer(Tokenizer):
  def __init__(self, vocab):
      super().__init__(vocab)
      for unit in dicionario_unidades:
        self.add_special_case(unit, [{ORTH: unit}])

#Creating custom tagger
@Language.component("keyword_pos_tagger")
class KeywordPosTagger:
   def __init__(self, nlp, keywords, pos_tag):
       self.keywords = keywords
       self.pos_tag = pos_tag
       Doc.set_extension('pos_tag', default=None, force=True)

   def __call__(self, doc):
       for token in doc:
           if token.text in self.keywords:
               token._.pos_tag = self.pos_tag
       return doc

nlp = spacy.load('pt_core_news_md')

keywords = dictionary
pos_tag = 'UNM'
keyword_pos_tagger = KeywordPosTagger(nlp, keywords, pos_tag)

nlp.add_pipe('keyword_pos_tagger')

And using the custom tagger:

doc = nlp('A temperatura tem 159ºC ou 20 ºC. Também precisa ter 20m de largura e 14 m² de área, caso contrário terá 1 Kelvin (W/K)')
for token in doc:
   print(token.text, token._.pos_tag)

But it returns:

   ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-3c241e1c89fd> in <cell line: 1>()
----> 1 doc = nlp('A temperatura tem 159ºC ou 20 ºC. Também precisa ter 20m de largura e 14 m² de área, caso contrário terá 1 Kelvin (W/K)')
      2 for token in doc:
      3    print(token.text, token._.pos_tag)

2 frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
   1045                 raise ValueError(Errors.E109.format(name=name)) from e
   1046             except Exception as e:
-> 1047                 error_handler(name, proc, [doc], e)
   1048             if not isinstance(doc, Doc):
   1049                 raise ValueError(Errors.E005.format(name=name, returned_type=type(doc)))

/usr/local/lib/python3.10/dist-packages/spacy/util.py in raise_error(proc_name, proc, docs, e)
   1722 
   1723 def raise_error(proc_name, proc, docs, e):
-> 1724     raise e
   1725 
   1726 

/usr/local/lib/python3.10/dist-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
   1040                 error_handler = proc.get_error_handler()
   1041             try:
-> 1042                 doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
   1043             except KeyError as e:
   1044                 # This typically happens if a component is not initialized

TypeError: KeywordPosTagger.__init__() missing 2 required positional arguments: 'keywords' and 'pos_tag'

Answer 1

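You need to provide the config settings in the add_pipe method through a config dictionary. In your code, the keyword_pos_tagger variable is a stranded component that is never actually added to the nlp pipeline. It shares the same vocab, and you could use it for unit tests, but a component created this way cannot be added to the pipeline.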

nlp.add_pipe("keyword_pos_tagger", config={"keywords": keywords, "pos_tag": pos_tag})
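
For the config passed to add_pipe to reach the component, the component generally has to be registered with @Language.factory rather than @Language.component, and the custom attribute has to be set on Token because it is written per token. The following is a minimal sketch under those assumptions. The factory function name create_keyword_pos_tagger, the short keyword list, the sample sentence, and the use of a blank Portuguese pipeline (to avoid downloading pt_core_news_md) are illustrative choices, not part of the original code.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# The attribute is written on individual tokens, so register it on Token.
Token.set_extension("pos_tag", default=None, force=True)


class KeywordPosTagger:
    """Stateful component that labels tokens found in a keyword list."""

    def __init__(self, keywords, pos_tag):
        self.keywords = set(keywords)
        self.pos_tag = pos_tag

    def __call__(self, doc):
        for token in doc:
            if token.text in self.keywords:
                token._.pos_tag = self.pos_tag
        return doc


# A factory receives nlp and name, plus the settings supplied via `config`.
@Language.factory(
    "keyword_pos_tagger",
    default_config={"keywords": [], "pos_tag": "UNM"},
)
def create_keyword_pos_tagger(nlp, name, keywords, pos_tag):
    return KeywordPosTagger(keywords, pos_tag)


# Assumption: a blank pipeline stands in for pt_core_news_md here.
nlp = spacy.blank("pt")

# Assumption: a small hand-written list instead of the downloaded JSON.
keywords = ["m", "Kelvin"]
nlp.add_pipe("keyword_pos_tagger", config={"keywords": keywords, "pos_tag": "UNM"})

doc = nlp("A sala tem 20 m de largura e 1 Kelvin")
tags = [(token.text, token._.pos_tag) for token in doc]
for text, tag in tags:
    print(text, tag)
```

Note that this stores the label in the custom attribute token._.pos_tag. As far as I know, token.pos_ only accepts Universal Dependencies tags, so a custom label such as "UNM" would probably have to live in token.tag_ or in a custom extension like the one above.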
