Python spaCy pattern - how to tag a word based on another word?

Question

I am trying to write a pattern that tags a whole word as a unit based on a substring. Here is an example:

terms = [{'ent': "UNIT",
         'patterns':[
            [{'lemma':'liter'}]]}]

text = "There were 46 kiloliters of juice available"

I want to tag "kiloliters" as a UNIT based on this pattern. I tried using 'lemma', but it does not work in this case.

Answer 1

You haven't said which model you are using, so I'll use en_core_web_sm:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("There were 46 kiloliters of juice available")

First, none of these tokens has an ent_type of UNIT:

for tok in doc:
    print(f"'{tok}': ent_type: '{tok.ent_type_}', lemma: '{tok.lemma_}'")

'There': ent_type: '', lemma: 'there'
'were': ent_type: '', lemma: 'be'
'46': ent_type: 'CARDINAL', lemma: '46'
'kiloliters': ent_type: '', lemma: 'kiloliter'
'of': ent_type: '', lemma: 'of'
'juice': ent_type: '', lemma: 'juice'
'available': ent_type: '', lemma: 'available'
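The lemma column above also suggests why the pattern {'lemma': 'liter'} from the question never fires: a plain string value in a Matcher pattern is compared for equality with the token attribute, not searched for as a substring. Here is a minimal stdlib-only sketch of that difference, using the lemmas printed above as a plain list (the variable names are illustrative, and no spaCy call is involved):

```python
import re

# Lemmas copied from the output above.
lemmas = ["there", "be", "46", "kiloliter", "of", "juice", "available"]

# Equality test: this mirrors a plain string value such as {'lemma': 'liter'}.
exact = [lemma for lemma in lemmas if lemma == "liter"]

# Suffix test: a regular expression anchored at the end of the lemma.
suffix = [lemma for lemma in lemmas if re.search(r"liter$", lemma)]

print(exact)   # []
print(suffix)  # ['kiloliter']
```

The equality test finds nothing, while the suffix regex picks up the prefixed unit.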

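Also, as you can see, the lemma of kiloliters is kiloliter, so a pattern on the lemma liter cannot match it. That is a bit annoying, since you don't want to list milliliters, liters, etc. separately. An alternative is to look for a token that matches a regular expression and follows a CARDINAL token (CARDINAL also covers number words, e.g. "two liters"):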

doc = nlp("""
          There were 46 kiloliters of juice available.
          I could not drink more than two liters a day.
          I would only give a child 500 milliliters.
          """
)
pattern = [{'ENT_TYPE': 'CARDINAL'},
           {"TEXT": {"REGEX": "^.*(liter)s?$"}}]

matcher.add("unit", [pattern])

matches = matcher(doc, as_spans=True)
for span in matches:
    print(span[-1].text)

Output:

kiloliters
liters
milliliters
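The snippet above only prints the matched words. If the goal is to label them as UNIT entities, as the question asks, one possible approach is to turn the last token of each match into a Span and write it to doc.ents. The following is a sketch under assumptions, not the answer's own method: it uses spacy.blank("en") so that no trained model is required, and it therefore swaps the ENT_TYPE check for the lexical attribute LIKE_NUM, which also accepts number words such as "two". The label name UNIT is taken from the question.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: a blank English pipeline (tokenizer only), so ENT_TYPE is
# unavailable and LIKE_NUM stands in for the CARDINAL check.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_NUM": True},
           {"LOWER": {"REGEX": r"^\w*liters?$"}}]
matcher.add("UNIT", [pattern])

doc = nlp("There were 46 kiloliters of juice available. "
          "I could not drink more than two liters a day. "
          "I would only give a child 500 milliliters.")

# Keep only the last token of each match (the unit word) and label it UNIT.
units = [Span(doc, end - 1, end, label="UNIT") for _, start, end in matcher(doc)]
doc.set_ents(units)

tagged = [(ent.text, ent.label_) for ent in doc.ents]
print(tagged)
```

With a trained pipeline such as en_core_web_sm, doc.set_ents(units) would replace the entities predicted by the model, so the existing doc.ents would need to be merged with the new spans if both are wanted.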
