Python spaCy Pattern - How to tag a word based on another word?

Question (votes: 0, answers: 1)

I am trying to write a pattern that tags an entire word as a unit based on a substring. Here is an example:

terms = [{'ent': "UNIT",
         'patterns':[
            [{'lemma':'liter'}]]}]

text = "There were 46 kiloliters of juice available"

I want to tag "kiloliters" as a UNIT based on this pattern. I tried using the lemma, but it does not work in this case.

python nlp spacy
1 Answer
0 votes

You haven't said which model you are using, so I'll use en_core_web_sm:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("There were 46 kiloliters of juice available")

First, none of the tokens has the ent_type UNIT:

for tok in doc:
    print(f"'{tok}': ent_type: '{tok.ent_type_}', lemma: '{tok.lemma_}'")

'There': ent_type: '', lemma: 'there'
'were': ent_type: '', lemma: 'be'
'46': ent_type: 'CARDINAL', lemma: '46'
'kiloliters': ent_type: '', lemma: 'kiloliter'
'of': ent_type: '', lemma: 'of'
'juice': ent_type: '', lemma: 'juice'
'available': ent_type: '', lemma: 'available'

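Moreover, as you can see, the lemma of kiloliters is kiloliter, not liter, so a pattern that matches on the lemma 'liter' never fires. This is a bit annoying, because you don't want to specify milliliter, liter, and so on separately. One alternative is to look for a token that matches a regular expression and follows a CARDINAL token (which also covers numbers written as words, e.g. "two liters"):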

doc = nlp("""
          There were 46 kiloliters of juice available.
          I could not drink more than two liters a day.
          I would only give a child 500 milliliters.
          """
)
# A CARDINAL entity token followed by a token whose text ends in "liter" or "liters"
pattern = [{'ENT_TYPE': 'CARDINAL'},
           {"TEXT": {"REGEX": "^.*(liter)s?$"}}]

matcher.add("unit", [pattern])

matches = matcher(doc, as_spans=True)
for span in matches:
    print(span[-1].text)

Output:

kiloliters
liters
milliliters
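
To see what the regular expression in the pattern accepts on its own, here is a minimal sketch using only Python's re module. It assumes that the REGEX operator is applied to each token's text with search semantics. The token list is hypothetical and hand-written for illustration, not produced by spaCy, and "litres" is included only as a spelling the expression does not cover.

```python
import re

# Same expression as in the Matcher pattern above.
UNIT_RE = re.compile(r"^.*(liter)s?$")

# Hypothetical token texts (hand-written, not actual spaCy output).
tokens = ["46", "kiloliters", "of", "juice", "two", "liters",
          "500", "milliliters", "litres"]

# Keep only the tokens whose full text ends in "liter" or "liters".
unit_like = [t for t in tokens if UNIT_RE.search(t)]
print(unit_like)  # ['kiloliters', 'liters', 'milliliters']
```

Because the expression is anchored at both ends and allows any prefix, it accepts every word ending in "liter(s)" regardless of the metric prefix; in the Matcher pattern, the preceding CARDINAL token is what restricts matches to quantities.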