I am trying to write a pattern that tags a whole word as a unit based on a substring. Here is an example:
terms = [{'ent': "UNIT",
'patterns':[
[{'lemma':'liter'}]]}]
text = "There were 46 kiloliters of juice available"
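I want to tag "kiloliters" as a UNIT based on this pattern. I tried using 'lemma', but it does not work in this case.
You didn't say which model you are using, so I'll use en_core_web_sm: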
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("There were 46 kiloliters of juice available")
First of all, none of the tokens has an ent_type of UNIT:
for tok in doc:
    print(f"'{tok}': ent_type: '{tok.ent_type_}', lemma: '{tok.lemma_}'")
'There': ent_type: '', lemma: 'there'
'were': ent_type: '', lemma: 'be'
'46': ent_type: 'CARDINAL', lemma: '46'
'kiloliters': ent_type: '', lemma: 'kiloliter'
'of': ent_type: '', lemma: 'of'
'juice': ent_type: '', lemma: 'juice'
'available': ent_type: '', lemma: 'available'
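Also, as you can see, the lemma of kiloliters is kiloliter, not liter, so a pattern on the lemma 'liter' never matches it. This is a bit annoying, because you don't want to list milliliters, liters, and so on separately. One alternative is to look for a token that matches a regular expression and follows a CARDINAL token (CARDINAL also covers number words, e.g. "two liters"):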
doc = nlp("""
There were 46 kiloliters of juice available.
I could not drink more than two liters a day.
I would only give a child 500 milliliters.
"""
)
pattern = [{'ENT_TYPE': 'CARDINAL'},
{"TEXT": {"REGEX": "^.*(liter)s?$"}}]
matcher.add("unit", [pattern])
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span[-1].text)
Output:
kiloliters
liters
milliliters
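To see which words the regular expression accepts without loading a model, it can be tried with Python's re module alone. This is only a rough sketch: it assumes that spaCy's REGEX operator behaves like re.search on the token text, and the word list is made up for illustration.

```python
import re

# Same expression as in the Matcher pattern above.
UNIT_RE = re.compile(r"^.*(liter)s?$")

# Hypothetical sample words, not taken from the original text.
words = ["kiloliters", "liters", "milliliters", "liter", "literal", "juice"]

# Keep only the words that end in "liter" or "liters".
matched = [w for w in words if UNIT_RE.search(w)]
print(matched)
```

Because of the trailing $, a word such as "literal" is rejected even though it contains "liter".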