spaCy EntityRuler does not work for a cardinal number (Social Security number)


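I have used the EntityRuler to add a new label for Social Security numbers. I even set overwrite_ents=True, but the number is still not recognized.

I have verified that the regular expression is correct. I'm not sure what else I need to do. I also tried before="ner", but the result is the same.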

import spacy
from spacy.pipeline import EntityRuler

text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032"
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"\d{3}[^\w]\d{2}[^\w]\d{4}"}}]}])
nlp.add_pipe(ruler)
doc = nlp(text)
for ent in doc.ents:
    print("{} {}".format(ent.text, ent.label_))
python-3.x spacy named-entity-recognition
1 Answer

Actually, the SSN you have is split into 5 chunks by spaCy's tokenizer:

print([token.text for token in nlp("690-96-4032")])
# => ['690', '-', '96', '-', '4032']

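So, either use a custom tokenizer that does not split off the - between digits as a separate token, or, more simply, create a pattern for the 5 consecutive tokens: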

patterns = [{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{2}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4}$"}} ]}]
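To make the failure of the original single-token pattern concrete, here is a minimal sketch in plain Python (no spaCy involved). It assumes the token list printed above and assumes that a token-level REGEX is tested against one token's text at a time, which is why it uses re.search on each token separately:

```python
import re

# The single-token regex from the question.
ssn_regex = re.compile(r"\d{3}[^\w]\d{2}[^\w]\d{4}")

# Tokens as reported above for "690-96-4032".
tokens = ["690", "-", "96", "-", "4032"]

# The regex does match the raw, untokenized string...
matches_whole_string = bool(ssn_regex.search("690-96-4032"))

# ...but no individual token contains the full number,
# so a one-token pattern has nothing to match against.
matches_any_token = any(ssn_regex.search(tok) for tok in tokens)

print(matches_whole_string)  # True
print(matches_any_token)     # False
```

So the regex itself can be correct while the pattern never fires: the check happens per token, not on the whole text.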

Full spaCy demo:

import spacy
from spacy.pipeline import EntityRuler

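nlp = spacy.load("en_core_web_sm")
# Note: this is the spaCy v2 API. In spaCy v3, the component is created with
# ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True}) instead.
ruler = EntityRuler(nlp, overwrite_ents=True)
patterns = [{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{2}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4}$"}} ]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)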

text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032"
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# => [('605', 'CARDINAL'), ('690-96-4032', 'SSN')]

So, {"TEXT": {"REGEX": r"^\d{3}$"}} matches a token that consists of exactly three digits, {"TEXT": "-"} matches a - character, and so on.
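The way a five-token pattern lines up against the token stream can be sketched in plain Python. This is only an illustration, not spaCy's matcher: token_checks and find_ssn_spans are hypothetical names, and the token list is typed in by hand rather than produced by a tokenizer.

```python
import re

# Hypothetical stand-in for the five-token pattern: one check per token.
token_checks = [
    lambda t: re.search(r"^\d{3}$", t) is not None,  # exactly three digits
    lambda t: t == "-",                               # literal hyphen
    lambda t: re.search(r"^\d{2}$", t) is not None,  # exactly two digits
    lambda t: t == "-",                               # literal hyphen
    lambda t: re.search(r"^\d{4}$", t) is not None,  # exactly four digits
]

def find_ssn_spans(tokens):
    """Return (start, end) token index pairs where all five checks pass in order."""
    n = len(token_checks)
    spans = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(check(tok) for check, tok in zip(token_checks, window)):
            spans.append((start, start + n))
    return spans

# Hand-written tokens for the end of the example sentence.
tokens = ["My", "social", "security", "690", "-", "96", "-", "4032"]
spans = find_ssn_spans(tokens)
print(spans)                                      # [(3, 8)]
print(["".join(tokens[s:e]) for s, e in spans])   # ['690-96-4032']

# Without the ^ and $ anchors, a longer digit run would also pass the first check.
print(re.search(r"\d{3}", "12345") is not None)    # True
print(re.search(r"^\d{3}$", "12345") is not None)  # False
```

The last two lines show why the anchors are there: an unanchored \d{3} would also accept a token such as 12345, because it only needs to find three digits somewhere inside it.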
