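I have used the EntityRuler to add a new label for social security numbers. I even set overwrite_ents=True, but the SSN is still not recognized.
I have confirmed that the regular expression is correct, and I don't know what else I need to do. I also tried before="ner", but the result is the same.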
import spacy
from spacy.pipeline import EntityRuler
text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032"
nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns([{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"\d{3}[^\w]\d{2}[^\w]\d{4}"}}]}])
nlp.add_pipe(ruler)
doc = nlp(text)
for ent in doc.ents:
    print("{} {}".format(ent.text, ent.label_))
Actually, the SSN you have is split into 5 tokens by spaCy's tokenizer:
print([token.text for token in nlp("690-96-4032")])
# => ['690', '-', '96', '-', '4032']
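To see why the original pattern never fires, here is a minimal sketch in plain Python that does not use spaCy itself. It assumes that a REGEX condition in a token pattern is tested against the text of one token at a time, and it reuses the token list printed above:

```python
import re

# Tokens as shown in the tokenizer output above.
tokens = ["690", "-", "96", "-", "4032"]

# The regex from the question, which expects the whole SSN in a single string.
ssn_regex = re.compile(r"\d{3}[^\w]\d{2}[^\w]\d{4}")

# Token-level matching: each token is tested on its own.
per_token_hits = [tok for tok in tokens if ssn_regex.search(tok)]
print(per_token_hits)  # => []

# The same regex does match the untokenized string.
print(bool(ssn_regex.search("690-96-4032")))  # => True
```

The regex itself is fine, but no single token ever contains the full number, so a one-token pattern has nothing to match.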
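So either use a custom tokenizer that does not split the - between digits into a separate token or, more simply, create a pattern for the five consecutive tokens: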
patterns = [{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{2}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4}$"}} ]}]
Full spaCy demo:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load("en_core_web_sm")
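# spaCy v2 style. In spaCy v3, create and add the component in one step with
# ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
# and drop the separate nlp.add_pipe(ruler) call.
ruler = EntityRuler(nlp, overwrite_ents=True)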
patterns = [{"label": "SSN", "pattern": [{"TEXT": {"REGEX": r"^\d{3}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{2}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4}$"}} ]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
text = "My name is yuyyvb and I leave on 605 W Clinton Street. My social security 690-96-4032"
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# => [('605', 'CARDINAL'), ('690-96-4032', 'SSN')]
So {"TEXT": {"REGEX": r"^\d{3}$"}} matches a token that consists of exactly three digits, {"TEXT": "-"} matches a - character, and so on.
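The way a five-token pattern lines up against a token sequence can be sketched in plain Python. This is only an illustration of the idea, not spaCy's actual matcher: the helper find_ssn_spans and the token_checks list are hypothetical names, and the token list is written by hand rather than produced by a tokenizer.

```python
import re

# One check per token, mirroring the five-token pattern (hypothetical stand-in).
token_checks = [
    lambda t: re.search(r"^\d{3}$", t) is not None,  # exactly three digits
    lambda t: t == "-",                              # literal hyphen
    lambda t: re.search(r"^\d{2}$", t) is not None,  # exactly two digits
    lambda t: t == "-",                              # literal hyphen
    lambda t: re.search(r"^\d{4}$", t) is not None,  # exactly four digits
]

def find_ssn_spans(tokens):
    """Return (start, end) token index pairs where all five checks pass in order."""
    n = len(token_checks)
    spans = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(check(tok) for check, tok in zip(token_checks, window)):
            spans.append((start, start + n))
    return spans

# Hand-written token list, assumed to resemble the tokenizer output.
tokens = ["My", "social", "security", "690", "-", "96", "-", "4032"]
spans = find_ssn_spans(tokens)
print(spans)  # => [(3, 8)]
print("".join(tokens[3:8]))  # => 690-96-4032
```

Each dictionary in the pattern constrains exactly one token, which is why the pattern needs five entries to cover the five tokens of the SSN.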