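I am using NLP with Python to find names in a string. If the string contains a full name (first name and last name), my code finds it, but if the string contains only a first name, my code does not recognize it as a PERSON. Below is my code.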
import nltk
from nltk.corpus import stopwords

# Requires the NLTK data for stopwords, tokenization, POS tagging and NE chunking.
stop = stopwords.words('english')

string = """
Sriram is working as a python developer
"""

def ie_preprocess(document):
    # Drop stopwords, then split into sentences, tokenize and POS-tag each one.
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    #print(sentences)
    for tagged_sentence in sentences:
        # ne_chunk yields plain (word, tag) tuples and Tree nodes for named entities.
        for chunk in nltk.ne_chunk(tagged_sentence):
            #print("Out Side ",chunk)
            if isinstance(chunk, nltk.tree.Tree):
                if chunk.label() == 'PERSON':
                    print("In Side ",chunk)
                    names.append(' '.join([c[0] for c in chunk]))
    return names

if __name__ == '__main__':
    names = extract_names(string)
    print(names)
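My suggestion is to use StanfordNLP or spaCy NER; working with the nltk ne_chunk output is a bit cumbersome. Researchers more commonly use StanfordNLP, but spaCy is easier to use. Here is an example that uses spaCy to print the text and type of each named entity: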
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> text = 'Sriram is working as a python developer'
>>> doc = nlp(text)
>>> for ent in doc.ents:
...     print(ent.text, ent.label_)
...
Sriram ORG
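Note that it classifies Sriram as an organization (ORG), probably because it is not a common English name and spaCy's model was trained on English corpora. Good luck!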
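If the statistical model keeps mislabeling or missing a lone first name, one possible workaround is a simple lookup against a list of known first names. The following is only a minimal sketch using the standard library: `KNOWN_FIRST_NAMES` and `find_first_names` are hypothetical names made up for illustration, and the tiny name set is a placeholder for whatever name list suits your data. It is not part of nltk or spaCy.

```python
import re

# Hypothetical gazetteer of first names (assumption: in practice you would
# load a much larger list that matches the names in your own data).
KNOWN_FIRST_NAMES = {"sriram", "john", "mary"}

def find_first_names(text, known_names=KNOWN_FIRST_NAMES):
    """Return capitalized tokens in `text` that appear in `known_names`."""
    # Extract alphabetic word tokens, e.g. "Sriram", "python".
    tokens = re.findall(r"[A-Za-z]+", text)
    # Keep a token only if it starts with an uppercase letter and its
    # lowercase form is in the name list.
    return [t for t in tokens if t[0].isupper() and t.lower() in known_names]

if __name__ == "__main__":
    print(find_first_names("Sriram is working as a python developer"))
```

A lookup like this trades recall for predictability: it only finds names that are in the list, so it probably works best as a fallback alongside the NER output rather than as a replacement for it.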