Removing named entities from a spaCy object

Question

I am trying to remove named entities from a document using spaCy. Identifying the named entities is no trouble at all, using this code:

ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
print(persons)

Output:

'Timothy D. Cook',
 'Peter',
 'Peter',
 'Benjamin A. Reitzes',
 'Timothy D. Cook',
 'Steve Milunovich',
 'Steven Mark Milunovich',
 'Peter',
 'Luca Maestri'

But then I tried to actually remove them from the document with this block:

text_no_namedentities = []

ents = [e.text for e in doc.ents]
for item in doc:
    if item.text in ents:
        pass
    else:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

This does not work because the NEs are n-grams. If I inspect a small part of the spaCy object, it looks like this:

for item in doc:
    print(item.text)

iPad
has
a
78
%
Steve
Milunovich
share
of
the
U.S.
commercial
tablet
market

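So the spaCy object has already been tokenized, which is why the code above cannot remove the NEs: each multi-word entity text is compared against single tokens and never matches. Any ideas on how to remove all named entities from the object?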

Answer 1

The condition you want to check is:

if item.ent_type:

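This evaluates to True if item (the token) is part of a named entity. token.ent_type is the hash ID of the entity's actual type; to get the type as a string, use token.ent_type_ (note the trailing _).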

Here is the code I would use:

    text_no_namedentities = ""
    for token in doc:
        if not token.ent_type:
            text_no_namedentities += token.text
            if token.whitespace_:
                text_no_namedentities += " "

Note that token.whitespace_ tells you whether the token was followed by a space in the original sentence.
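As a rough, self-contained sketch of the same idea, the loop can be wrapped in a function. The names strip_entities, labels, and demo are hypothetical and not from the original answer. It relies on token.ent_type_ (an empty string for tokens outside any entity) and token.text_with_ws (the token text plus its trailing whitespace), and demo assumes that spaCy and the en_core_web_sm model are installed:

```python
def strip_entities(doc, labels=None):
    """Return the text of `doc` with named-entity tokens removed.

    `doc` is expected to be a spaCy Doc. If `labels` is given
    (e.g. {"PERSON"}), only entities with those labels are removed;
    otherwise every entity token is dropped.
    """
    kept = []
    for token in doc:
        # ent_type_ is "" for tokens that are not part of any entity.
        if token.ent_type_ and (labels is None or token.ent_type_ in labels):
            continue  # skip each token of a matching entity, one by one
        # text_with_ws is the token text plus its original trailing whitespace.
        kept.append(token.text_with_ws)
    return "".join(kept)


def demo():
    # Assumption: spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("iPad has a 78% share, said Steve Milunovich.")
    print(strip_entities(doc, labels={"PERSON"}))
```

Because the check is made per token, multi-word entities such as 'Steve Milunovich' are removed without any string matching. Calling demo() runs the function on a real Doc; which spans are removed depends on what the loaded model recognizes.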

For more information, see the spaCy documentation on Token.

FYI: in future questions, it would be more helpful to include a minimal working snippet of code rather than just parts of it.

Answer 2

You can use spaCy functions and a list comprehension to convert the document to a list of strings and then back into a document:

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp('John and Jim are my favorite Google employees.')
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
doc_anon = nlp(" ".join([d for d in doc.text.split() if d not in persons]))
print(doc_anon)

This gives you an anonymized Doc object:

and are my favorite Google employees.
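One caveat: splitting on whitespace only matches names that are a single word with no attached punctuation, so an entity like 'Steve Milunovich' or a name followed by a comma would be left in. A possible workaround, sketched here under assumptions, is to cut the entities out by character offset using ent.start_char and ent.end_char. The helper remove_spans and the demo function are hypothetical names, and demo assumes spaCy and the en_core_web_sm model are installed:

```python
def remove_spans(text, spans):
    """Cut the half-open character ranges (start, end) out of `text`."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # Keep the text between the previous span and this one.
        pieces.append(text[cursor:max(cursor, start)])
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    # Collapse the whitespace runs left behind by the removed spans.
    return " ".join("".join(pieces).split())


def demo():
    # Assumption: spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("John Smith and Jim are my favorite Google employees.")
    spans = [(ent.start_char, ent.end_char)
             for ent in doc.ents if ent.label_ == "PERSON"]
    doc_anon = nlp(remove_spans(doc.text, spans))
    print(doc_anon)
```

Note that the final whitespace normalization also flattens newlines and repeated spaces in the rest of the text, which may or may not be acceptable for a given document.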
