How to get all noun phrases in spaCy

Question

I am new to spaCy and want to extract "all" the noun phrases from a sentence. How can I do that? I have the following code:

import spacy

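nlp = spacy.load("en_core_web_sm")  # spaCy v3 removed the "en" shortcut

# Read the input text; the file is closed once it has been read
with open("E:/test.txt", "r") as file:
    doc = nlp(file.read())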
for np in doc.noun_chunks:
    print(np.text)

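But it returns only base noun phrases, that is, phrases that do not contain any other NP. For the following sentence, I get the following result:

Sentence: We try to explicitly describe the geometry of the edges of the images.

Result: We, the geometry, the edges, the images.

Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.

How can I get all noun phrases, including nested ones?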

Answer 1

See the commented code below, which recursively combines the nouns. The code is inspired by the spaCy docs.

import spacy

nlp = spacy.load("en_core_web_sm")  # spaCy v3 removed the "en" shortcut

doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

for np in doc.noun_chunks: # use np instead of np.text
    print(np)

print()

# code to recursively combine nouns
# 'We' is actually a pronoun but included in your question
# hence the token.pos_ == "PRON" part in the last if statement
# suggest you extract PRON separately like the noun-chunks above

# collect the position of every token tagged as a noun
nounIndices = [token.i for token in doc if token.pos_ == 'NOUN']


print(nounIndices)
for idxValue in nounIndices:
    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
    span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i+1]
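    # span.merge() was removed in spaCy v3; merge through the retokenizer instead
    with doc.retokenize() as retokenizer:
        retokenizer.merge(span)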

    for token in doc:
        if token.dep_ == 'dobj' or token.dep_ == 'pobj' or token.pos_ == "PRON":
            print(token.text)

Answer 2

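For each noun chunk, you can also get the subtree beneath it. spaCy provides two ways to access it: the left_edge and right_edge attributes, and the subtree attribute, which returns an iterator of Token objects rather than a span. Combining noun_chunks with their subtrees produces some duplicates, which can be removed afterwards.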
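
The approach described above might be sketched as follows. This is only an illustration, not the answerer's original code: the function name all_noun_phrases is made up here, and the dependency parse is written out by hand (an assumption, a trained pipeline may parse the sentence differently) so that the snippet runs on spaCy v3 without downloading a model.

```python
import spacy
from spacy.tokens import Doc


def all_noun_phrases(doc):
    """Return each base noun chunk plus the span covering its root's subtree."""
    seen = set()
    phrases = []
    for chunk in doc.noun_chunks:
        root = chunk.root
        # Span from the leftmost to the rightmost descendant of the head noun
        subtree = doc[root.left_edge.i : root.right_edge.i + 1]
        for span in (chunk, subtree):
            key = (span.start, span.end)
            if key not in seen:  # skip the duplicates mentioned in the answer
                seen.add(key)
                phrases.append(span.text)
    return phrases


# Hand-annotated parse of the question's sentence (assumed, not model output).
words = ["We", "try", "to", "explicitly", "describe", "the", "geometry",
         "of", "the", "edges", "of", "the", "images", "."]
pos = ["PRON", "VERB", "PART", "ADV", "VERB", "DET", "NOUN",
       "ADP", "DET", "NOUN", "ADP", "DET", "NOUN", "PUNCT"]
# heads are absolute token indices, as the spaCy v3 Doc constructor expects
heads = [1, 1, 4, 4, 1, 6, 4, 6, 9, 7, 9, 12, 10, 1]
deps = ["nsubj", "ROOT", "aux", "advmod", "xcomp", "det", "dobj",
        "prep", "det", "pobj", "prep", "det", "pobj", "punct"]

vocab = spacy.blank("en").vocab  # English vocab supplies the noun_chunks iterator
doc = Doc(vocab, words=words, pos=pos, heads=heads, deps=deps)

phrases = all_noun_phrases(doc)
for phrase in phrases:
    print(phrase)
```

With a trained pipeline, the hand-built Doc would simply be replaced by doc = nlp(text). Deduplicating on the (start, end) token offsets rather than on the text keeps two separate occurrences of the same phrase distinct.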
