I have a corpus of sentences preprocessed with Stanford's CoreNLP system. One of the things it provides is the (constituency-based) parse tree of each sentence. While I can understand a parse tree when it is drawn (as a tree), I am not sure how to read it in this format:
For example:
(ROOT
(FRAG
(NP (NN sent28))
(: :)
(S
(NP (NNP Rome))
(VP (VBZ is)
(PP (IN in)
(NP
(NP (NNP Lazio) (NN province))
(CC and)
(NP
(NP (NNP Naples))
(PP (IN in)
(NP (NNP Campania))))))))
(. .)))
The original sentence is:
sent28: Rome is in Lazio province and Naples in Campania .
How should I read this tree, or is there code (in Python) that can do it properly? Thanks.
NLTK has a class for reading parse trees: nltk.tree.Tree. The relevant method is called fromstring. You can then iterate over its subtrees, leaves, and so on.
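As a minimal sketch (assuming NLTK is installed, e.g. via pip install nltk), you can parse the bracketed string from the question and walk the resulting tree:

```python
from nltk.tree import Tree

# Parse the bracketed string produced by CoreNLP
tree = Tree.fromstring("""(ROOT
  (FRAG
    (NP (NN sent28))
    (: :)
    (S
      (NP (NNP Rome))
      (VP (VBZ is)
        (PP (IN in)
          (NP
            (NP (NNP Lazio) (NN province))
            (CC and)
            (NP
              (NP (NNP Naples))
              (PP (IN in)
                (NP (NNP Campania))))))))
    (. .)))""")

# The tokens of the sentence, in order
print(tree.leaves())
# The (word, POS-tag) pairs
print(tree.pos())
# Draw the tree as ASCII art in the terminal
tree.pretty_print()
```

tree.leaves() gives the surface tokens, tree.pos() pairs each token with its part-of-speech tag, and tree.subtrees() lets you iterate over every constituent.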
By the way: you may want to remove the "sent28 :" part, since it confuses the parser (it is also not part of the sentence). Note that you did not get a full parse tree, only a sentence fragment (FRAG).
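To drop the "sent28 :" fragment programmatically, one option (a sketch using NLTK, with the labels taken from the tree above) is to pull out the embedded S subtree and read only its leaves:

```python
from nltk.tree import Tree

tree = Tree.fromstring(
    "(ROOT (FRAG (NP (NN sent28)) (: :) "
    "(S (NP (NNP Rome)) (VP (VBZ is) (PP (IN in) (NP (NP (NNP Lazio) (NN province)) "
    "(CC and) (NP (NP (NNP Naples)) (PP (IN in) (NP (NNP Campania)))))))) (. .)))"
)

# Find the first S node inside the FRAG and keep only its tokens
s_node = next(tree.subtrees(filter=lambda t: t.label() == "S"))
print(" ".join(s_node.leaves()))
```

This skips both the "sent28 :" prefix and the trailing period, leaving just the actual sentence content.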
I know this post is old, but I believe my solution may be relevant to others as well.
I wrote a library called Constituent Treelib that provides a convenient way to parse sentences into constituent trees, modify them according to their structure, visualize them, and export them to various file formats. In addition, you can extract phrases by phrase category (which can, for example, be used as features for various NLP tasks), validate already-parsed sentences in bracket notation, or convert them back into sentences. The latter is what the OP asked for. Here are the steps to achieve this:
First, install the library via:
pip install constituent-treelib
Next, load the necessary components from the library and create the constituent tree for the given sentence from its bracketed tree representation:
from constituent_treelib import ConstituentTree, BracketedTree, Language
# Define the language for the sentence as well as for the spaCy and benepar models
language = Language.English
# Define which specific SpaCy model should be used (default is Medium)
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Create the pipeline (note, the required models will be downloaded and installed automatically)
nlp = ConstituentTree.create_pipeline(language, spacy_model_size)
# Your sentence
bracketed_tree_string = """(ROOT
(FRAG
(NP (NN sent28))
(: :)
(S
(NP (NNP Rome))
(VP (VBZ is)
(PP (IN in)
(NP
(NP (NNP Lazio) (NN province))
(CC and)
(NP
(NP (NNP Naples))
(PP (IN in)
(NP (NNP Campania))))))))
(. .)))""".splitlines()
bracketed_tree_string = " ".join(bracketed_tree_string)
sentence = BracketedTree(bracketed_tree_string)
# Create the tree from where we are going to extract the desired noun phrases
tree = ConstituentTree(sentence, nlp)
Finally, we recover the original sentence from the constituent tree with:
tree.leaves(tree.nltk_tree, ConstituentTree.NodeContent.Text)
Result:
'sent28 : Rome is in Lazio province and Naples in Campania .'
You can use the Stanford parser, for example:
# raw_parse_sents takes a list of raw sentence strings; use raw_parse for a
# single string, or parse_sents for sentences that are already tokenized
sentences = parser.raw_parse_sents(["Hello, My name is Melroy.", "What is your name?"])
for line in sentences:
    for sentence in line:
        sentence.draw()