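I am trying to build a tree (a nested dictionary) from the output of a dependency parser. The sentence is "I shot an elephant in my sleep". I was able to get the output described at this link: How do I do dependency parsing in NLTK?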
nsubj(shot-2, I-1)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)
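As a minimal sketch, lines in the `relation(governor-index, dependent-index)` form shown above could be turned into `(relation, governor, dependent)` tuples with a regular expression. The helper name `parse_dependency_lines` and the pattern are assumptions made for illustration, not part of NLTK. Note that this output contains no ROOT line, so a root tuple would have to be added separately.

```python
import re

# Matches lines such as "nsubj(shot-2, I-1)":
# relation(governor-index, dependent-index).
DEP_LINE = re.compile(r"^([\w:]+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency_lines(text):
    """Return a list of (relation, governor, dependent) tuples (hypothetical helper)."""
    triples = []
    for line in text.strip().splitlines():
        match = DEP_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a dependency line
        rel, gov, _gov_idx, dep, _dep_idx = match.groups()
        triples.append((rel, gov, dep))
    return triples


parser_output = """
nsubj(shot-2, I-1)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)
"""

list_of_tuples = parse_dependency_lines(parser_output)
print(list_of_tuples)
```

The token indices are dropped here for simplicity; keeping them would be safer for sentences in which the same word occurs twice.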
To convert this list of tuples into a nested dictionary, I used the following link: How to convert python list of tuples into tree?
def build_tree(list_of_tuples):
    # Map each dependent word to ((relation, governor), children).
    all_nodes = {n[2]: ((n[0], n[1]), {}) for n in list_of_tuples}
    root = {}
    print(all_nodes)
    for item in list_of_tuples:
        rel, gov, dep = item
        # Compare strings with !=; `is not` tests identity, not equality.
        if gov != 'ROOT':
            all_nodes[gov][1][dep] = all_nodes[dep]
        else:
            root[dep] = all_nodes[dep]
    return root
This gives the following output:
{'shot': (('ROOT', 'ROOT'),
{'I': (('nsubj', 'shot'), {}),
'elephant': (('dobj', 'shot'), {'an': (('det', 'elephant'), {})}),
'sleep': (('nmod', 'shot'),
{'in': (('case', 'sleep'), {}), 'my': (('nmod:poss', 'sleep'), {})})})}
To find the root-to-leaf path, I used the following link: Return root to specific leaf from a nested dictionary tree
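[Building the tree and finding the path are two separate things.] My second goal is to find the path from the root to a leaf node, as done in Return root to specific leaf from a nested dictionary tree. But I want the root-to-leaf path expressed as dependency relations. So, for example, when I call recurse_category(categories, 'an'), where categories is the nested tree structure and 'an' is a word in the tree, I should get ROOT-nsubj-dobj (the dependency relations up to the root) as output.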
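A minimal sketch of such a lookup on the `{word: ((relation, governor), children)}` structure returned by build_tree might look like the following. The function name `relation_path` is hypothetical. For the tree shown above, the relations on the path down to 'an' are ROOT, dobj, det.

```python
def relation_path(tree, word):
    """Return the relations from the root down to `word`, or None if absent.

    `tree` maps word -> ((relation, governor), children), the shape produced
    by build_tree. `relation_path` is a hypothetical helper name.
    """
    for name, ((rel, _gov), children) in tree.items():
        if name == word:
            return [rel]
        sub_path = relation_path(children, word)
        if sub_path is not None:
            return [rel] + sub_path
    return None


tree = {'shot': (('ROOT', 'ROOT'),
                 {'I': (('nsubj', 'shot'), {}),
                  'elephant': (('dobj', 'shot'),
                               {'an': (('det', 'elephant'), {})}),
                  'sleep': (('nmod', 'shot'),
                            {'in': (('case', 'sleep'), {}),
                             'my': (('nmod:poss', 'sleep'), {})})})}

print('-'.join(relation_path(tree, 'an')))  # ROOT-dobj-det
```

Returning None for a missing word keeps "not found" distinguishable from an empty path.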
The following converts the output into nested-dictionary form. I will post an update if I manage to find the path. Maybe this helps.
list_of_tuples = [('ROOT', 'ROOT', 'shot'), ('nsubj', 'shot', 'I'), ('det', 'elephant', 'an'),
                  ('dobj', 'shot', 'elephant'), ('case', 'sleep', 'in'),
                  ('nmod:poss', 'sleep', 'my'), ('nmod', 'shot', 'sleep')]

# First pass: create one node per dependent word.
nodes = {}
for rel, parent, child in list_of_tuples:
    nodes[child] = {'Name': child, 'Relationship': rel}

# Second pass: attach each node to its parent, or to the forest if it is the root.
forest = []
for rel, parent, child in list_of_tuples:
    node = nodes[child]
    if parent == 'ROOT':  # this should be the root node
        forest.append(node)
    else:
        parent_node = nodes[parent]
        parent_node.setdefault('children', []).append(node)

print(forest)
The output is a list of nested dictionaries:
[{'Name': 'shot', 'Relationship': 'ROOT',
  'children': [{'Name': 'I', 'Relationship': 'nsubj'},
               {'Name': 'elephant', 'Relationship': 'dobj',
                'children': [{'Name': 'an', 'Relationship': 'det'}]},
               {'Name': 'sleep', 'Relationship': 'nmod',
                'children': [{'Name': 'in', 'Relationship': 'case'},
                             {'Name': 'my', 'Relationship': 'nmod:poss'}]}]}]
The following function can help you find the root-to-leaf path:
def recurse_category(categories, to_find):
    for category in categories:
        if category['Name'] == to_find:
            return True, [category['Relationship']]
        if 'children' in category:
            found, path = recurse_category(category['children'], to_find)
            if found:
                return True, [category['Relationship']] + path
    return False, []
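As a quick usage sketch, the function can be called on the forest built above and the relations joined with hyphens. The forest literal is repeated here only so that the example is self-contained.

```python
def recurse_category(categories, to_find):
    """Depth-first search returning (found, list of relations from the root)."""
    for category in categories:
        if category['Name'] == to_find:
            return True, [category['Relationship']]
        if 'children' in category:
            found, path = recurse_category(category['children'], to_find)
            if found:
                return True, [category['Relationship']] + path
    return False, []


forest = [{'Name': 'shot', 'Relationship': 'ROOT',
           'children': [{'Name': 'I', 'Relationship': 'nsubj'},
                        {'Name': 'elephant', 'Relationship': 'dobj',
                         'children': [{'Name': 'an', 'Relationship': 'det'}]},
                        {'Name': 'sleep', 'Relationship': 'nmod',
                         'children': [{'Name': 'in', 'Relationship': 'case'},
                                      {'Name': 'my', 'Relationship': 'nmod:poss'}]}]}]

found, path = recurse_category(forest, 'an')
print(found, '-'.join(path))  # True ROOT-dobj-det
```

A word that is not in the tree yields `(False, [])`, so the boolean should be checked before the path is used.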
First, if you are just using the pre-trained Stanford CoreNLP dependency parser model, you should use CoreNLPDependencyParser from nltk.parse.corenlp and avoid the old nltk.parse.stanford interface.
After downloading the Java server and starting it in a terminal, run the following in Python:
>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> sent = "I shot an elephant with a banana .".split()
>>> parses = list(dep_parser.parse(sent))
>>> type(parses[0])
<class 'nltk.parse.dependencygraph.DependencyGraph'>
Now we see that each parse is of type DependencyGraph from nltk.parse.dependencygraph (https://github.com/nltk/nltk/blob/develop/nltk/parse/dependencygraph.py#L36).
To convert the DependencyGraph to an nltk.tree.Tree object, simply call DependencyGraph.tree():
>>> parses[0].tree()
Tree('shot', ['I', Tree('elephant', ['an']), Tree('banana', ['with', 'a']), '.'])
>>> parses[0].tree().pretty_print()
         shot
  ________|____________
 |   |  elephant     banana
 |   |     |       _____|_____
 I   .     an    with         a
To convert it to the bracketed parse format:
>>> print(parses[0].tree())
(shot I (elephant an) (banana with a) .)
If you are looking for dependency triples:
>>> [(governor, dep, dependent) for governor, dep, dependent in parses[0].triples()]
[(('shot', 'VBD'), 'nsubj', ('I', 'PRP')), (('shot', 'VBD'), 'dobj', ('elephant', 'NN')), (('elephant', 'NN'), 'det', ('an', 'DT')), (('shot', 'VBD'), 'nmod', ('banana', 'NN')), (('banana', 'NN'), 'case', ('with', 'IN')), (('banana', 'NN'), 'det', ('a', 'DT')), (('shot', 'VBD'), 'punct', ('.', '.'))]
>>> for governor, dep, dependent in parses[0].triples():
... print(governor, dep, dependent)
...
('shot', 'VBD') nsubj ('I', 'PRP')
('shot', 'VBD') dobj ('elephant', 'NN')
('elephant', 'NN') det ('an', 'DT')
('shot', 'VBD') nmod ('banana', 'NN')
('banana', 'NN') case ('with', 'IN')
('banana', 'NN') det ('a', 'DT')
('shot', 'VBD') punct ('.', '.')
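To connect this back to the nested-dictionary question, here is a minimal sketch that turns triples of this shape into the same kind of forest. The helper name `triples_to_tree` and the 'Tag' key are assumptions made for illustration. Because the triples carry no token indices, the sketch also assumes that every word occurs only once in the sentence.

```python
def triples_to_tree(triples):
    """Build nested dict nodes from ((gov, tag), rel, (dep, tag)) triples.

    Hypothetical helper; assumes each word appears only once in the sentence.
    """
    nodes = {}
    has_parent = set()

    def get_node(word, tag):
        # Create the node on first sight; a node with no incoming edge stays ROOT.
        return nodes.setdefault(
            word,
            {'Name': word, 'Tag': tag, 'Relationship': 'ROOT', 'children': []})

    for (gov_word, gov_tag), rel, (dep_word, dep_tag) in triples:
        gov = get_node(gov_word, gov_tag)
        dep = get_node(dep_word, dep_tag)
        dep['Relationship'] = rel
        gov['children'].append(dep)
        has_parent.add(dep_word)

    # Roots are the nodes that never appeared as a dependent.
    return [node for word, node in nodes.items() if word not in has_parent]


triples = [(('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
           (('shot', 'VBD'), 'dobj', ('elephant', 'NN')),
           (('elephant', 'NN'), 'det', ('an', 'DT')),
           (('shot', 'VBD'), 'nmod', ('banana', 'NN')),
           (('banana', 'NN'), 'case', ('with', 'IN')),
           (('banana', 'NN'), 'det', ('a', 'DT')),
           (('shot', 'VBD'), 'punct', ('.', '.'))]

forest = triples_to_tree(triples)
print(forest[0]['Name'], [child['Name'] for child in forest[0]['children']])
```

For sentences with repeated words, working from the CoNLL output is likely safer, since it identifies each token by its index.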
In CoNLL format:
>>> print(parses[0].to_conll(style=10))
1 I I PRP PRP _ 2 nsubj _ _
2 shot shoot VBD VBD _ 0 ROOT _ _
3 an a DT DT _ 4 det _ _
4 elephant elephant NN NN _ 2 dobj _ _
5 with with IN IN _ 7 case _ _
6 a a DT DT _ 7 det _ _
7 banana banana NN NN _ 2 nmod _ _
8 . . . . _ 2 punct _ _
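The 10-column output above also makes the root-to-leaf relation path straightforward to compute, since each row names the index of its head. The sketch below is one possible approach and not part of NLTK. The helper names `read_conll` and `relation_path_to_root` are hypothetical, and the code assumes whitespace-separated columns with the head index in column 7 and the relation in column 8, as in the listing above.

```python
conll = """\
1 I I PRP PRP _ 2 nsubj _ _
2 shot shoot VBD VBD _ 0 ROOT _ _
3 an a DT DT _ 4 det _ _
4 elephant elephant NN NN _ 2 dobj _ _
5 with with IN IN _ 7 case _ _
6 a a DT DT _ 7 det _ _
7 banana banana NN NN _ 2 nmod _ _
8 . . . . _ 2 punct _ _
"""


def read_conll(text):
    """Map token index -> (word, head index, relation). Hypothetical helper."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split()
        rows[int(fields[0])] = (fields[1], int(fields[6]), fields[7])
    return rows


def relation_path_to_root(rows, idx):
    """Follow head links from token `idx` up to the root (head index 0).

    Assumes a well-formed tree; a cyclic input would loop forever.
    """
    path = []
    while idx != 0:
        _word, head, rel = rows[idx]
        path.append(rel)
        idx = head
    return path[::-1]  # reverse so the path reads root-to-leaf


rows = read_conll(conll)
print('-'.join(relation_path_to_root(rows, 3)))  # ROOT-dobj-det
```

Working with token indices rather than word forms avoids ambiguity when a word appears more than once in the sentence.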