I am using the Python package ete3. I have the following tree:
((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);
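I would like to find the leaves most closely related to a specific node in the tree (here, Species1_order1). In this example, the most closely related leaves are Species2_order2 / Species3_order2 and Species4_order3 / Species5_order5.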
Code:
tree = ete3.Tree('((Species1_order1, \
(Species2_order2, Species3_order2)), \
Species4_order3, Species5_order5);')
New example:
tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')
The result I get is:
A B C D E F G H I
A 0.0 2.0 3.0 4.0 6.0 6.0 6.0 8.0 8.0
B 2.0 0.0 3.0 4.0 6.0 6.0 6.0 8.0 8.0
C 3.0 3.0 0.0 3.0 5.0 5.0 5.0 7.0 7.0
D 4.0 4.0 3.0 0.0 4.0 4.0 4.0 6.0 6.0
E 6.0 6.0 5.0 4.0 0.0 2.0 4.0 6.0 6.0
F 6.0 6.0 5.0 4.0 2.0 0.0 4.0 6.0 6.0
G 6.0 6.0 5.0 4.0 4.0 4.0 0.0 4.0 4.0
H 8.0 8.0 7.0 6.0 6.0 6.0 4.0 0.0 2.0
I 8.0 8.0 7.0 6.0 6.0 6.0 4.0 2.0 0.0
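But, for example, E and F are at equal distance from A, B, C and D in the tree, whereas in this matrix they appear to be closer to D.
A good matrix result would be: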
A B C D E F G H I
A 0 1 2 3 4 4 5 6 6
B 1 0 2 3 4 4 5 6 6
C 2 2 0 3 4 4 5 6 6
D 3 3 3 0 4 4 5 6 6
E 4 4 4 4 0 1 5 6 6
F 4 4 4 4 1 0 5 6 6
G 5 5 5 5 5 5 0 6 6
H 6 6 6 6 6 6 6 0 1
I 6 6 6 6 6 6 6 1 0
Wouldn't it?
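One way to read the expected matrix: each entry equals the height of the most recent common ancestor (MRCA) of the two leaves, where height is the number of edges down to the deepest leaf below that node. This is an interpretation inferred from the numbers, not a rule stated in the question. The sketch below uses nested tuples as a stand-in for the ete3 tree, and the function names are hypothetical.

```python
from itertools import combinations

def leaf_names(node):
    """Return the leaf names under a nested-tuple node (a str is a leaf)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaf_names(child)]

def height(node):
    """Number of edges from node down to its deepest leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node)

def mrca_height_matrix(tree):
    """Map each (leaf, leaf) pair to the height of their MRCA."""
    matrix = {(name, name): 0 for name in leaf_names(tree)}

    def visit(node):
        if isinstance(node, str):
            return
        groups = [leaf_names(child) for child in node]
        h = height(node)
        # Leaves under different children of this node have this node as MRCA.
        for left, right in combinations(groups, 2):
            for a in left:
                for b in right:
                    matrix[a, b] = matrix[b, a] = h
        for child in node:
            visit(child)

    visit(tree)
    return matrix

# Same topology as ((((((A,B),C),D),(E,F)),G),(H,I));
tree = (((((("A", "B"), "C"), "D"), ("E", "F")), "G"), ("H", "I"))
m = mrca_height_matrix(tree)
names = leaf_names(tree)
for a in names:
    print(a, *(m[a, b] for b in names))
```

Under this reading, E and F are both at distance 4 from each of A, B, C and D, which is the property the matrix above is meant to have.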
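As discussed in the comments, ete3 provides a function called Tree.get_closest_leaf, but its output is not what you would expect (and I'm not sure what the value represents here):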
>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)
Instead, you can get the node distances like this:
import ete3
import numpy as np
import pandas as pd

def make_matrix(tree):
    def get_root_path(node):
        # Collect the nodes from `node` up to and including the root.
        root_path = [node]
        if node.up:
            root_path.extend(get_root_path(node.up))
        return root_path

    leaves = tree.get_leaves()
    leaf_ct = len(leaves)
    # Root path of every leaf, stored as a set for fast set operations.
    paths = {node.name: set(get_root_path(node)) for node in leaves}
    col_lbls = [leaf.name for leaf in leaves]
    dist_matrix = np.zeros((leaf_ct, leaf_ct))
    df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
    for node1_name, col in df.items():
        for node2_name in col.keys():
            # Nodes on exactly one of the two root paths lie on the
            # path between the two leaves; sum their branch lengths.
            path = paths[node2_name].symmetric_difference(paths[node1_name])
            dist = sum(node.dist for node in path)
            df.at[node1_name, node2_name] = dist
            df.at[node2_name, node1_name] = dist
    return df
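Note: this is a suboptimal solution for several reasons, but the question isn't asking for the most efficient one. For more information on phylogenetic distance matrix methods, see this link.
This solution also uses pandas, which is overkill, since it is really only there for the convenience of row/column labels. It would not be difficult to remove the pandas dependency and use native lists instead.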
Here is the output of make_matrix for the tree from the question:
>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
Species1_order1 Species2_order2 Species3_order2 Species4_order3 Species5_order5
Species1_order1 0.0 3.0 3.0 3.0 3.0
Species2_order2 3.0 0.0 2.0 4.0 4.0
Species3_order2 3.0 2.0 0.0 4.0 4.0
Species4_order3 3.0 4.0 4.0 0.0 2.0
Species5_order5 3.0 4.0 4.0 2.0 0.0
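The remark about dropping pandas could look roughly like the sketch below. It is not the code that produced the output above. It only relies on the name, dist and up attributes that make_matrix already uses, and it includes a small hypothetical Node class so that it runs without ete3; make_matrix_lists and root_path are made-up names.

```python
class Node:
    """Hypothetical stand-in for an ete3 node: only name, dist and up are used."""
    def __init__(self, name="", dist=1.0, up=None):
        self.name, self.dist, self.up = name, dist, up

def root_path(node):
    """Collect the nodes from `node` up to and including the root."""
    path = set()
    while node is not None:
        path.add(node)
        node = node.up
    return path

def make_matrix_lists(leaves):
    """Return (labels, rows); rows[i][j] is the path length between leaves i and j."""
    labels = [leaf.name for leaf in leaves]
    paths = [root_path(leaf) for leaf in leaves]
    rows = [[0.0] * len(leaves) for _ in leaves]
    for i, path_i in enumerate(paths):
        for j in range(i + 1, len(paths)):
            # Nodes on exactly one of the two root paths lie between the leaves.
            dist = sum(n.dist for n in path_i ^ paths[j])
            rows[i][j] = rows[j][i] = dist
    return labels, rows

# Rebuild ((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);
# by hand, with every branch length left at 1.0.
root = Node()
inner = Node(up=root)
pair = Node(up=inner)
leaves = [
    Node("Species1_order1", up=inner),
    Node("Species2_order2", up=pair),
    Node("Species3_order2", up=pair),
    Node("Species4_order3", up=root),
    Node("Species5_order5", up=root),
]
labels, rows = make_matrix_lists(leaves)
for label, row in zip(labels, rows):
    print(label, row)
```

With a real ete3 tree, passing tree.get_leaves() to make_matrix_lists should work in the same way, since those nodes expose the same three attributes.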
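As for the posted update, I don't see anything wrong with it. It seems to give correct results. Here is the tree as rendered by ete3:
Here is the matrix column corresponding to Interest_sequence: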
>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera 4.0
Drosophila_novamexicana__Hemiptera 5.0
Drosophila_arizonae__Hemiptera 6.0
Drosophila_navojoa__Hemiptera 6.0
Interest_sequence 0.0
Heliothis_virescens_droso_3a__nan 5.0
Mythimna_separata_droso__nan 6.0
Heliothis_virescens_droso_3i__nan 6.0
Scaptodrosophila_lebanonensis__Diptera 5.0
Mythimna_unipuncta_droso_A__nan 6.0
Xestia_c-nigrum_droso__nan 8.0
Helicoverpa_armigera_droso__nan 8.0
Mocis_latipes_droso__nan 7.0
Drosophila_busckii__Diptera 4.0
Drosophila_bipectinata__Diptera 5.0
Drosophila_mojavensis__Diptera 7.0
Drosophila_yakuba__Diptera 7.0
Drosophila_hydei__Diptera 7.0
Drosophila_serrata__Diptera 8.0
Drosophila_takahashii__Diptera 9.0
Drosophila_eugracilis__Diptera 11.0
Drosophila_ficusphila__Diptera 11.0
Drosophila_erecta__Diptera 12.0
Drosophila_melanogaster__Diptera 13.0
Sequence_A_nan__nan 14.0
Drosophila_sechellia__Diptera 15.0
Drosophila_simulans__Diptera 15.0
Drosophila_suzukii__Diptera 12.0
Drosophila_biarmipes__Diptera 12.0
Name: Interest_sequence, dtype: float64
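To get from such a column back to the original question (which leaves are closest to a given one), a small helper along these lines may be enough. closest_leaves is a hypothetical name, and the function takes a plain mapping of leaf name to distance, so it does not depend on pandas or ete3. The sample data is the Species1_order1 column of the 5x5 matrix shown earlier.

```python
def closest_leaves(distances, target):
    """Return (names, distance) for the leaves nearest to `target`.

    `distances` maps leaf name -> distance from `target`.
    All leaves tied for the minimum distance are returned, sorted by name.
    """
    others = {name: d for name, d in distances.items() if name != target}
    best = min(others.values())
    nearest = sorted(name for name, d in others.items() if d == best)
    return nearest, best

# Species1_order1 column of the matrix for the tree from the question.
column = {
    "Species1_order1": 0.0,
    "Species2_order2": 3.0,
    "Species3_order2": 3.0,
    "Species4_order3": 3.0,
    "Species5_order5": 3.0,
}
nearest, best = closest_leaves(column, "Species1_order1")
print(nearest, best)
```

For Species1_order1 this returns all four other leaves at distance 3.0, which matches the expectation stated in the question. Applied to m['Interest_sequence'].to_dict(), it would pick the two leaves listed at 4.0 in the column above.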