I have a pandas DataFrame named Network with the following structure:
{'Sup': {0: 1002000157,
1: 1002000157,
2: 1002000157,
3: 1002000157,
4: 1002000157,
5: 1002000157,
6: 1002000157,
7: 1002000157,
8: 1002000157,
9: 1002000157,
10: 1002000157,
11: 1002000157,
12: 1002000157,
13: 1002000382,
14: 1002000382,
15: 1002000382,
16: 1002000382,
17: 1002000382,
18: 1002000382,
19: 1002000382,
20: 1002000382,
21: 1002000382,
22: 1002000382,
23: 1002000382,
24: 1002000382,
25: 1002000382,
26: 1002000382,
27: 1002000382,
28: 1002000382,
29: 1002000382},
'Cust': {0: 1002438313,
1: 8039296054,
2: 9003188096,
3: 14900070991,
4: 17005234747,
5: 18006860724,
6: 28000286091,
7: 29009623382,
8: 39000007702,
9: 39004420023,
10: 46000088397,
11: 50000063751,
12: 7000090017,
13: 1900120936,
14: 1900779883,
15: 2000013994,
16: 2001222824,
17: 2003032125,
18: 2900121723,
19: 2900197555,
20: 2902742641,
21: 3000101113,
22: 3000195031,
23: 3000318054,
24: 3900091301,
25: 3911084436,
26: 4900112325,
27: 5900720933,
28: 7000001703,
29: 8000004881}}
I want to reproduce this R command in Python (ideally without the kernel being interrupted):
NodesSharingSupplier <- inner_join(Network, Network, by=c('Sup'='Sup'))
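If I understand correctly, this is a SQL-style inner join, so I am worried that it cannot be reproduced with a simple inner merge on Sup in Python.
Can you help me figure out how to reproduce it in Python?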
Use merge:
NodesSharingSupplier = Network.merge(Network, on='Sup', how='inner')
print(NodesSharingSupplier)
# Output
            Sup       Cust_x       Cust_y
0    1002000157   1002438313   1002438313
1    1002000157   1002438313   8039296054
2    1002000157   1002438313   9003188096
3    1002000157   1002438313  14900070991
4    1002000157   1002438313  17005234747
..          ...          ...          ...
453  1002000382   8000004881   3911084436
454  1002000382   8000004881   4900112325
455  1002000382   8000004881   5900720933
456  1002000382   8000004881   7000001703
457  1002000382   8000004881   8000004881

[458 rows x 3 columns]
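The row count can be checked before running the merge, which may help if the concern is memory use on a larger frame. A self-join on Sup pairs every row with every row of the same supplier, so each supplier with n rows contributes n² rows. Here the two suppliers have 13 and 17 rows, giving 13² + 17² = 458. A minimal sketch, assuming placeholder Cust values (only the group sizes matter for the count):

```python
import pandas as pd

# Same group sizes as the question's data: 13 rows for one supplier, 17 for the other.
# The Cust values are hypothetical placeholders.
Network = pd.DataFrame({
    "Sup": [1002000157] * 13 + [1002000382] * 17,
    "Cust": list(range(30)),
})

# Each supplier with n rows yields n * n rows in the self-join.
expected_rows = int((Network.groupby("Sup").size() ** 2).sum())

merged = Network.merge(Network, on="Sup", how="inner")
print(expected_rows, len(merged))
```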
You can remove the cases where Cust_x == Cust_y by appending .query('Cust_x != Cust_y') after .merge(...).
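A short sketch of the self-pair filter on a five-row subset of the question's data. The last step, which keeps each unordered pair only once, is an optional extra that is not part of the original answer:

```python
import pandas as pd

# Five rows taken from the question's data: three customers of one supplier, two of another.
Network = pd.DataFrame({
    "Sup": [1002000157, 1002000157, 1002000157, 1002000382, 1002000382],
    "Cust": [1002438313, 8039296054, 9003188096, 1900120936, 1900779883],
})

# Self-join on Sup: every customer is paired with every customer of the same supplier.
pairs = Network.merge(Network, on="Sup", how="inner")

# Drop the rows where a customer is paired with itself.
pairs_no_self = pairs.query("Cust_x != Cust_y")

# Optional (an assumption about what may be wanted): keep each unordered pair once.
unique_pairs = pairs.query("Cust_x < Cust_y")

print(len(pairs), len(pairs_no_self), len(unique_pairs))
```

Whether the mirrored pairs (A, B) and (B, A) should both be kept depends on how the resulting edge list is used downstream.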
Input:
import pandas as pd
data = {'Sup': {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
'Cust': {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
Network = pd.DataFrame(data)
More information: Pandas Merging 101
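Inner join in R: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you will most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you are matching on only the fields you want. If the matching variables have different names in the two data frames, you can also use the by.x and by.y parameters.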
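For comparison, the closest pandas counterparts to R's by, by.x and by.y are the on, left_on and right_on parameters of DataFrame.merge. A minimal sketch, assuming two hypothetical frames (df1, df2 and their column names are made up for illustration):

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
df1 = pd.DataFrame({"CustomerId": [1, 2, 3], "Product": ["a", "b", "c"]})
df2 = pd.DataFrame({"customer": [2, 3, 4], "State": ["x", "y", "z"]})

# Roughly what merge(df1, df2, by.x = "CustomerId", by.y = "customer") does in R.
# Unlike R, pandas keeps both key columns in the result.
joined = df1.merge(df2, left_on="CustomerId", right_on="customer", how="inner")
print(joined)
```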