Group Pandas: how to merge two CSV files that share the same index but cover a different extent of rows for each ID

Question  Votes: 0  Answers: 1

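I want to merge or join two CSV files that share the same index IDs, where the same ID covers a different extent of rows (years) in each file. The data is also grouped by ID. The first file looks like this: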

ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25

The second file looks like this:

ID,year,from,to
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15706,16070
810033622,2000,14610,14975
810033622,2001,14976,15340

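After reading both files into DataFrames, I set ID as the index on each and then concatenated them, but I got the error message "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)".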

I tried the following code:

sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join = 'outer', sort = False, axis = 1)
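A likely cause of this error is that ID repeats within each file, so the index is not unique and `pd.concat(..., axis=1)` cannot line the rows up one-to-one. The following is a minimal sketch, not a definitive fix: it uses inline sample rows as stand-ins for `sp1.csv` and `op1.csv`, and it assumes that each (ID, year) pair occurs at most once per file.

```python
import io
import pandas as pd

# Inline stand-ins for sp1.csv and op1.csv (a subset of the rows shown above).
sp_csv = """ID,year,age
810006862,2000,49
810006862,2001,
810033622,2000,24
"""
op_csv = """ID,year,from,to
810006862,2002,15341,15705
810006862,2003,15706,16070
810033622,2000,14610,14975
"""

sp = pd.read_csv(io.StringIO(sp_csv)).set_index("ID")
op = pd.read_csv(io.StringIO(op_csv)).set_index("ID")

# ID repeats within each file, so neither index is unique.
print(sp.index.is_unique, op.index.is_unique)

# Indexing on (ID, year) gives every row a unique label (assumption:
# no (ID, year) pair is duplicated), so concat can align column-wise.
sp2 = sp.reset_index().set_index(["ID", "year"])
op2 = op.reset_index().set_index(["ID", "year"])
ff = pd.concat([sp2, op2], axis=1, join="outer")
print(ff)
```

With a unique index, the outer join keeps one row per (ID, year) pair found in either file and fills the missing side with NaN.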

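I also tried concatenating the two files without setting the index. The result seems to have the right rows, but the values are not aligned correctly across columns. I also tried merge, but it produces many unnecessary duplicate rows in each group. Since each group has different years and ages, I find it very hard to remove these newly generated rows with this approach.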

full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)

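Perhaps someone can suggest a way to easily remove these duplicates, which would work for me too, since the merged file becomes so huge! Thanks in advance!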
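The size of the merged result can be explained by how merging on ID alone works: every row for an ID in one frame is paired with every row for the same ID in the other. The sketch below illustrates this with a subset of the sample rows; the inline data and the variable name `expected_rows` are only for illustration.

```python
import io
import pandas as pd

# Subset of the sample data: two IDs only.
sp = pd.read_csv(io.StringIO("""ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810033622,2000,24
810033622,2001,25
"""))
op = pd.read_csv(io.StringIO("""ID,year,from,to
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810033622,2000,14610,14975
810033622,2001,14976,15340
"""))

full = pd.merge(sp, op, on="ID", how="outer", sort=False)

# Each ID contributes (rows in sp) * (rows in op) rows to the result.
expected_rows = (sp.groupby("ID").size() * op.groupby("ID").size()).sum()
print(len(full), expected_rows)

# 'year' is present in both frames but is not a join key, so pandas
# keeps both copies and suffixes them.
print(list(full.columns))
```

Because the row count grows with the product of the group sizes, removing the extra rows afterwards tends to be harder than adding a second join key up front.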

The expected result would include all distinct values from both CSV files, something like this:

ID,year,age,from,to
810006862,2000,49,15341,15705
810006862,2001,,15706,16070
810006862,2002,,16071,16436
810006862,2003,52,,
810006862,2004,,,
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15706,16070
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,14610,14975
810033622,2001,25,14976,15340

Can anyone offer any clues on how to do this? Many thanks!

pandas group-concat
1 Answer
-1
votes

If you need to merge on duplicated ID values, add a helper counter:

sp['group'] = sp.groupby('ID').cumcount()
op['group'] = op.groupby('ID').cumcount()
full = pd.merge(sp, op.drop(columns='year'),
                on=['ID', 'group'],
                how='outer', sort=False).drop('group', axis=1)
print(full)
          ID  year   age     from       to
0  810006862  2000  49.0  15341.0  15705.0
1  810006862  2001   NaN  15706.0  16070.0
2  810006862  2002   NaN  16071.0  16436.0
3  810006862  2003  52.0      NaN      NaN
4  810023112  2003  27.0  14610.0  14975.0
5  810023112  2004  28.0  14976.0  15340.0
6  810023112  2005  29.0  15706.0  16070.0
7  810023112  2006  30.0      NaN      NaN
8  810033622  2000  24.0  14610.0  14975.0
9  810033622  2001  25.0  14976.0  15340.0
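To see what the helper counter does, here is a small sketch with a hypothetical frame built from the ID and year columns of the second file. `cumcount()` numbers the rows within each ID starting at 0, so merging on ['ID', 'group'] pairs the n-th row of an ID in one file with the n-th row of the same ID in the other, regardless of year. That is why, in the output above, the 2003 row of 810023112 picks up the from/to values that belong to year 2000 in the second file.

```python
import pandas as pd

# Hypothetical frame: the ID and year columns for two IDs of the second file.
op = pd.DataFrame({
    "ID": [810023112, 810023112, 810023112, 810033622, 810033622],
    "year": [2000, 2001, 2003, 2000, 2001],
})

# Position of each row within its ID group, starting from 0.
op["group"] = op.groupby("ID").cumcount()
print(op)
```

This pairing is purely positional, so it is only appropriate when row order within each ID, not the year, is what links the two files.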

But if needed, you can also match on year:

full = pd.merge(sp, op, on = ['ID', 'year'], how = 'outer', sort = False)

print(full)
           ID  year   age     from       to
0   810006862  2000  49.0      NaN      NaN
1   810006862  2001   NaN      NaN      NaN
2   810006862  2002   NaN  15341.0  15705.0
3   810006862  2003  52.0  15706.0  16070.0
4   810023112  2003  27.0  15706.0  16070.0
5   810023112  2004  28.0      NaN      NaN
6   810023112  2005  29.0      NaN      NaN
7   810023112  2006  30.0      NaN      NaN
8   810033622  2000  24.0  14610.0  14975.0
9   810033622  2001  25.0  14976.0  15340.0
10  810006862  2004   NaN  16071.0  16436.0
11  810006862  2005   NaN      NaN      NaN
12  810023112  2000   NaN  14610.0  14975.0
13  810023112  2001   NaN  14976.0  15340.0
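In the output above, the rows that exist only in the second file are appended at the end. If the result should be grouped by ID and ordered by year, as in the layout the question asks for, one option is to sort after merging. The sketch below uses the sample data from the question inline. The conversion to the nullable `Int64` dtype is an optional extra, not part of the original answer, and assumes the columns hold whole numbers.

```python
import io
import pandas as pd

sp = pd.read_csv(io.StringIO("""ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
"""))
op = pd.read_csv(io.StringIO("""ID,year,from,to
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15706,16070
810033622,2000,14610,14975
810033622,2001,14976,15340
"""))

# Outer merge on both keys, then order rows by ID and year.
full = (
    pd.merge(sp, op, on=["ID", "year"], how="outer")
    .sort_values(["ID", "year"])
    .reset_index(drop=True)
)

# Optional: nullable integers keep whole numbers from being shown as floats.
full = full.astype({"age": "Int64", "from": "Int64", "to": "Int64"})

# Missing values are written as empty fields, as in the input files.
print(full.to_csv(index=False))
```

Sorting explicitly also avoids relying on the row order that `merge` happens to produce, which has differed between pandas versions.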