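I want to combine or merge two csv files that share the same index ID, but where the same ID covers a different extent (a different set of years) in each file. The data is also grouped by ID. The first file looks like this: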
ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
The second file looks like this:
ID,year,from,to
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15706,16070
810033622,2000,14610,14975
810033622,2001,14976,15340
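After reading both files into DataFrames, I set ID as the index on each and then concatenated them, but I got the error "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)".
I tried the following code: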
import pandas as pd

sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join='outer', sort=False, axis=1)
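I also tried concatenating the two files without setting the index. The result seems to have the right rows, but the values are not lined up correctly across the columns. I also tried merge, but it produces many unwanted duplicate rows in each group. Because each group has different years and ages, I find it very hard to drop these newly generated rows with this approach.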
full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)
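Maybe someone could suggest an easy way to drop these duplicates, which would also work for me, because the merged file becomes so huge! Thanks in advance!
The expected result would include all the distinct values from both csv files, something like this: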
ID,year,age,from,to
810006862,2000,49,15341,15705
810006862,2001,,15706,16070
810006862,2002,,16071,16436
810006862,2003,52,,
810006862,2004,,,
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15706,16070
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,14610,14975
810033622,2001,25,14976,15340
Can anyone give me any clue on how to do this? Thank you very much!
If you need to merge on the duplicated ID values, add a helper counter:
sp['group'] = sp.groupby('ID').cumcount()
op['group'] = op.groupby('ID').cumcount()
# drop op's year column so only sp's year is kept, then pair rows by ID and position
full = pd.merge(sp, op.drop(columns='year'),
                on=['ID', 'group'],
                how='outer', sort=False).drop(columns='group')
print(full)
          ID  year   age     from       to
0  810006862  2000  49.0  15341.0  15705.0
1  810006862  2001   NaN  15706.0  16070.0
2  810006862  2002   NaN  16071.0  16436.0
3  810006862  2003  52.0      NaN      NaN
4  810023112  2003  27.0  14610.0  14975.0
5  810023112  2004  28.0  14976.0  15340.0
6  810023112  2005  29.0  15706.0  16070.0
7  810023112  2006  30.0      NaN      NaN
8  810033622  2000  24.0  14610.0  14975.0
9  810033622  2001  25.0  14976.0  15340.0
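To make it clearer what the helper counter does, here is a minimal self-contained sketch on a trimmed copy of the sample data (a single ID, inlined with io.StringIO purely for illustration; the name `paired` is made up for this example). It assumes a pandas version where `drop(columns=...)` is available.

```python
import io
import pandas as pd

# Trimmed copies of the two sample files (one ID only), inlined for illustration.
sp = pd.read_csv(io.StringIO(
    "ID,year,age\n"
    "810023112,2003,27\n"
    "810023112,2004,28\n"
    "810023112,2005,29\n"
    "810023112,2006,30\n"
))
op = pd.read_csv(io.StringIO(
    "ID,year,from,to\n"
    "810023112,2000,14610,14975\n"
    "810023112,2001,14976,15340\n"
    "810023112,2003,15706,16070\n"
))

# cumcount numbers the rows inside each ID group: 0, 1, 2, ...
sp["group"] = sp.groupby("ID").cumcount()
op["group"] = op.groupby("ID").cumcount()
print(sp[["ID", "year", "group"]])
print(op[["ID", "year", "group"]])

# Merging on ID + group pairs the n-th row of an ID in sp with the
# n-th row of the same ID in op, regardless of what year each row has.
paired = pd.merge(sp, op.drop(columns="year"), on=["ID", "group"], how="outer")
print(paired)
```

Note that this pairs rows purely by their position within each ID: in the sketch, the 2003 row from the first file receives the from/to values of the 2000 row from the second file. That is fine when position is what links the rows, but it ignores the year column of the second file.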
But if you also need to match on year, merge on both columns:
full = pd.merge(sp, op, on = ['ID', 'year'], how = 'outer', sort = False)
print (full)
           ID  year   age     from       to
 0  810006862  2000  49.0      NaN      NaN
 1  810006862  2001   NaN      NaN      NaN
 2  810006862  2002   NaN  15341.0  15705.0
 3  810006862  2003  52.0  15706.0  16070.0
 4  810023112  2003  27.0  15706.0  16070.0
 5  810023112  2004  28.0      NaN      NaN
 6  810023112  2005  29.0      NaN      NaN
 7  810023112  2006  30.0      NaN      NaN
 8  810033622  2000  24.0  14610.0  14975.0
 9  810033622  2001  25.0  14976.0  15340.0
10  810006862  2004   NaN  16071.0  16436.0
11  810006862  2005   NaN      NaN      NaN
12  810023112  2000   NaN  14610.0  14975.0
13  810023112  2001   NaN  14976.0  15340.0
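In the year-matched result the rows that exist only in the second file end up at the bottom. If you would rather have them grouped by ID again, one possible follow-up is to sort by ID and year. The sketch below is self-contained (sample data inlined with io.StringIO for illustration) and also passes `indicator=True`, an optional merge parameter that adds a `_merge` column showing which file each row came from; the name `counts` is only used for this example.

```python
import io
import pandas as pd

# The two sample files from the question, inlined for illustration.
sp = pd.read_csv(io.StringIO("""ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
"""))
op = pd.read_csv(io.StringIO("""ID,year,from,to
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15706,16070
810033622,2000,14610,14975
810033622,2001,14976,15340
"""))

# Outer merge on both keys; _merge is 'both', 'left_only' or 'right_only'.
full = pd.merge(sp, op, on=["ID", "year"], how="outer", indicator=True)

# Put every ID's rows back together, ordered by year, with a fresh index.
full = full.sort_values(["ID", "year"]).reset_index(drop=True)
print(full)

# How many rows matched in both files versus only one of them.
counts = full["_merge"].value_counts()
print(counts)
```

Once you have checked where the rows came from, the `_merge` column can be removed with `full.drop(columns="_merge")`.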