The correct way to merge two CSV files without duplicates based on a column in Python

I want to merge 2 CSV files into a single CSV and remove all duplicate rows based on one column (the second column).

This is my first CSV file:

Skufnoo,748702985,-6026769894509215039,ВупÑень пупÑень â¤ï¸â€ðŸ©¹ðŸ’—,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
mAtkmb,5213786988,4161254730445748607,ДаниÑль Блинов,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
sheluvjoseph,1421438213,8544915453690665435,អន សំអុល,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

The second CSV file:

cchamnap,748702985,-7259273529368744780,Chim,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0
chhounkha,765670208,3636141294788837002,Chhuon Sokha,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
CHHORMNIMOL8,5213786988,5104468652588260401,ឌី ណា.,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0
Chhailin17,1133044248,6931066845789435875,Chhai Lin,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

The output file (own_updated2.csv) should be:

Skufnoo,748702985,-6026769894509215039,ВупÑень пупÑень â¤ï¸â€ðŸ©¹ðŸ’—,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,4,True,False,0
mAtkmb,5213786988,4161254730445748607,ДаниÑль Блинов,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
sheluvjoseph,1421438213,8544915453690665435,អន សំអុល,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0
chhounkha,765670208,3636141294788837002,Chhuon Sokha,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,False,False,False,0
Chhailin17,1133044248,6931066845789435875,Chhai Lin,AA2888 ចាក់បាល់និងកាសុីណូអនឡាញ (070645555),1746008070,False,False,5,True,False,0

I tried the following code:

import pandas as pd
import csv

df1 = pd.read_csv("own1.csv")
df2 = pd.read_csv("own2.csv")
merged = pd.concat([df1,df2])

with open('own_updated.csv', 'w', newline="", encoding='utf-8') as nf:
    merged.to_csv(nf, index=False)

with open('own_updated.csv', 'r', encoding="utf8") as in_file, open('own_updated2.csv', 'w', newline="", encoding="utf8") as out_file:
    in_data = csv.reader(in_file, delimiter=',')
    writer = csv.writer(out_file)
    tracks = set()  # Tracking duplicates of the second column's cell
    for row in in_data:
        key = row[1]
        if key not in tracks:
            writer.writerow(row)
            tracks.add(key)
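It works well. The problem is that it creates an unwanted extra file, own_updated.csv. How can I hold the merged data from both CSV files in memory, without creating own_updated.csv, and then remove the duplicates based on the second column?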

python pandas csv
1 Answer
Just drop the duplicates from the merged DataFrame; there is no need for an intermediate file:

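import pandas as pd

# header=None: the files have no header row, so the first line is data
df1 = pd.read_csv('own1.csv', header=None)
df2 = pd.read_csv('own2.csv', header=None)

# Concatenate in memory, then keep the first row for each value in column 1
merged = pd.concat([df1, df2]).drop_duplicates([1], keep='first').reset_index(drop=True)

with open('own_updated.csv', 'w', newline='', encoding='utf-8') as nf:
    merged.to_csv(nf, index=False, header=False)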
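
To check the behaviour without touching any files, the same idea can be tried on small in-memory buffers. The following is only a sketch: the helper name `merge_unique` and the shortened sample rows are made up for illustration, and `dtype=str` is an extra assumption (not part of the answer above) that keeps every field as text so values are not reinterpreted by pandas' type inference.

```python
import io

import pandas as pd


def merge_unique(sources, key_col=1):
    """Concatenate header-less CSV sources and keep the first row per key_col value.

    `sources` may be file paths or file-like objects (hypothetical helper).
    """
    # dtype=str reads every field as text, so nothing is converted on the way through
    frames = [pd.read_csv(src, header=None, dtype=str) for src in sources]
    merged = pd.concat(frames, ignore_index=True)
    # With header=None the columns are labelled 0, 1, 2, ...; 1 is the second column
    return merged.drop_duplicates(subset=[key_col], keep="first").reset_index(drop=True)


# Shortened, made-up rows standing in for own1.csv and own2.csv
first = io.StringIO("alice,100,x\nbob,200,y\n")
second = io.StringIO("carol,100,z\ndave,300,w\n")

result = merge_unique([first, second])
print(result.to_csv(index=False, header=False))
```

Here the row for carol is dropped because the key 100 already appeared in the first source, which mirrors how the rows for cchamnap and CHHORMNIMOL8 are excluded from the expected output in the question. Passing real paths such as 'own1.csv' and 'own2.csv' instead of the buffers should work the same way.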
    