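Hi, I would like to ask how to copy certain rows from one Excel file to another. Using Python fuzzy matching or ANY other workable method, I want to match entire rows by name and copy them into a new Excel file.
Here is the input data from the first Excel file. It has 13 rows and 6 columns in total, as shown below: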
-----------------------------------------------------|-----|-----|-----|-----|-----|
| name | no1 | no2 | no3 | no4 | no5 |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N) | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Left___House___Hostel(w) to NothingNow___(w) | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Laksama to Kota_Dun | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
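Starting from the first row, I want Python to identify rows with approximately similar names, then copy each entire row and paste it into a new Excel file. Similarity should be measured by words rather than letters, that is, by how many words are the same: if the share is greater than or equal to some threshold (say 50%), the row is copied.
For example, comparing rows 2 and 3, "from Club___Long to Club___Short___Water" is very similar to "from Club___Long to Short___Water". The first name has 7 words and the second has 6. Of the 7 words in the first name, 6 also appear in the second, so 6 / 7 * 100% = 85.71%, which is greater than 50%, and Python would treat the two rows as a match and copy them.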
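The word-based comparison described above can be sketched as follows. This is only a minimal illustration: the helper names `split_words` and `word_similarity` are hypothetical, words are assumed to be separated by spaces or underscores, and shared words are counted with multiplicity and divided by the word count of the longer name, which is one reading of the rule that reproduces the 6 / 7 figure.

```python
import re
from collections import Counter


def split_words(name):
    """Split a name on whitespace and underscores (assumed separators)."""
    return re.findall(r"[^\s_]+", name)


def word_similarity(a, b):
    """Share of words in the longer name that also occur in the other name."""
    words_a, words_b = split_words(a), split_words(b)
    if not words_a or not words_b:
        return 0.0
    # Counter intersection counts a repeated word only as often as
    # both names contain it ("Club" appears twice in one name, once in the other).
    shared = sum((Counter(words_a) & Counter(words_b)).values())
    return shared / max(len(words_a), len(words_b))


first = "from Club___Long to Club___Short___Water"
second = "from Club___Long to Short___Water"
score = word_similarity(first, second)
print(f"{score:.2%}")  # 85.71%
print(score >= 0.5)    # True: passes the 50% threshold from the example
```

Counting with multiplicity matters here: a plain set comparison would treat the repeated "Club" as matched twice and report 100% instead of 85.71%.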
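For example, rows 2 to 4 are roughly the same, so Python should match them, recognise them as nearly identical, and copy only rows 2 to 4 in full into a new Excel file named 'new_file_1.xlsx'. The desired output is shown below: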
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
Rows 5 and 6 are similar to each other and should go into a file named 'new_file_2.xlsx'. The desired output is shown below:
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
Rows 7 to 9 are the same and should go into a file named 'new_file_3.xlsx'. The desired output is shown below:
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
Rows 10 and 11 are similar to each other and should go into a file named 'new_file_4.xlsx'. The desired output is shown below:
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N) | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden | def | def | def | def | def |
-----------------------------------------------------|-----|-----|-----|-----|-----|
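Rows 12 and 13 are each unlike any other row, so they do not need to be copied; they can simply be left as they are.
Any help would be greatly appreciated, thank you!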
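I created a function that replaces near-duplicate names. It is based on fuzzy string matching. For a given cutoff, each name is replaced with a representative chosen from its closest matches among all the names (the alphabetically first of up to three), and a new column stores these group names.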
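import difflib
import re

import pandas as pd


def similarity_replace(series):
    reverse_map = {}  # cleaned name -> original name
    diz_map = {}      # original name -> cleaned name
    for _, s in series.items():  # Series.iteritems() was removed in pandas 2.0
        # Drop the standalone words "from" and "to", then keep letters only.
        # The word boundaries stop "to" being cut out of words such as "Town".
        clean_s = re.sub(r'\b(from|to)\b', '', s.lower())
        clean_s = re.sub(r'[^a-z]', '', clean_s)
        diz_map[s] = clean_s
        reverse_map[clean_s] = s

    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        # Take up to 3 close matches (w itself always qualifies) and use the
        # alphabetically first one as the representative.
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6))[0]

    return series.map(diz_map).map(best_match).map(reverse_map)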
df = pd.DataFrame({
    'name': ['from Club___Long to Club___Short___Water', 'from Club___Long to Short___Water',
             'from Club___Long___Land to Short___Water', 'from Kinabalu___BB to Penang___AA',
             'from Kinabalu___SD to Penang___SD', 'from Hill___Town to Unknown___Island___Ice',
             'from Hill___Town to Unknown___Island___Ice', 'from Hill___Town to Unknown___Island___Ice',
             'from Front___House___AA(N) to Back___Garden(N)', 'from Front___House___AA___(N) to Back___Garden',
             'from Left___House___Hostel(w) to NothingNow___(w)', 'from Laksama to Kota_Dun'],
    'no1': ['adb', 'adb', 'adb', 'adb', 'adb', 'adb', 'adb', 'adb', 'adb', 'adb', 'adb', 'adb'],
})
df['group_name'] = similarity_replace(df.name)
df
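To make the cutoff concrete, here is a small illustration of how `difflib` scores a few of the cleaned names. The cleaned strings are written out by hand for this sketch; `SequenceMatcher.ratio` and `get_close_matches` are standard-library calls, and 0.6 is the cutoff used in the function above.

```python
import difflib

cleaned = [
    "clublongclubshortwater",  # from Club___Long to Club___Short___Water
    "clublongshortwater",      # from Club___Long to Short___Water
    "kinabalubbpenangaa",      # from Kinabalu___BB to Penang___AA
]

# ratio() is 2 * (matched characters) / (total characters in both strings).
ratio = difflib.SequenceMatcher(None, cleaned[0], cleaned[1]).ratio()
print(round(ratio, 2))  # 0.9

# Only candidates scoring at least `cutoff` are returned, best match first.
close = difflib.get_close_matches(cleaned[1], cleaned, n=3, cutoff=0.6)
print(close)
```

The two Club names clear the cutoff and land in the same candidate list, while the Kinabalu name scores well below 0.6 against either of them and is left out.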
We can use this column to group all the similar rows together:
for group_name, group in df.groupby('group_name'):
    ### do something ###
    print(group[['name', 'no1']])
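As a variation closer to the word-count rule in the question, rows could also be grouped directly by shared words. The sketch below is a hedged illustration rather than part of the function above: `word_similarity` and `group_rows` are hypothetical helpers, the 50% threshold is taken from the question, and each row is compared only with the first member of each existing group, which is enough for the sample data but is not guaranteed to suit other data.

```python
import re
from collections import Counter


def word_similarity(a, b):
    """Shared words (with multiplicity) divided by the longer name's word count."""
    words_a = re.findall(r"[^\s_]+", a)
    words_b = re.findall(r"[^\s_]+", b)
    if not words_a or not words_b:
        return 0.0
    shared = sum((Counter(words_a) & Counter(words_b)).values())
    return shared / max(len(words_a), len(words_b))


def group_rows(names, threshold=0.5):
    """Return lists of row positions. A row joins the first group whose first
    member is at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for pos, name in enumerate(names):
        for group in groups:
            if word_similarity(names[group[0]], name) >= threshold:
                group.append(pos)
                break
        else:
            groups.append([pos])
    return groups


names = [
    "from Club___Long to Club___Short___Water",
    "from Club___Long to Short___Water",
    "from Club___Long___Land to Short___Water",
    "from Kinabalu___BB to Penang___AA",
    "from Kinabalu___SD to Penang___SD",
    "from Hill___Town to Unknown___Island___Ice",
    "from Hill___Town to Unknown___Island___Ice",
    "from Hill___Town to Unknown___Island___Ice",
    "from Front___House___AA(N) to Back___Garden(N)",
    "from Front___House___AA___(N) to Back___Garden",
    "from Left___House___Hostel(w) to NothingNow___(w)",
    "from Laksama to Kota_Dun",
]

# Keep only groups with more than one row; unmatched rows are left alone.
matched = [g for g in group_rows(names) if len(g) > 1]
print(matched)  # [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]]
```

With the data in a DataFrame such as `df`, the n-th matched group could then be saved with something like `df.iloc[g].to_excel(f'new_file_{n}.xlsx', index=False)`, assuming pandas and an Excel writer engine such as openpyxl are installed.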