



| name                                               | no1 | no2 | no3 | no4 | no5 |
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
| from Club___Long to Short___Water                  | def | def | def | def | def |  
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
| from Left___House___Hostel(w) to NothingNow___(w)  | def | def | def | def | def |  
| from Laksama to Kota_Dun                           | def | def | def | def | def |  


For example, comparing rows 2 and 3, "from Club___Long to Club___Short___Water" is very similar to "from Club___Long to Short___Water". "from Club___Long to Club___Short___Water" has 7 words, while "from Club___Long to Short___Water" has 6 words. Of the 7 words in "from Club___Long to Club___Short___Water", 6 also appear in "from Club___Long to Short___Water". Since 6 / 7 * 100% = 85.71% is greater than 50%, Python should treat the two rows as a match and copy them.
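The arithmetic in this example can be sketched in a few lines of standard-library Python. The helper names `words` and `overlap_percent` are hypothetical, and two details are assumptions that the question does not spell out: words are split on anything that is not a letter or digit, and the shared-word count is divided by the word count of the longer name.

```python
import re
from collections import Counter

def words(name):
    """Split a name into lowercase words; spaces, underscores and brackets act as separators."""
    return re.findall(r"[a-z0-9]+", name.lower())

def overlap_percent(a, b):
    """Percentage of the longer name's words that also occur in the other name."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    # Multiset intersection, so a repeated word such as "Club" only counts as often as both names contain it
    shared = sum((Counter(wa) & Counter(wb)).values())
    return 100.0 * shared / max(len(wa), len(wb))

score = overlap_percent("from Club___Long to Club___Short___Water",
                        "from Club___Long to Short___Water")
print(round(score, 2))  # 85.71
```

With this definition, any pair of rows scoring above 50 would count as a match.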

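For example, rows 2 to 4 are nearly the same, so Python should match them, recognize them as almost identical, and copy only rows 2 to 4, in full, into a new Excel file named 'new_file_1.xlsx'. The desired output looks like this: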

| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
| from Club___Long to Short___Water                  | def | def | def | def | def |  
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  


| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  


| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  


| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
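One way to arrive at the groups above is a greedy pass that compares each row with the first row of every group found so far. This is only a sketch under assumptions: `group_rows` and `overlap` are hypothetical helpers, the score is the shared-word fraction relative to the longer name, rows without any match are left out as in the desired output, and the numbering of files after 'new_file_1.xlsx' is a guess.

```python
import re
from collections import Counter

def words(name):
    """Lowercase words of a name, split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", name.lower())

def overlap(a, b):
    """Fraction of the longer name's words that also occur in the other name."""
    wa, wb = words(a), words(b)
    shared = sum((Counter(wa) & Counter(wb)).values())
    return shared / max(len(wa), len(wb), 1)

def group_rows(names, threshold=0.5):
    """Greedy grouping: a row joins the first group whose first row it overlaps by more than threshold."""
    groups = []
    for idx, name in enumerate(names):
        for g in groups:
            if overlap(names[g[0]], name) > threshold:
                g.append(idx)
                break
        else:
            groups.append([idx])  # no existing group matched, start a new one
    return groups

names = [
    "from Club___Long to Club___Short___Water",
    "from Club___Long to Short___Water",
    "from Club___Long___Land to Short___Water",
    "from Kinabalu___BB to Penang___AA",
    "from Kinabalu___SD to Penang___SD",
    "from Hill___Town to Unknown___Island___Ice",
    "from Hill___Town to Unknown___Island___Ice",
    "from Hill___Town to Unknown___Island___Ice",
    "from Front___House___AA(N) to Back___Garden(N)",
    "from Front___House___AA___(N) to Back___Garden",
    "from Left___House___Hostel(w) to NothingNow___(w)",
    "from Laksama to Kota_Dun",
]

# Keep only groups with at least two rows; single rows have nothing to be matched with
matched = [g for g in group_rows(names) if len(g) > 1]
for n, row_positions in enumerate(matched, start=1):
    print(f"new_file_{n}.xlsx", row_positions)
```

The printed positions are 0-based indices into `names`. With a DataFrame, each list could be passed to `df.iloc[...]` and the result saved with `to_excel`, which needs an Excel engine such as openpyxl to be installed.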



python pandas matching fuzzy


import difflib
import re

import pandas as pd

def similarity_replace(series):
    """Map each name to a representative of its group of similar names."""
    reverse_map = {}  # cleaned string -> original name
    diz_map = {}      # original name -> cleaned string
    for i, s in series.items():  # Series.iteritems() was removed in pandas 2.0
        # drop the standalone words "from" and "to", then keep letters only
        clean_s = re.sub(r'\b(from|to)\b', '', s.lower())
        clean_s = re.sub(r'[^a-z]', '', clean_s)

        diz_map[s] = clean_s
        reverse_map[clean_s] = s

    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        # among the (up to 3) closest cleaned names, use the alphabetically first as the group label
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6))[0]

    return series.map(diz_map).map(best_match).map(reverse_map)

df = pd.DataFrame({'name':['from Club___Long to Club___Short___Water','from Club___Long to Short___Water',
                           'from Club___Long___Land to Short___Water','from Kinabalu___BB to Penang___AA',
                           'from Kinabalu___SD to Penang___SD','from Hill___Town to Unknown___Island___Ice',
                           'from Hill___Town to Unknown___Island___Ice','from Hill___Town to Unknown___Island___Ice',
                           'from Front___House___AA(N) to Back___Garden(N)','from Front___House___AA___(N) to Back___Garden',
                           'from Left___House___Hostel(w) to NothingNow___(w)','from Laksama to Kota_Dun']})

df['group_name'] = similarity_replace(df.name)
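To get a feel for the cutoff=0.6 used above, difflib can be tried on a few cleaned names without pandas. This is a small sketch: `clean` is a hypothetical helper that removes the standalone words "from" and "to" and keeps letters only, and the three names are taken from the question's table.

```python
import difflib
import re

def clean(name):
    """Lowercase, drop the standalone words 'from'/'to', keep letters only."""
    s = re.sub(r"\b(?:from|to)\b", "", name.lower())
    return re.sub(r"[^a-z]", "", s)

a = clean("from Club___Long to Club___Short___Water")
b = clean("from Club___Long to Short___Water")
c = clean("from Laksama to Kota_Dun")

# ratio() is the similarity score that get_close_matches compares against cutoff
ratio_ab = difflib.SequenceMatcher(None, a, b).ratio()
ratio_ac = difflib.SequenceMatcher(None, a, c).ratio()

# Candidates scoring at least cutoff are returned, best match first
matches = difflib.get_close_matches(a, [a, b, c], n=3, cutoff=0.6)

print(a, b, ratio_ab)
print(a, c, ratio_ac)
print(matches)
```

A name always matches itself with a ratio of 1.0, so the result list is never empty when the name itself is among the candidates. Note that this is a character-based score, not the word-based percentage described in the question, so the two can disagree on borderline cases.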



for i, group in df.groupby('group_name'):
    # i is the group label, group is the sub-DataFrame of the rows that share it

    ### do something ###
    pass