Merging inconsistent pandas DataFrames


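I have two pandas DataFrames, SC and SB:

  • SC contains physical statistics for football players in matches.
  • SB contains tracking statistics for football players in matches.

import pandas as pd

# Sample data for SC (physical statistics)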
data_sc = {
    'Player ID': [1, 2, 3, 4],
    'Player': ['Cristiano Ronaldo', 'Leo Messi', 'Neymar Jr.', 'Erling Haaland'],
    'D.O.B.': ['1985-02-05', '1987-06-24', '1992-02-05', '1991-06-28'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SC Rating': [90, 91, 92, 93],
}

SC = pd.DataFrame(data_sc)

# Sample data for SB (tracking statistics)
data_sb = {
    'Player ID': [101, 102, 103, 104],
    'Player': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi', 'Neymar', 'Erling Haland'],
    'D.O.B.': ['1985-02-05', '1987-06-23', '1992-02-05', '1991-06-29'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SB Rating': [91, 92, 93, 94],
}

SB = pd.DataFrame(data_sb)

Desired output:

   Player ID              Player      D.O.B.     Competition  SC Rating  SB Rating
0          1   Cristiano Ronaldo  1985-02-05         La Liga         90         91
1          2        Lionel Messi  1987-06-24         La Liga         91         92
2          3          Neymar Jr.  1992-02-05         Ligue 1         92         93
3          4      Erling Haaland  1991-06-28  Premier League         93         94

Both DataFrames share the following columns:

  • Player ID
  • Player
  • D.O.B.
  • Competition

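I want to merge these DataFrames, but they come from different data sources, so their variables follow different formats and conventions. Although both DataFrames have a "Player ID" column with unique numeric values, the IDs differ between the datasets (i.e. the same player has a different ID in each).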

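The problem is that these columns are formatted inconsistently. For example, the names in Player can differ between the DataFrames: if a player goes by several names, SC may use a different variant (and/or spelling) of the name than SB. The players' D.O.B. values are also inconsistent, so the same player can have a different date of birth in SC and SB.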
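
To make the problem concrete, here is a minimal sketch of what a plain merge does with this data. It uses a trimmed copy of the sample data above; the lowercase names sc and sb and the reduced column set are only for illustration.

```python
import pandas as pd

# Trimmed copies of the sample data (only the columns needed for the demo)
sc = pd.DataFrame({
    "Player": ["Cristiano Ronaldo", "Leo Messi", "Neymar Jr.", "Erling Haaland"],
    "D.O.B.": ["1985-02-05", "1987-06-24", "1992-02-05", "1991-06-28"],
    "SC Rating": [90, 91, 92, 93],
})
sb = pd.DataFrame({
    "Player": ["Cristiano Ronaldo dos Santos Aveiro", "Lionel Messi", "Neymar", "Erling Haland"],
    "D.O.B.": ["1985-02-05", "1987-06-23", "1992-02-05", "1991-06-29"],
    "SB Rating": [91, 92, 93, 94],
})

# Exact merge on both keys: no name is spelled identically, so nothing joins
exact = pd.merge(sc, sb, on=["Player", "D.O.B."], how="inner")

# Merge on D.O.B. alone: only the players whose dates agree exactly are joined
by_dob = pd.merge(sc, sb, on="D.O.B.", how="inner", suffixes=("_sc", "_sb"))

print(len(exact), len(by_dob))  # prints: 0 2
```

So neither key works on its own with an exact join: the names never match, and the dates match for only two of the four players.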

How should I approach this merge?

python pandas dataframe merge
1 Answer

Use fuzzywuzzy to score the matches between player names, and also allow some tolerance on the D.O.B. Something like this:

import pandas as pd
from fuzzywuzzy import process
from datetime import timedelta

# Sample data for SC (physical statistics)
data_sc = {
    'Player ID': [1, 2, 3, 4],
    'Player': ['Cristiano Ronaldo', 'Leo Messi', 'Neymar Jr.', 'Erling Haaland'],
    'D.O.B.': ['1985-02-05', '1987-06-24', '1992-02-05', '1991-06-28'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SC Rating': [90, 91, 92, 93],
}

SC = pd.DataFrame(data_sc)

# Sample data for SB (tracking statistics)
data_sb = {
    'Player ID': [101, 102, 103, 104],
    'Player': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi', 'Neymar', 'Erling Haland'],
    'D.O.B.': ['1985-02-05', '1987-06-23', '1992-02-05', '1991-06-29'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SB Rating': [91, 92, 93, 94],
}

SB = pd.DataFrame(data_sb)




def fuzzy_date_matching_with_score(df1, df2, player_key1, player_key2, date_key1, date_key2, threshold=90, date_tolerance_days=1):
    # Work on copies so the caller's DataFrames are not modified in place
    df1 = df1.copy()
    df2 = df2.copy()

    # Fuzzy-match each name in df1 against all names in df2;
    # extractOne returns the best candidate first and its score (0-100) second
    matches = df1[player_key1].apply(
        lambda x: process.extractOne(x, df2[player_key2]))

    # Only keep matches with a score at or above the threshold
    df1['match_name'] = matches.apply(lambda x: x[0] if x[1] >= threshold else None)
    df1['match_score'] = matches.apply(lambda x: x[1] if x[1] >= threshold else None)  # Store the score

    # Prepare for date comparison by ensuring dates are datetime objects
    df1[date_key1] = pd.to_datetime(df1[date_key1])
    df2[date_key2] = pd.to_datetime(df2[date_key2])

    # Duplicate every df2 row once per day offset in the tolerance window,
    # so that an exact date merge below behaves like a +/- N day match
    df2_expanded = pd.concat([
        df2.assign(**{date_key2: df2[date_key2] + timedelta(days=i)})
        for i in range(-date_tolerance_days, date_tolerance_days + 1)
    ])

    # Merge on the (shifted) dates and the fuzzy-matched names
    merged = pd.merge(df1, df2_expanded, left_on=[date_key1, 'match_name'], right_on=[date_key2, player_key2])
    # Include only the rows where there is a match name
    return merged[merged['match_name'].notna()]


merged_df = fuzzy_date_matching_with_score(SC, SB, 'Player', 'Player', 'D.O.B.', 'D.O.B.', threshold=70, date_tolerance_days=1)

Output:

print(merged_df.to_string())
   Player ID_x           Player_x     D.O.B.   Competition_x  SC Rating                           match_name  match_score  Player ID_y                             Player_y   Competition_y  SB Rating
0            1  Cristiano Ronaldo 1985-02-05         La Liga         90  Cristiano Ronaldo dos Santos Aveiro           90          101  Cristiano Ronaldo dos Santos Aveiro         La Liga         91
1            2          Leo Messi 1987-06-24         La Liga         91                         Lionel Messi           76          102                         Lionel Messi         La Liga         92
2            3         Neymar Jr. 1992-02-05         Ligue 1         92                               Neymar           90          103                               Neymar         Ligue 1         93
3            4     Erling Haaland 1991-06-28  Premier League         93                        Erling Haland           96          104                        Erling Haland  Premier League         94
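
The merged frame still carries the _x/_y suffixes and the helper columns, whereas the question asks for six columns. One possible way to get there is sketched below. The small merged_df built here is only a stand-in with the same column names as the output above, and the wanted mapping is a hypothetical choice that keeps the SC-side ID, name and competition.

```python
import pandas as pd

# Stand-in for the merged_df printed above (two rows only, for brevity)
merged_df = pd.DataFrame({
    "Player ID_x": [1, 2],
    "Player_x": ["Cristiano Ronaldo", "Leo Messi"],
    "D.O.B.": pd.to_datetime(["1985-02-05", "1987-06-24"]),
    "Competition_x": ["La Liga", "La Liga"],
    "SC Rating": [90, 91],
    "match_name": ["Cristiano Ronaldo dos Santos Aveiro", "Lionel Messi"],
    "match_score": [90, 76],
    "Player ID_y": [101, 102],
    "Player_y": ["Cristiano Ronaldo dos Santos Aveiro", "Lionel Messi"],
    "Competition_y": ["La Liga", "La Liga"],
    "SB Rating": [91, 92],
})

# Assumption: keep the SC-side identifiers plus both ratings.
# Keys are the columns to keep, values are their final names.
wanted = {
    "Player ID_x": "Player ID",
    "Player_x": "Player",
    "D.O.B.": "D.O.B.",
    "Competition_x": "Competition",
    "SC Rating": "SC Rating",
    "SB Rating": "SB Rating",
}
tidy = merged_df[list(wanted)].rename(columns=wanted)

# Sanity check: each SC player should appear at most once after matching
assert not tidy["Player ID"].duplicated().any()

print(tidy.columns.tolist())
```

Which side's name and D.O.B. to keep is a judgment call. If the SB spelling is preferred for some players, map "Player_y" to "Player" instead. With a low threshold such as 70 it is also worth inspecting match_score before dropping it.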