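I have two pandas dataframes, SC and SB. SC contains physical statistics for football players in matches. SB contains tracking statistics for football players in matches.
import pandas as pd

# Sample data for SC (physical statistics)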
data_sc = {
    'Player ID': [1, 2, 3, 4],
    'Player': ['Cristiano Ronaldo', 'Leo Messi', 'Neymar Jr.', 'Erling Haaland'],
    'D.O.B.': ['1985-02-05', '1987-06-24', '1992-02-05', '1991-06-28'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SC Rating': [90, 91, 92, 93],
}
SC = pd.DataFrame(data_sc)
# Sample data for SB (tracking statistics)
data_sb = {
    'Player ID': [101, 102, 103, 104],
    'Player': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi', 'Neymar', 'Erling Haland'],
    'D.O.B.': ['1985-02-05', '1987-06-23', '1992-02-05', '1991-06-29'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SB Rating': [91, 92, 93, 94],
}
SB = pd.DataFrame(data_sb)
Desired output:
   Player ID             Player      D.O.B.     Competition  SC Rating  SB Rating
0          1  Cristiano Ronaldo  1985-02-05         La Liga         90         91
1          2       Lionel Messi  1987-06-24         La Liga         91         92
2          3         Neymar Jr.  1992-02-05         Ligue 1         92         93
3          4     Erling Haaland  1991-06-28  Premier League         93         94
The two dataframes share the following features: Player ID, Player, D.O.B., Competition.
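I want to merge these dataframes, but they come from different data sources, so their variables follow different formats and conventions. Although both dataframes have a "Player ID" feature with unique numeric values, the IDs differ between the datasets (i.e. the same player has a different ID value in each).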
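The problem is that the formats of these features are inconsistent. For example, the names in Player may differ between the dataframes: if a player has multiple names, SC may use a different variant (and/or spelling) of the player's name than SB. In addition, there are inconsistencies in the players' D.O.B., so the same player can have a different date of birth in SC and in SB.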
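To illustrate the problem, here is a minimal sketch showing what exact merges do on this sample. It assumes reduced copies of the sample frames holding only the Player and D.O.B. columns; the names sc_keys, sb_keys, exact and dob_only are made up for this illustration.

```python
import pandas as pd

# Reduced copies of the sample frames: only the two would-be join keys.
sc_keys = pd.DataFrame({
    'Player': ['Cristiano Ronaldo', 'Leo Messi', 'Neymar Jr.', 'Erling Haaland'],
    'D.O.B.': ['1985-02-05', '1987-06-24', '1992-02-05', '1991-06-28'],
})
sb_keys = pd.DataFrame({
    'Player': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi', 'Neymar', 'Erling Haland'],
    'D.O.B.': ['1985-02-05', '1987-06-23', '1992-02-05', '1991-06-29'],
})

# Exact (inner) merge on both name and date of birth.
exact = pd.merge(sc_keys, sb_keys, on=['Player', 'D.O.B.'])

# Exact merge on date of birth alone: only rows whose dates agree survive.
dob_only = pd.merge(sc_keys, sb_keys, on='D.O.B.', suffixes=('_SC', '_SB'))

print(len(exact))     # number of rows matched on name and date
print(len(dob_only))  # number of rows matched on date only
```

No name is spelled identically in both frames, so the first merge comes back empty, and the date-only merge misses the players whose dates differ by a day.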
How should I approach this merge?
Use fuzzywuzzy to compute a match score for the player names, and allow some tolerance on the D.O.B. as well. Something like this:
import pandas as pd
from fuzzywuzzy import process
from datetime import timedelta
# Sample data for SC (physical statistics)
data_sc = {
    'Player ID': [1, 2, 3, 4],
    'Player': ['Cristiano Ronaldo', 'Leo Messi', 'Neymar Jr.', 'Erling Haaland'],
    'D.O.B.': ['1985-02-05', '1987-06-24', '1992-02-05', '1991-06-28'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SC Rating': [90, 91, 92, 93],
}
SC = pd.DataFrame(data_sc)
# Sample data for SB (tracking statistics)
data_sb = {
    'Player ID': [101, 102, 103, 104],
    'Player': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi', 'Neymar', 'Erling Haland'],
    'D.O.B.': ['1985-02-05', '1987-06-23', '1992-02-05', '1991-06-29'],
    'Competition': ['La Liga', 'La Liga', 'Ligue 1', 'Premier League'],
    'SB Rating': [91, 92, 93, 94],
}
SB = pd.DataFrame(data_sb)
def fuzzy_date_matching_with_score(df1, df2, player_key1, player_key2, date_key1, date_key2, threshold=90, date_tolerance_days=1):
    # Work on copies so the caller's dataframes are not modified in place
    df1 = df1.copy()
    df2 = df2.copy()

    # Fuzzy matching for player names and storing the best match and its score
    matches = df1[player_key1].apply(
        lambda x: process.extractOne(x, df2[player_key2]))  # Use extractOne to get the best match and its score

    # Only keep matches with a score at or above the threshold
    df1['match_name'] = matches.apply(lambda x: x[0] if x[1] >= threshold else None)
    df1['match_score'] = matches.apply(lambda x: x[1] if x[1] >= threshold else None)  # Store the score

    # Prepare for date comparison by ensuring dates are datetime objects
    df1[date_key1] = pd.to_datetime(df1[date_key1])
    df2[date_key2] = pd.to_datetime(df2[date_key2])

    # Expand df2 for merging: one shifted copy per day offset within the tolerance
    df2_expanded = pd.concat([
        df2.assign(**{date_key2: df2[date_key2] + timedelta(days=i)})
        for i in range(-date_tolerance_days, date_tolerance_days + 1)
    ])

    # Merge based on exact match for the (shifted) dates and the fuzzy matched names
    merged = pd.merge(df1, df2_expanded, left_on=[date_key1, 'match_name'], right_on=[date_key2, player_key2])

    # Include only the rows where there is a match name
    return merged[merged['match_name'].notna()]

merged_df = fuzzy_date_matching_with_score(SC, SB, 'Player', 'Player', 'D.O.B.', 'D.O.B.', threshold=70, date_tolerance_days=1)
Output:
print(merged_df.to_string())
   Player ID_x           Player_x      D.O.B.   Competition_x  SC Rating                           match_name  match_score  Player ID_y                             Player_y   Competition_y  SB Rating
0            1  Cristiano Ronaldo  1985-02-05         La Liga         90  Cristiano Ronaldo dos Santos Aveiro           90          101  Cristiano Ronaldo dos Santos Aveiro         La Liga         91
1            2          Leo Messi  1987-06-24         La Liga         91                         Lionel Messi           76          102                         Lionel Messi         La Liga         92
2            3         Neymar Jr.  1992-02-05         Ligue 1         92                               Neymar           90          103                               Neymar         Ligue 1         93
3            4     Erling Haaland  1991-06-28  Premier League         93                        Erling Haland           96          104                        Erling Haland  Premier League         94
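If you want the result in the shape of your desired output, one option is to keep the SC-side columns plus SB Rating and drop the _x suffixes. This is only a rough sketch: it assumes SC's name and date of birth should win (your desired output shows 'Lionel Messi', which would need a per-row choice that is not attempted here). The stand-in merged_df mirrors two rows of the output above, and column_map and tidy are names made up for the example.

```python
import pandas as pd

# Stand-in for two rows of the merged result, using the same column names
# as the output above (the _x/_y suffixes are the pandas merge defaults).
merged_df = pd.DataFrame({
    'Player ID_x': [1, 2],
    'Player_x': ['Cristiano Ronaldo', 'Leo Messi'],
    'D.O.B.': pd.to_datetime(['1985-02-05', '1987-06-24']),
    'Competition_x': ['La Liga', 'La Liga'],
    'SC Rating': [90, 91],
    'match_name': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi'],
    'match_score': [90, 76],
    'Player ID_y': [101, 102],
    'Player_y': ['Cristiano Ronaldo dos Santos Aveiro', 'Lionel Messi'],
    'Competition_y': ['La Liga', 'La Liga'],
    'SB Rating': [91, 92],
})

# Assumption: keep SC's identifiers and add SB's rating; old name -> new name.
column_map = {
    'Player ID_x': 'Player ID',
    'Player_x': 'Player',
    'D.O.B.': 'D.O.B.',
    'Competition_x': 'Competition',
    'SC Rating': 'SC Rating',
    'SB Rating': 'SB Rating',
}

# Select the wanted columns in order, then strip the suffixes by renaming.
tidy = merged_df[list(column_map)].rename(columns=column_map)
print(tidy.to_string())
```

Keeping match_score around a little longer can also be useful, since low scores such as the 76 for 'Leo Messi' are the rows most worth checking by hand.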