Removing duplicate rows that match the header row


Background: I pull NFL data from a URL year by year, and the header row is repeated many times within each table.

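Partial solution: I first tried `.drop_duplicates()`, but it left me with two "header" rows (the 2023 season isn't over yet, so future games have a different header row). I assume that's because it removes the duplicates and, once only two are left, they count as "unique"?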
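
One possible explanation for the leftover rows, shown as a minimal sketch on made-up data (the toy frame and its values are assumptions, not the real table): by default `drop_duplicates()` uses `keep='first'`, so one copy of each repeated row is retained, while `keep=False` drops every copy.

```python
import pandas as pd

# Toy stand-in for a scraped table: the header row repeats inside the body.
toy = pd.DataFrame(
    {
        "Week": ["1", "Week", "2", "Week"],
        "Winner/tie": ["Detroit Lions", "Winner/tie", "Dallas Cowboys", "Winner/tie"],
    }
)

# Default keep='first': one copy of the repeated header row survives.
kept_first = toy.drop_duplicates()

# keep=False: every row that has a duplicate is dropped, including the first copy.
kept_none = toy.drop_duplicates(keep=False)
```

Note that `keep=False` only helps when a header row appears at least twice in the same frame, and it would also drop any genuinely identical data rows.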

Ugly solution: I solved my problem with a keyword match, but I'm sure there is a better way to do this.

The line I used is:

    scorings_df = scorings_df[~scorings_df['Week'].str.contains("week", case=False, na=False)]

Note that the `Week` column can contain both numbers and text, depending on the week.
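
Because the header text repeats verbatim, an exact comparison against the column name may be a tighter filter than a substring search. The following is a minimal sketch on made-up data; the helper name `drop_header_rows` and the sample values are assumptions for illustration.

```python
import pandas as pd


def drop_header_rows(df: pd.DataFrame, column: str = "Week") -> pd.DataFrame:
    """Return a copy of df without rows where `column` repeats its own header text."""
    # astype(str) lets the comparison work when the column mixes numbers and text.
    is_header = df[column].astype(str).str.strip().eq(column)
    return df.loc[~is_header].copy()


# Hypothetical sample: numeric weeks, a playoff label, and a repeated header row.
sample = pd.DataFrame(
    {
        "Week": [1, "Week", 18, "WildCard"],
        "PtsW": [21, "PtsW", 30, 31],
    }
)
cleaned = drop_header_rows(sample)
```

Unlike the keyword match, this only removes rows whose cell equals the header exactly, so a legitimate value that merely contains the word "week" would be kept.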

The full code is below. Note that it pulls 25 years of data, so if you run it you may want to reduce that to 2 or 3 years, since you won't need all of it.

    import datetime

    import pandas as pd

    current_year = 2023

    # Scoring history
    url_scores = 'https://www.pro-football-reference.com/years/'

    # Create an empty dataframe to store the url data
    Scoring_df = pd.DataFrame()

    # Loop over the required number of years, starting with this year
    for year in range(current_year, current_year - 25, -1):
        # Generate the url string for each year
        year_url_stats = f"{url_scores}{year}/games.htm"

        # Fetch the data from that url
        Scores = pd.read_html(year_url_stats)

        scorings_df = Scores[0]  # Ensure we're taking the first table on the url
        # Remove the header row duplicates (.drop_duplicates() not sufficient);
        # .copy() avoids SettingWithCopyWarning on the assignments below
        scorings_df = scorings_df[~scorings_df['Week'].str.contains("week", case=False, na=False)].copy()
        # Convert the 'Date' column to datetime and drop future games
        scorings_df['TempDate'] = pd.to_datetime(scorings_df['Date'], errors='coerce')
        scorings_df = scorings_df[scorings_df['TempDate'] <= datetime.datetime.now()].copy()
        scorings_df.drop(columns=['TempDate'], inplace=True)

        # Add a column for 'Year' so we can filter later
        scorings_df['Year'] = year

        # Clean up team names (keep only the last word)
        scorings_df['Winner/tie'] = scorings_df['Winner/tie'].str.split().str[-1]
        scorings_df['Loser/tie'] = scorings_df['Loser/tie'].str.split().str[-1]

        # Reorder columns (reduces from 15 to 13)
        columns_order = ['Year', 'Week', 'Day', 'Date', 'Time', 'Winner/tie', 'Loser/tie', 'PtsW', 'PtsL', 'YdsW', 'TOW', 'YdsL', 'TOL']
        scorings_df = scorings_df[columns_order]

        # Append to the dataframe
        Scoring_df = pd.concat([Scoring_df, scorings_df], ignore_index=True)

    print(Scoring_df.head())

python duplicates drop-duplicates
1 Answer

You can use pandas' duplicate detection in a different way, with `duplicated()` in place of `drop_duplicates()`:

    import datetime

    import pandas as pd

    # current_year and url_scores are defined as in the question
    scoring_df = pd.DataFrame()

    for year in range(current_year, current_year - 25, -1):
        year_url_stats = f"{url_scores}{year}/games.htm"
        scores = pd.read_html(year_url_stats)
        scorings_df = scores[0]
        # keep=False flags every copy of a repeated row; the default
        # keep='first' would leave one header row behind
        scorings_df = scorings_df[~scorings_df.apply(tuple, axis=1).duplicated(keep=False)].copy()
        scorings_df['TempDate'] = pd.to_datetime(scorings_df['Date'], errors='coerce')
        scorings_df = scorings_df[scorings_df['TempDate'] <= datetime.datetime.now()].copy()
        scorings_df.drop(columns=['TempDate'], inplace=True)
        scorings_df['Year'] = year
        scorings_df['Winner/tie'] = scorings_df['Winner/tie'].str.split().str[-1]
        scorings_df['Loser/tie'] = scorings_df['Loser/tie'].str.split().str[-1]
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        scoring_df = pd.concat([scoring_df, scorings_df], ignore_index=True)

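
A side effect of the date filter used above, sketched on made-up data: with `errors='coerce'`, text that cannot be parsed (such as a repeated header cell reading `Date`) becomes `NaT`, and `NaT <= cutoff` evaluates to `False`, so those rows are dropped along with future games. The sample frame, the fixed `cutoff`, and the explicit `format` are assumptions; the real page may format its dates differently.

```python
import pandas as pd

# Hypothetical sample: a played game, a repeated header row, and a future game.
games = pd.DataFrame(
    {
        "Week": ["1", "Week", "18"],
        "Date": ["2023-09-07", "Date", "2100-01-05"],
    }
)

# A fixed cutoff keeps the example reproducible; the loop above uses the current time.
cutoff = pd.Timestamp("2024-01-01")

# Unparseable strings are coerced to NaT instead of raising an error.
parsed = pd.to_datetime(games["Date"], format="%Y-%m-%d", errors="coerce")

# Comparisons against NaT are False, so header rows fall out together with future dates.
played = games[parsed <= cutoff].copy()
```

If this holds for the real table, the date filter alone may already remove the repeated header rows; it is worth checking against the actual data before relying on it.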