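Background: I'm pulling NFL data from a URL year after year, and the header row is repeated multiple times within each table.
Partial solution: I initially tried .drop_duplicates(), but it left me with two "header" rows (the 2023 season isn't finished yet, so the future games have a different header row). I assume that's because it removes the duplicates and, once only two are left, they're considered "unique"?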
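To illustrate the behavior described above, here is a minimal sketch on a made-up frame (the column names and values are invented for illustration and are not taken from the site). By default, drop_duplicates() keeps the first occurrence of each distinct row, so one copy of every distinct header row survives:

```python
import pandas as pd

# Toy data (hypothetical): two game rows, one header row repeated twice,
# and a second, slightly different header row that appears once.
df = pd.DataFrame(
    {
        "Week": ["1", "Week", "2", "Week", "Week"],
        "PtsW": ["21", "PtsW", "30", "PtsW", "Pts"],
    }
)

# Default keep="first": the first copy of each distinct row is retained.
deduped = df.drop_duplicates()

# Both header variants are still present after de-duplication.
remaining_headers = deduped[deduped["Week"] == "Week"]
print(deduped["Week"].tolist())
print(len(remaining_headers))
```

So the rows are not being treated as "unique" by mistake: one representative of each distinct header row is kept by design.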
Ugly solution: I've solved my problem by filtering on a keyword, but I'm sure there's a better way to do this.
The line I used is:
scorings_df = scorings_df[~scorings_df['Week'].str.contains("week", case=False, na=False)]
Note that the Week column can contain both numbers and text, depending on the week.
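As a small sketch of how that keyword filter behaves on a mixed column (the sample values below are made up for illustration): non-string entries produce a missing result from .str.contains(), and na=False turns those into False so the rows are kept.

```python
import pandas as pd

# Hypothetical Week values: an integer, numeric text, a header cell,
# a playoff label, a missing value, and a lower-case header cell.
df = pd.DataFrame({"Week": [1, "2", "Week", "WildCard", None, "week"]})

# True where the cell is a string containing "week" (case-insensitive);
# non-strings and missing values become False because of na=False.
is_header = df["Week"].str.contains("week", case=False, na=False)

kept = df.loc[~is_header, "Week"].tolist()
print(is_header.tolist())
print(kept)
```

Note that this is a substring match, so any legitimate value containing "week" would be dropped as well.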
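The full code is below. Note that it pulls 25 years of data, so if you're going to run it you may want to reduce that to 2 or 3 years, since you don't need all of it.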
import datetime
import pandas as pd

current_year = 2023
# Scoring History
url_scores = 'https://www.pro-football-reference.com/years/'
# Create empty dataframe to store url data
Scoring_df = pd.DataFrame()
# Loop over the required number of years starting this year
for year in range(current_year, current_year - 25, -1):
    # Generate the url string for each year
    year_url_stats = f"{url_scores}{year}/games.htm"
    # Fetch the data from that url
    Scores = pd.read_html(year_url_stats)
    scorings_df = Scores[0]  # Ensure we're taking the first table on the url
    # Remove the Header Row duplicates (.drop_duplicates() not sufficient) & Future games
    scorings_df = scorings_df[~scorings_df['Week'].str.contains("week", case=False, na=False)].copy()
    # Convert the 'Date' column to datetime for filtering
    scorings_df['TempDate'] = pd.to_datetime(scorings_df['Date'], errors='coerce')
    scorings_df = scorings_df[scorings_df['TempDate'] <= datetime.datetime.now()].copy()
    scorings_df.drop(columns=['TempDate'], inplace=True)
    # Add a column for 'Year' so we can filter later
    scorings_df['Year'] = year
    # Clean up Team names (keep only the last word of each name)
    scorings_df['Winner/tie'] = scorings_df['Winner/tie'].str.split().str[-1]
    scorings_df['Loser/tie'] = scorings_df['Loser/tie'].str.split().str[-1]
    # Reorder columns (reduces from 15 to 13)
    columns_order = ['Year', 'Week', 'Day', 'Date', 'Time', 'Winner/tie', 'Loser/tie', 'PtsW', 'PtsL', 'YdsW', 'TOW', 'YdsL', 'TOL']
    scorings_df = scorings_df[columns_order]
    # Append to the dataframe
    Scoring_df = pd.concat([Scoring_df, scorings_df], ignore_index=True)
print(Scoring_df.head())
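You can apply pandas' duplicate handling in a different way: instead of calling drop_duplicates(), convert each row to a tuple and use duplicated() as a mask.
import pandas as pd
import datetime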
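# current_year and url_scores are defined as in the question
scoring_df = pd.DataFrame()
for year in range(current_year, current_year - 25, -1):
    year_url_stats = f"{url_scores}{year}/games.htm"
    scores = pd.read_html(year_url_stats)
    scorings_df = scores[0]
    # Keep the first occurrence of each distinct row; repeated header rows collapse to one copy
    scorings_df = scorings_df[~scorings_df.apply(tuple, axis=1).duplicated()].copy()
    # Rows whose 'Date' cannot be parsed (such as header rows) become NaT and fail the comparison below
    scorings_df['TempDate'] = pd.to_datetime(scorings_df['Date'], errors='coerce')
    scorings_df = scorings_df[scorings_df['TempDate'] <= datetime.datetime.now()].copy()
    scorings_df.drop(columns=['TempDate'], inplace=True)
    scorings_df['Year'] = year
    scorings_df['Winner/tie'] = scorings_df['Winner/tie'].str.split().str[-1]
    scorings_df['Loser/tie'] = scorings_df['Loser/tie'].str.split().str[-1]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    scoring_df = pd.concat([scoring_df, scorings_df], ignore_index=True)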
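One possible variation, offered as a sketch rather than as the approach above: since header rows have no parseable date, pd.to_datetime(..., errors='coerce') turns them into NaT, and NaT never compares as less than or equal to a timestamp, so the date filter on its own may be enough to drop them. The helper name clean_season, the fixed cutoff, and the toy rows are all hypothetical and stand in for the tables returned by pd.read_html. Per-season frames are collected in a list and concatenated once at the end.

```python
import pandas as pd


def clean_season(raw: pd.DataFrame, year: int, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical helper: drop header rows and future games, tag the season."""
    parsed = pd.to_datetime(raw["Date"], errors="coerce")
    # Header rows parse to NaT, and NaT <= cutoff is False, so they are dropped
    # together with games dated after the cutoff.
    played = raw[parsed <= cutoff].copy()
    played["Year"] = year
    return played


# Toy stand-ins for two scraped seasons (values are made up).
raw_2023 = pd.DataFrame(
    {
        "Week": ["1", "Week", "2", "Week", "18"],
        "Date": ["2023-09-07", "Date", "2023-09-14", "Date", "2024-01-07"],
    }
)
raw_2022 = pd.DataFrame(
    {
        "Week": ["1", "Week", "2"],
        "Date": ["2022-09-08", "Date", "2022-09-15"],
    }
)

cutoff = pd.Timestamp("2023-12-01")  # assumed "today" for the example
frames = [
    clean_season(raw_2023, 2023, cutoff),
    clean_season(raw_2022, 2022, cutoff),
]
# Concatenate once instead of growing a DataFrame inside the loop.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

In the real loop, the cutoff would presumably be the current date, as in the code above, and each raw frame would be the first table returned for that season.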