I want to use Python 3 to convert a table on this website into a .csv file: the 2013-14 NBA national TV schedule.
The table starts like this:
Game/Time Network Matchup
Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
I am using these packages:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby
I import the data with:
pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
Sample output:
0 1 2
0 Game/Time Network Matchup
1 Oct. 29, 8 p.m. ET TNT Chicago vs. Miami
2 Oct. 29, 10:30 p.m. ET TNT LA Clippers vs. LA Lakers
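The output I want in the csv file looks like this:
I don't know how to split Game/Time into separate columns. Note that the date should be formatted like 102913. I also don't know how to split Matchup into two columns, away (the first team) and home (the second team). I know pd.to_datetime and str.split() should be used. How can I implement the scraper to get this output?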
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')
This line should help you format the date the way you want.
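The question asks for dates like 102913 rather than 10/29/2013. As a minimal sketch, assuming the column already holds parsed datetimes, the format string '%m%d%y' should produce that compact form. The two sample dates below are illustrative and not read from the page.

```python
import pandas as pd

# Two sample dates standing in for the parsed 'Date' column (illustrative data).
dates = pd.Series(pd.to_datetime(["2013-10-29", "2014-04-16"]))

# %m = zero-padded month, %d = zero-padded day, %y = two-digit year.
compact = dates.dt.strftime("%m%d%y")
print(compact.tolist())  # ['102913', '041614']
```

Swapping '%m/%d/%Y' for '%m%d%y' in the line above is the only change needed if the compact form is what you want.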
import pandas as pd
import numpy as np
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule", header=0)[0]
# split "Oct. 29, 8 p.m. ET" into the date part and the time part
df['Date'] = df['Game/Time'].str.extract(r'(.*),', expand=False)
df['Time'] = df['Game/Time'].str.extract(r', (.*) ET', expand=False)
df['Time'] = df['Time'].str.replace('p.m.', 'PM', regex=False)
# the season spans two years: Oct-Dec games are in 2013, the rest in 2014
df['Date'] = np.where(df['Date'].str[:3].isin(['Oct', 'Nov', 'Dec']), df['Date'] + ' 2013', df['Date'] + ' 2014')
df['Date'] = pd.to_datetime(df['Date'], format='%b. %d %Y')
df['Date'] = df['Date'].dt.strftime('%m/%d/%Y')
# the first team listed is the away team, the second is the home team
df['Away'] = df['Matchup'].str.extract(r'(.*) vs\.', expand=False)
df['Home'] = df['Matchup'].str.extract(r'vs\. (.*)', expand=False)
df = df.drop(columns=['Game/Time', 'Matchup'])
print(df)
Network Date Time Away Home
0 TNT 10/29/2013 8 PM Chicago Miami
1 TNT 10/29/2013 10:30 PM LA Clippers LA Lakers
2 TNT 10/31/2013 8 PM New York Chicago
3 TNT 10/31/2013 10:30 PM Golden State LA Clippers
4 ESPN 11/01/2013 8 PM Miami Brooklyn
I hope this is what you are looking for.
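Since the goal is a .csv file, the last step would be DataFrame.to_csv. Here is a minimal sketch on a hand-typed two-row frame (the rows and the column order are assumptions for illustration, not scraped output). It writes to an in-memory buffer so the result can be inspected without touching the disk.

```python
import io

import pandas as pd

# Hand-typed stand-in for the cleaned schedule (illustrative rows only).
schedule = pd.DataFrame({
    "Date": ["102913", "102913"],
    "Time": ["8 PM", "10:30 PM"],
    "Away": ["Chicago", "LA Clippers"],
    "Home": ["Miami", "LA Lakers"],
    "Network": ["TNT", "TNT"],
})

# index=False leaves the 0, 1, 2, ... row labels out of the file.
buffer = io.StringIO()
schedule.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a path instead of the buffer, for example schedule.to_csv("nba_schedule.csv", index=False), where the file name is only a placeholder.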
Here is my take.
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set the correct column names
df = df.T.set_index([0]).T
# separate date and time
datetime = df['Game/Time'].str.extract(r'(?P<Date>.*), (?P<Time>.*) ET$')
# extract Away (first team) and Home (second team)
home_away = df['Matchup'].str.extract(r'^(?P<Away>.*) vs\. (?P<Home>.*)$')
# join the data
final_df = pd.concat([datetime, home_away, df[['Network']]], axis=1)
Output:
Date Time Away Home Network
1 Oct. 29 8 p.m. Chicago Miami TNT
2 Oct. 29 10:30 p.m. LA Clippers LA Lakers TNT
3 Oct. 31 8 p.m. New York Chicago TNT
4 Oct. 31 10:30 p.m. Golden State LA Clippers TNT
5 Nov. 1 8 p.m. Miami Brooklyn ESPN
.. ... ... ... ... ...
141 Apr. 13 1 p.m. Chicago New York ABC
142 Apr. 15 8 p.m. New York Brooklyn TNT
143 Apr. 15 10:30 p.m. Denver LA Clippers TNT
144 Apr. 16 8 p.m. Atlanta Milwaukee ESPN
145 Apr. 16 10:30 p.m. Golden State Denver ESPN
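The Date column above still reads "Oct. 29", with no year. One possible way to get from there to the 102913 style asked for in the question is sketched below. It assumes that October to December games belong to 2013 and every other month to 2014, and it uses a few hand-typed values instead of the scraped column.

```python
import pandas as pd

# Sample values in the same shape as the extracted Date column (illustrative).
dates = pd.Series(["Oct. 29", "Nov. 1", "Apr. 16"])

# Assumption: the season starts in October 2013 and ends in April 2014.
in_2013 = dates.str[:3].isin(["Oct", "Nov", "Dec"])
year = in_2013.map({True: " 2013", False: " 2014"})

# Parse "Oct. 29 2013" explicitly, then format as MMDDYY.
parsed = pd.to_datetime(dates + year, format="%b. %d %Y")
formatted = parsed.dt.strftime("%m%d%y")
print(formatted.tolist())  # ['102913', '110113', '041614']
```

The explicit format string will raise on a month written differently (for example a fully spelled-out name), so it doubles as a check on the input.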
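You can use regex to split your columns. Your time values come in different formats, so we can handle them by parsing with specific formats and coercing errors to NaT values.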
df = pd.read_html("https://www.sbnation.com/2013/8/6/4595688/2013-14-nba-national-tv-schedule")[0]
# set column
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
# set date and time column; Oct-Dec games are in 2013, the rest of the season in 2014.
parts = df['Game/Time'].str.split(',', expand=True)
year = parts[0].str[:3].isin(['Oct', 'Nov', 'Dec']).map({True: ' 2013', False: ' 2014'})
df['date'] = pd.to_datetime(parts[0] + year, format='%b. %d %Y')
# time column has different formats, let's handle those.
# %I (12-hour clock) is needed for %p to take effect; with %H the AM/PM marker is ignored.
clean = parts[1].str.strip('ET').str.replace('.', '', regex=False).str.strip()
s = pd.to_datetime(clean, format='%I %p', errors='coerce')
s = s.fillna(pd.to_datetime(clean, format='%I:%M %p', errors='coerce'))
df['time'] = s.dt.time
# away (first team) and home (second team) columns.
teams = df['Matchup'].str.extract(r'(.*) vs\. (.*)')
df['away'] = teams[0].str.strip()
df['home'] = teams[1].str.strip()
# slice dataframe.
df2 = df[['date', 'time', 'away', 'home', 'Network']]
print(df2)
0 date time away home Network
0 2013-10-29 20:00:00 Chicago Miami TNT
1 2013-10-29 22:30:00 LA Clippers LA Lakers TNT
2 2013-10-31 20:00:00 New York Chicago TNT
3 2013-10-31 22:30:00 Golden State LA Clippers TNT
4 2013-11-01 20:00:00 Miami Brooklyn ESPN
.. ... ... ... ... ...
140 2014-04-13 13:00:00 Chicago New York ABC
141 2014-04-15 20:00:00 New York Brooklyn TNT
142 2014-04-15 22:30:00 Denver LA Clippers TNT
143 2014-04-16 20:00:00 Atlanta Milwaukee ESPN
144 2014-04-16 22:30:00 Golden State Denver ESPN
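The NaT fallback described in this answer can be tried without fetching the page. The sketch below is only an illustration on hand-typed strings: it parses with an hour-only format first, then fills the failures from an hour-and-minute format. The "tbd" entry is a made-up value added to show what happens when neither format matches.

```python
import pandas as pd

# Cleaned time strings in both shapes, plus one deliberately unparseable value.
times = pd.Series(["8 pm", "10:30 pm", "1 pm", "tbd"])

# Each pass turns strings that do not match its format into NaT.
hour_only = pd.to_datetime(times, format="%I %p", errors="coerce")
hour_minute = pd.to_datetime(times, format="%I:%M %p", errors="coerce")

# Keep the hour-only result where it parsed, otherwise use the second pass.
combined = hour_only.fillna(hour_minute)
print(combined.dt.time.tolist())
```

Anything still NaT after both passes is a value neither format understands, which makes combined.isna() a handy way to spot rows that need a closer look.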