How do I reformat half of a table when web scraping with Python?


I am trying to scrape the table from this website into a .csv file using Python 3: 2011-2012 NBA National TV Schedule

The table starts like this:

                Revised Schedule                    Original Schedule

Date            Time      Game                Net   Time      Game                  Net
Sun., 12/25/11  12 PM     BOS (1) at NY (1)   TNT   12 PM     BOS (7) at NY (7)     ESPN
Sun., 12/25/11  2:30 PM   MIA (1) at DAL (1)  ABC   2:30 PM   MIA (8) at DAL (5)    ABC
Sun., 12/25/11  5 PM      CHI (1) at LAL (1)  ABC   5 PM      CHI (6) at LAL (9)    ABC
Sun., 12/25/11  8 PM      ORL (1) at OKC (1)  ESPN  no game   no game               no game
Sun., 12/25/11  10:30 PM  LAC (1) at GS (1)   ESPN  no game   no game               no game
Tue., 12/27/11  8 PM      BOS (2) at MIA (2)  TNT   no game   no game               no game
Tue., 12/27/11  10:30 PM  UTA (1) at LAL (2)  TNT   no game   no game               no game

I am using these packages:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

I imported the data, and the output matches the table above:

pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/")[0]

I am only interested in the Revised Schedule, that is, the first four columns. The output I want in the .csv file looks like this:

Output in .csv File

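I am not sure how to remove the (1), (2), etc. next to the team names in the Game column. I know I should use pd.to_datetime and str.split(). How do I implement the scraper to get this output?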

python pandas web-scraping beautifulsoup screen-scraping
1 Answer

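This only takes a little pandas manipulation. Get the column names that contain the string 'Revised' (using a list comprehension), then select those columns (along with the 'Date' column):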

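Also, you are starting to ask a lot of similar questions with minimal effort put in. At some point you need to apply what you learned from previous questions to "new" problems, and then you will be able to answer them yourself. For example, splitting a column at '@' (already answered here) is the same concept as splitting at 'vs.', which you asked for help with here. I try to help by providing explanations in the code and comments. Being handed solutions that you could essentially work out on your own will hinder your learning.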

The list comprehension:

revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]

is a compact form of this for loop:

revisedCols = ['Date']
for col in df.columns:
    if 'Revised' in col:
        revisedCols.append(col)

Code:

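import pandas as pd

# header=0 uses the table's first row (the "Revised Schedule" / "Original Schedule" row) as column labels
df = pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/", header=0)[0]

# Keep 'Date' plus every column whose label contains 'Revised'
revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]
df = df[revisedCols]

# The first remaining row holds the real headers (Date, Time, Game, Net):
# promote it to the column labels, then drop it from the data
df.columns = df.iloc[0,:]
df = df.iloc[1:,:].reset_index(drop=True)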
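
The header-promotion step can be tried without a network connection. The sketch below rebuilds a tiny stand-in for the scraped table; the column labels and the two-level header layout are assumptions about what `read_html` returns here, and the names `raw` and `toy` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the scraped table (assumed layout, not fetched from the site).
# Row 0 carries the second header line, as in the scraped frame.
raw = pd.DataFrame(
    [
        ["Date", "Time", "Game", "Net", "Time", "Game", "Net"],
        ["Sun., 12/25/11", "12 PM", "BOS (1) at NY (1)", "TNT",
         "12 PM", "BOS (7) at NY (7)", "ESPN"],
        ["Sun., 12/25/11", "2:30 PM", "MIA (1) at DAL (1)", "ABC",
         "2:30 PM", "MIA (8) at DAL (5)", "ABC"],
    ],
    columns=[
        "Date",
        "Revised Schedule", "Revised Schedule.1", "Revised Schedule.2",
        "Original Schedule", "Original Schedule.1", "Original Schedule.2",
    ],
)

# Select 'Date' and the 'Revised' columns; copy so later edits do not touch raw.
revised_cols = ["Date"] + [col for col in raw.columns if "Revised" in col]
toy = raw[revised_cols].copy()

# Promote the first row to column labels, then drop that row from the data.
toy.columns = toy.iloc[0, :].tolist()
toy = toy.iloc[1:, :].reset_index(drop=True)

print(toy)
```

Passing `.tolist()` when assigning the labels is a small design choice: it avoids carrying the row's index label over as the name of the columns index, which is where the stray `0` in the top-left corner of a printed frame comes from.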

Output:

print (df)
0              Date      Time                  Game   Net
0    Sun., 12/25/11     12 PM     BOS (1) at NY (1)   TNT
1    Sun., 12/25/11   2:30 PM    MIA (1) at DAL (1)   ABC
2    Sun., 12/25/11      5 PM    CHI (1) at LAL (1)   ABC
3    Sun., 12/25/11      8 PM    ORL (1) at OKC (1)  ESPN
4    Sun., 12/25/11  10:30 PM     LAC (1) at GS (1)  ESPN
5    Tue., 12/27/11      8 PM    BOS (2) at MIA (2)   TNT
6    Tue., 12/27/11  10:30 PM    UTA (1) at LAL (2)   TNT
7    Thu., 12/29/11      8 PM    DAL (2) at OKC (2)   TNT
8    Thu., 12/29/11  10:30 PM     NY (2) at LAL (3)   TNT
9      Thu., 1/5/12      8 PM    MIA (3) at ATL (1)   TNT
10     Thu., 1/5/12  10:30 PM    LAL (4) at POR (1)   TNT
11     Fri., 1/6/12      8 PM    CHI (2) at ORL (2)  ESPN
12     Fri., 1/6/12  10:30 PM    POR (2) at PHX (1)  ESPN
13    Wed., 1/11/12      8 PM    DAL (3) at BOS (3)  ESPN
14    Wed., 1/11/12  10:30 PM    MIA (4) at LAC (2)  ESPN
15    Thu., 1/12/12      8 PM     NY (3) at MEM (1)   TNT
16    Thu., 1/12/12  10:30 PM     ORL (3) at GS (2)   TNT
17    Fri., 1/13/12      8 PM    CHI (3) at BOS (4)  ESPN
18    Fri., 1/13/12  10:30 PM    MIA (5) at DEN (1)  ESPN
19     Mon, 1/16/12      1 PM    CHI (4) at MEM (2)  ESPN
20     Mon, 1/16/12      8 PM    OKC (3) at BOS (5)   TNT
21     Mon, 1/16/12  10:30 PM    DAL (4) at LAL (5)   TNT
22    Wed., 1/18/12      8 PM    POR (3) at ATL (2)  ESPN
23    Wed., 1/18/12  10:30 PM    DAL (5) at LAC (3)  ESPN
24    Thu., 1/19/12      8 PM    LAL (6) at MIA (6)   TNT
25    Thu., 1/19/12  10:30 PM    DAL (6) at UTA (2)   TNT
26    Fri., 1/20/12      8 PM    LAL (7) at ORL (4)  ESPN
27    Fri., 1/20/12  10:30 PM    MIN (1) at LAC (4)  ESPN
28    Thu., 1/26/12      8 PM    BOS (6) at ORL (5)   TNT
29    Thu., 1/26/12  10:30 PM    MEM (3) at LAC (5)   TNT
..              ...       ...                   ...   ...
105   Tue., 4/10/12      7 PM  BOS (19) at MIA (21)  ESPN
106   Tue., 4/10/12   9:30 PM   NY (17) at CHI (19)  ESPN
107   Wed., 4/11/12      8 PM   ATL (5) at BOS (20)  ESPN
108   Wed., 4/11/12  10:30 PM    GS (6) at POR (10)  ESPN
109   Thu., 4/12/12      8 PM  MIA (22) at CHI (20)   TNT
110   Thu., 4/12/12  10:30 PM    DAL (17) at GS (7)   TNT
111   Fri., 4/13/12      8 PM    MIL (1) at DET (1)  ESPN
112   Fri., 4/13/12  10:30 PM  DAL (18) at POR (11)  ESPN
113   Sat., 4/14/12      9 PM     PHX (7) at SA (5)  ESPN
114   Sun., 4/15/12      1 PM   MIA (23) at NY (18)   ABC
115   Sun., 4/15/12   3:30 PM  DAL (19) at LAL (21)   ABC
116   Tue., 4/17/12      8 PM   BOS (21) at NY (19)   TNT
117   Tue., 4/17/12  10:30 PM    SA (6) at LAL (22)   TNT
118   Wed., 4/18/12      8 PM  ORL (15) at BOS (22)  ESPN
119   Wed., 4/18/12  10:30 PM    LAL (23) at GS (8)  ESPN
120   Thu., 4/19/12      8 PM  CHI (21) at MIA (24)   TNT
121   Thu., 4/19/12  10:30 PM   LAC (13) at PHX (8)   TNT
122   Fri., 4/20/12      7 PM   BOS (23) at ATL (6)  ESPN
123   Fri., 4/20/12   9:30 PM    LAL (24) at SA (7)  ESPN
124   Sat., 4/21/12   5:30 PM    DEN (9) at PHX (9)  ESPN
125   Sat., 4/21/12      8 PM  DAL (20) at CHI (22)  ESPN
126   Sat., 4/21/12  10:30 PM   ORL (16) at UTA (8)  ESPN
127   Sun., 4/22/12      1 PM    NY (20) at ATL (7)  ESPN
128   Sun., 4/22/12   3:30 PM  OKC (17) at LAL (25)   ABC
129   Tue., 4/24/12      8 PM  MIA (25) at BOS (24)   TNT
130   Tue., 4/24/12  10:30 PM      NO (2) at GS (9)   TNT
131   Wed., 4/25/12      8 PM   LAC (14) at NY (21)  ESPN
132   Wed., 4/25/12  10:30 PM    SA (8) at PHX (10)  ESPN
133   Thu., 4/26/12      8 PM    NY (22) at CHA (1)   TNT
134   Thu., 4/26/12  10:30 PM     SA (9) at GS (10)   TNT

[135 rows x 4 columns]
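
The question also asks about stripping the (1), (2) counters and parsing the dates. One possible approach, sketched on a two-row sample copied from the output above, is a regex replace on the Game column followed by `pd.to_datetime` on the date part. The exact target layout of the .csv file is not shown in the question, so the final column format here is an assumption, and `sample` is an illustrative name.

```python
import pandas as pd

# Two rows copied from the output above; note the second weekday has no period.
sample = pd.DataFrame({
    "Date": ["Sun., 12/25/11", "Mon, 1/16/12"],
    "Time": ["12 PM", "1 PM"],
    "Game": ["BOS (1) at NY (1)", "CHI (4) at MEM (2)"],
    "Net": ["TNT", "ESPN"],
})

# Remove parenthesised counters such as " (1)" or " (21)" from the Game column.
sample["Game"] = sample["Game"].str.replace(r"\s*\(\d+\)", "", regex=True)

# Drop the weekday prefix (everything up to ", "), then parse the m/d/yy date.
date_text = sample["Date"].str.split(", ").str[-1]
sample["Date"] = pd.to_datetime(date_text, format="%m/%d/%y")

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = sample.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path of your choosing to `to_csv`, and apply the same two cleaning steps to the full `df` rather than to the sample.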