How do I remove characters from a pandas DataFrame when web scraping with Python?

Question (votes: 1, answers: 3)

I am trying to scrape the table from this website into a .csv file using Python 3: 2011-2012 NBA National Schedule

The table starts like this:

                Revised Schedule                    Original Schedule

Date            Time      Game                Net   Time      Game                  Net
Sun., 12/25/11  12 PM     BOS (1) at NY (1)   TNT   12 PM     BOS (7) at NY (7)     ESPN
Sun., 12/25/11  2:30 PM   MIA (1) at DAL (1)  ABC   2:30 PM   MIA (8) at DAL (5)    ABC
Sun., 12/25/11  5 PM      CHI (1) at LAL (1)  ABC   5 PM      CHI (6) at LAL (9)    ABC
Sun., 12/25/11  8 PM      ORL (1) at OKC (1)  ESPN  no game   no game               no game
Sun., 12/25/11  10:30 PM  LAC (1) at GS (1)   ESPN  no game   no game               no game
Tue., 12/27/11  8 PM      BOS (2) at MIA (2)  TNT   no game   no game               no game
Tue., 12/27/11  10:30 PM  UTA (1) at LAL (2)  TNT   no game   no game               no game

I am only interested in the first four columns, the Revised Schedule. The output I want in the .csv file looks like this:

[Image: Output in .csv File]

I am using these packages:

import re
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from itertools import groupby

Here is the code I wrote to match the desired format:

df = pd.read_html("https://www.sportsmediawatch.com/2011/12/revised-2011-12-nba-national-tv-schedule/", header=0)[0]

revisedCols = ['Date'] + [ col for col in df.columns if 'Revised' in col ]
df = df[revisedCols]

df.columns = df.iloc[0,:]

df = df.iloc[1:,:].reset_index(drop=True)


# Format Date to m/d/y
df['Date'] = np.where(df.Date.str.startswith(('10/', '11/', '12/')), df.Date + ' 11', df.Date + ' 12')
df['Date']=pd.to_datetime(df['Date'])
df['Date']=df['Date'].dt.strftime('%m/%d/%Y')

# Split the Game column
df[['Away','Home']] = df.Game.str.split('at',expand=True)   


# Final dataframe with desired columns
df = df[['Date','Time','Away','Home','Net']]

df.columns = ['Date', 'Time', 'Away', 'Home', 'Network']

print(df)

Output:

           Date      Time      Away        Home Network
0    12/25/2011     12 PM   BOS (1)      NY (1)     TNT
1    12/25/2011   2:30 PM   MIA (1)     DAL (1)     ABC
2    12/25/2011      5 PM   CHI (1)     LAL (1)     ABC
3    12/25/2011      8 PM   ORL (1)     OKC (1)    ESPN
4    12/25/2011  10:30 PM   LAC (1)      GS (1)    ESPN
5    12/27/2011      8 PM   BOS (2)     MIA (2)     TNT
6    12/27/2011  10:30 PM   UTA (1)     LAL (2)     TNT

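I noticed that every team name in the Away and Home columns is followed by (1), (2), and so on. How can I change my scraper so that it removes the (1), (2), etc. next to each team name in the Away and Home columns?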

python pandas web-scraping beautifulsoup screen-scraping
3 Answers
1 vote
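You can use str.replace with a regular expression that matches the parentheses and the digits inside them. The values also appear to have some whitespace at the start or end, so chain str.strip afterwards.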
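As a minimal sketch of that idea, the example below runs the replace-then-strip chain on a small hand-built DataFrame instead of the scraped page. The sample rows are assumptions copied from the output shown in the question, and `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped Away/Home columns,
# including the stray spaces left over from splitting on 'at'.
df = pd.DataFrame({
    "Away": ["BOS (1) ", "MIA (1) ", "UTA (1) "],
    "Home": [" NY (1)", " DAL (1)", " LAL (2)"],
})

for col in ["Away", "Home"]:
    df[col] = (
        df[col]
        .str.replace(r"\(\d*\)", "", regex=True)  # drop "(1)", "(2)", ...
        .str.strip()                              # trim leftover spaces
    )

print(df)
```

The pattern escapes both parentheses so they match literally, and `\d*` covers numbers with any count of digits.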


0 votes
You can add this code after splitting the Game column:

str.strip
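The snippet for this answer is not reproduced here, so the following is only a guess at the kind of code that could run right after the split, not the answerer's own. It assumes the number always comes last in the cell, and `drop_parenthetical` is a hypothetical helper name introduced for illustration.

```python
import pandas as pd


def drop_parenthetical(column: pd.Series) -> pd.Series:
    """Keep the text before the first '(' and trim surrounding spaces."""
    return column.str.split("(").str[0].str.strip()


# Hypothetical sample rows in the same shape as the scraped Game column.
df = pd.DataFrame({"Game": ["BOS (1) at NY (1)", "LAC (1) at GS (1)"]})

# Split the Game column, then clean both halves.
df[["Away", "Home"]] = df["Game"].str.split(" at ", expand=True)
df["Away"] = drop_parenthetical(df["Away"])
df["Home"] = drop_parenthetical(df["Home"])

print(df[["Away", "Home"]])
```

Splitting on `" at "` with the surrounding spaces, rather than on a bare `'at'`, also avoids cutting a value that merely contains those two letters.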


0 votes
df['Away'] = df['Away'].str.replace(r'\(\d*\)', '', regex=True).str.strip()
df['Home'] = df['Home'].str.replace(r'\(\d*\)', '', regex=True).str.strip()
print(df.head())

         Date      Time Away Home Network
0  12/25/2011     12 PM  BOS   NY     TNT
1  12/25/2011   2:30 PM  MIA  DAL     ABC
2  12/25/2011      5 PM  CHI  LAL     ABC
3  12/25/2011      8 PM  ORL  OKC    ESPN
4  12/25/2011  10:30 PM  LAC   GS    ESPN
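Since the question's goal is a .csv file, the cleaned frame still has to be written out. Here is a minimal sketch, assuming a small hand-built frame in the cleaned shape; it writes to an in-memory buffer so it runs anywhere, and passing a file path such as the hypothetical `"schedule.csv"` to `to_csv` instead would write to disk.

```python
import io

import pandas as pd

# Hypothetical rows in the cleaned shape produced by the steps above.
df = pd.DataFrame({
    "Date": ["12/25/2011", "12/25/2011"],
    "Time": ["12 PM", "2:30 PM"],
    "Away": ["BOS", "MIA"],
    "Home": ["NY", "DAL"],
    "Network": ["TNT", "ABC"],
})

# index=False leaves out the 0, 1, 2, ... row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```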