Scraping table data that spans multiple pages from multiple URLs (Python and BeautifulSoup)


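New coder here! I'm trying to scrape web table data from multiple URLs. Each URL has one table, but that table is split across several pages. My code only iterates through the table pages of the first URL and never moves on to the others. So I only get pages 1-5 of the 2000 NBA data, and it stops there. How can I get my code to pull the data for every year? Any help is greatly appreciated.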

page = 1
year = 2000

while page < 20 and year < 2020:
    base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year,page)
    response = requests.get(base_URL, headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        sal_table = soup.find_all('table', class_ = 'tablehead')
        if len(sal_table) < 2:
            sal_table = sal_table[0]
            with open ('NBA_Salary_2000_2019.txt', 'a') as r:
                for row in sal_table.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(30))
                    r.write('\n')
            page+=1
        else:
            print("too many tables")
    else:
        year +=1
        page = 1
1 Answer

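I'd consider using pandas here because 1) its .read_html() function (which uses lxml or BeautifulSoup under the hood) makes parsing <table> tags easier, and 2) it can write the result straight to a file.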
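To see what .read_html() gives you without hitting the site, here is a minimal offline sketch. The HTML string is a made-up stand-in for one ESPN page, and it assumes the header is an ordinary <td> row that is repeated inside the table. `clean_salary_table` is a hypothetical helper name, and pandas needs lxml (or BeautifulSoup plus html5lib) installed for read_html to work.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for one salary page: the header row is a plain
# <tr> of <td> cells and shows up again part-way through the table.
SAMPLE_HTML = """
<table class="tablehead">
  <tr><td>RK</td><td>NAME</td><td>TEAM</td><td>SALARY</td></tr>
  <tr><td>1</td><td>Shaquille O'Neal, C</td><td>Los Angeles Lakers</td><td>$17,142,000</td></tr>
  <tr><td>RK</td><td>NAME</td><td>TEAM</td><td>SALARY</td></tr>
  <tr><td>2</td><td>Kevin Garnett, PF</td><td>Minnesota Timberwolves</td><td>$16,806,000</td></tr>
</table>
"""

# Hypothetical stand-in for a page past the end of a season: header row only.
BLANK_HTML = """
<table class="tablehead">
  <tr><td>RK</td><td>NAME</td><td>TEAM</td><td>SALARY</td></tr>
</table>
"""


def clean_salary_table(html):
    """Parse the first <table>, promote row 0 to column names, drop header rows."""
    raw = pd.read_html(StringIO(html))[0]
    raw.columns = list(raw.iloc[0, :])
    return raw[raw['RK'] != 'RK'].reset_index(drop=True)


table = clean_salary_table(SAMPLE_HTML)
blank = clean_salary_table(BLANK_HTML)

print(table.to_string())
print('rows on blank page:', len(blank))
```

A page that contains only the header row comes back as an empty DataFrame, which is a convenient signal that a season has run out of pages.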

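Also, looping through 20 pages for every season is wasteful (for example, the first season you're after only has 4 pages; the rest are blank). So I'd add a check that keeps requesting pages until it reaches a blank table, then moves on to the next season.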

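from io import StringIO

import pandas as pd
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

frames = []  # one DataFrame per scraped page
year = 2000

while year < 2020:
    page = 1
    while True:
        base_URL = 'http://www.espn.com/nba/salaries/_/year/{}/page/{}'.format(year, page)
        # headers must be passed by keyword; the second positional argument of requests.get is params
        response = requests.get(base_URL, headers=headers)
        if response.status_code != 200:
            break  # give up on this season instead of looping forever

        # Parse the HTML that was already downloaded rather than fetching the URL a second time
        temp_df = pd.read_html(StringIO(response.text))[0]
        temp_df.columns = list(temp_df.iloc[0, :])       # first row holds the column names
        temp_df = temp_df[temp_df['RK'] != 'RK'].copy()  # drop the repeated header rows

        if len(temp_df) == 0:
            break  # blank table: no more pages for this season

        print('Acquiring Season: %s\tPage: %s' % (year, page))

        temp_df['Season'] = '%s-%s' % (year - 1, year)
        frames.append(temp_df)
        page += 1

    year += 1

# DataFrame.append was removed in pandas 2.0, so combine the pages with concat
results = pd.concat(frames, sort=False).reset_index(drop=True)
results.to_csv('c:/test/NBA_Salary_2000_2019.csv', index=False)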

Output:

print (results.head(25).to_string())
    RK                     NAME                    TEAM       SALARY     Season
0    1      Shaquille O'Neal, C      Los Angeles Lakers  $17,142,000  1999-2000
1    2        Kevin Garnett, PF  Minnesota Timberwolves  $16,806,000  1999-2000
2    3       Alonzo Mourning, C              Miami Heat  $15,004,000  1999-2000
3    4         Juwan Howard, PF      Washington Wizards  $15,000,000  1999-2000
4    5       Scottie Pippen, SF  Portland Trail Blazers  $14,795,000  1999-2000
5    6          Karl Malone, PF               Utah Jazz  $14,000,000  1999-2000
6    7         Larry Johnson, F         New York Knicks  $11,910,000  1999-2000
7    8          Gary Payton, PG     Seattle SuperSonics  $11,020,000  1999-2000
8    9      Rasheed Wallace, PF  Portland Trail Blazers  $10,800,000  1999-2000
9   10            Shawn Kemp, C     Cleveland Cavaliers  $10,780,000  1999-2000
10  11     Damon Stoudamire, PG  Portland Trail Blazers  $10,125,000  1999-2000
11  12      Antonio McDyess, PF          Denver Nuggets   $9,900,000  1999-2000
12  13       Antoine Walker, PF          Boston Celtics   $9,000,000  1999-2000
13  14  Shareef Abdur-Rahim, PF     Vancouver Grizzlies   $9,000,000  1999-2000
14  15        Allen Iverson, SG      Philadelphia 76ers   $9,000,000  1999-2000
15  16            Vin Baker, PF     Seattle SuperSonics   $9,000,000  1999-2000
16  17            Ray Allen, SG         Milwaukee Bucks   $9,000,000  1999-2000
17  18    Anfernee Hardaway, SF            Phoenix Suns   $9,000,000  1999-2000
18  19          Kobe Bryant, SF      Los Angeles Lakers   $9,000,000  1999-2000
19  20      Stephon Marbury, PG         New Jersey Nets   $9,000,000  1999-2000
20  21           Vlade Divac, C        Sacramento Kings   $8,837,000  1999-2000
21  22         Bryant Reeves, C     Vancouver Grizzlies   $8,666,000  1999-2000
22  23        Tom Gugliotta, PF            Phoenix Suns   $8,558,000  1999-2000
23  24        Nick Van Exel, PG          Denver Nuggets   $8,354,000  1999-2000
24  25        Elden Campbell, C       Charlotte Hornets   $7,975,000  1999-2000
...
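The "keep going until a blank page, then move to the next season" idea does not depend on pandas. Here is a rough sketch of just that control flow, which is the part the question's single `while` loop was missing. `iter_season_rows`, `fetch_rows`, `max_pages`, and the fake in-memory site are all hypothetical names made up for illustration; in real use `fetch_rows(year, page)` would download and parse one page and return its data rows.

```python
def iter_season_rows(fetch_rows, first_year, last_year, max_pages=50):
    """Yield (year, page, row) for every season, stopping each season
    at its first blank page. max_pages is only a safety cap."""
    for year in range(first_year, last_year + 1):
        for page in range(1, max_pages + 1):
            rows = fetch_rows(year, page)
            if not rows:
                break  # blank page: this season is done, move on to the next
            for row in rows:
                yield year, page, row


# Hypothetical in-memory stand-in for the site: season -> number of non-blank pages
FAKE_SITE = {2000: 2, 2001: 3}


def fake_fetch_rows(year, page):
    """Return one dummy row for pages that exist, and an empty list otherwise."""
    if page <= FAKE_SITE.get(year, 0):
        return ['player-%d-%d' % (year, page)]
    return []


seen = sorted({(year, page)
               for year, page, _ in iter_season_rows(fake_fetch_rows, 2000, 2001)})
print(seen)
```

Putting the page loop inside the season loop means the page counter restarts at 1 for every year, and a blank page ends only the current season rather than the whole run.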