Scraping a table with a row limit


There is a website whose data is pulled from an API. Each page can show at most 100 rows, and if you inspect the API URL for page 1, 2, 3, and so on, it changes every time. So far I have been running the same script over and over, just switching the URL each time, but I also have to save each run to a different Excel file, otherwise the previous data gets overwritten.

I would like a single script that pulls all the information from that table and puts it into one Excel file, on the same sheet, without overwriting any values.

The page I am working from is http://www.nhl.com/stats/teams?aggregate=0&report=daysbetweengames&reportType=game&dateFrom=2021-10-12&dateTo=2021-11-30&gameType=2&filter=gamesPlayed,gte,1&sort=a_teamFullName,daysRest&page=0&pageSize=50, but keep in mind that all the information on that page is pulled from an API.

Here is the code I am using:

import requests
import json
import pandas as pd
url = ('https://api.nhle.com/stats/rest/en/team/daysbetweengames?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22teamFullName%22,%22direction%22:%22ASC%22%7D,%7B%22property%22:%22daysRest%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=500&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameDate%3C=%222021-11-30%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-10-12%22%20and%20gameTypeId=2')

resp = requests.get(url).text
resp = json.loads(resp)
df = pd.DataFrame(resp['data'])
df.to_excel('Master File.xlsx', sheet_name = 'Info')

Any help would be much appreciated.

python pandas web-scraping limit
1 Answer

The url contains a start=... parameter, so you can use a for loop that replaces this value with 0, 100, 200, etc., run the request for each page, and collect all of the pages into a single DataFrame (the code below gathers them in a list and concatenates with pd.concat, since DataFrame.append() was removed in pandas 2.0).
It is simpler if you put all the parameters from the url (everything after the ? character) into a dictionary and pass it as get(url, params=...).
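For example (a quick check of my own, not part of the original answer), you can preview the final URL that requests builds from such a dictionary:

import requests

url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames'
payload = {'isGame': 'true', 'start': 0, 'limit': 100}

# requests URL-encodes the dictionary into the query string for you
print(requests.Request('GET', url, params=payload).prepare().url)
# -> https://api.nhle.com/stats/rest/en/team/daysbetweengames?isGame=true&start=0&limit=100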

requests also provides response.json(), so there is no need for json.loads(response.text).

import requests
import pandas as pd

# --- before loop ---

url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames'

payload = {
    'isAggregate': 'false',
    'isGame': 'true',
    'start': 0,
    'limit': 100,
    'sort': '[{"property":"teamFullName","direction":"ASC"},{"property":"daysRest","direction":"DESC"},{"property":"teamId","direction":"ASC"}]',
    'factCayenneExp': 'gamesPlayed>=1',
    'cayenneExp': 'gameDate<="2021-11-30 23:59:59" and gameDate>="2021-10-12" and gameTypeId=2',
}

frames = []  # collect each page here; concatenate once after the loop

# --- loop ---

for start in range(0, 1000, 100):  # pages of 100 rows, up to 1000 rows total
    print('start:', start)
    
    payload['start'] = start
    
    response = requests.get(url, params=payload)

    data = response.json()

    frames.append(pd.DataFrame(data['data']))
    
# --- after loop ---

df = pd.concat(frames, ignore_index=True)  # DataFrame.append() was removed in pandas 2.0

print(df)

df.to_excel('Master File.xlsx', sheet_name='Info')
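The range(0, 1000, 100) above assumes the table has at most 1000 rows. A variant (my sketch, reusing the url and payload defined above) could instead stop as soon as the API returns a short or empty page:

frames = []
start = 0

while True:
    payload['start'] = start
    rows = requests.get(url, params=payload).json()['data']

    if not rows:  # empty page - we are past the last row
        break

    frames.append(pd.DataFrame(rows))

    if len(rows) < payload['limit']:  # short page - this was the last one
        break

    start += payload['limit']

df = pd.concat(frames, ignore_index=True)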

Result:

     daysRest  faceoffWinPct    gameDate  ...  ties  timesShorthandedPerGame  wins
0           4        0.47169  2021-10-13  ...  None                      5.0     1
1           3        0.50847  2021-11-22  ...  None                      4.0     0
2           2        0.45762  2021-10-26  ...  None                      1.0     0
3           2        0.56666  2021-11-05  ...  None                      2.0     1
4           2        0.54716  2021-11-14  ...  None                      1.0     1
..        ...            ...         ...  ...   ...                      ...   ...
675         1        0.37209  2021-10-28  ...  None                      2.0     1
676         1        0.48000  2021-10-21  ...  None                      3.0     1
677         0        0.57692  2021-11-06  ...  None                      1.0     0
678         0        0.32727  2021-11-19  ...  None                      3.0     0
679         0        0.47169  2021-11-27  ...  None                      4.0     1
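As a side note (not part of the original answer): if you ever need to add a sheet to an existing workbook instead of rebuilding the whole file, pandas' ExcelWriter supports append mode via openpyxl:

import pandas as pd

# adds/replaces the 'Info' sheet inside an existing 'Master File.xlsx'
with pd.ExcelWriter('Master File.xlsx', mode='a', engine='openpyxl', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name='Info', index=False)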