Question: How do I scrape a dynamically loaded data table in Python?


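Python newbie here. I've been learning how to scrape various baseball sites (Fangraphs, Statcast, Rotowire). I've had success with a few different methods, but the park factors table on Statcast is giving me trouble. I've tried using Selenium, and I've tried saving a local HTML copy of the site on my computer so I can practice without repeatedly sending requests to the Statcast servers. The script below does scrape a table from the URL, but I believe it is the first table on the page, which is just a game score.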

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from io import StringIO

# Set your user agent information
headers = {
    "User-Agent": "FirstName LastName <[email protected]>"
}

# Set the URL of the webpage containing the park factors
url = "https://baseballsavant.mlb.com/leaderboard/statcast-park-factors?type=year&year=2024&batSide=L&stat=index_wOBA&condition=All&rolling="

# Initialize the WebDriver (e.g., for Firefox)
driver = webdriver.Firefox()

# Navigate to the URL
driver.get(url)

# Wait for the table to be loaded (adjust the timeout as needed)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "table")))

# Extract the table data
table = driver.find_element(By.TAG_NAME, "table")
table_html = table.get_attribute("outerHTML")

# Use pandas to read the HTML table
df = pd.read_html(StringIO(table_html))[0]

# Close the WebDriver
driver.quit()

# Display the DataFrame
print(df)

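I want to scrape the larger "Park Factors" table, which lists every team's stadium and includes stats such as "wOBACon" and "BACON". I tried referencing the "table" tag, but it never seems to recognize that table. I tried indexing into the tables, but I get an error saying my index is out of range. I also tried using ID instead of TAG_NAME and referencing "parkFactors", to no avail: it just says it can't find an object with that attribute (it doesn't recognize that the table exists). I tried increasing the implicit wait to give the dynamically loaded table more time to load, without success. I also tried referencing the "article-template" and "table-savant" classes, without success. Any help is much appreciated!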

python pandas class selenium-webdriver
1 Answer

The data is inside a <script> element, so to get it you can use the re/json modules:

import json
import re

import pandas as pd
import requests

url = "https://baseballsavant.mlb.com/leaderboard/statcast-park-factors?type=year&year=2024&%20%20%20%20%20%20%20batSide=L&stat=index_wOBA&condition=All&rolling="

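# Fetch the raw page; the table rows are embedded in a <script> element
response = requests.get(url)
response.raise_for_status()

# Capture the value assigned to the JavaScript variable `data` and parse it as JSON
data = re.search(r"data = (.*);", response.text).group(1)
data = json.loads(data)
df = pd.DataFrame(data)

# Cast index_woba to int so the sort below is numeric
df["index_woba"] = df["index_woba"].astype(int)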

out = df[["venue_name", "index_woba"]].sort_values(
    by=["index_woba", "venue_name"], ascending=[False, True]
)
print(out)

Prints:

                     venue_name  index_woba
10                  Coors Field         113
24             Globe Life Field         105
14     Great American Ball Park         105
4              Kauffman Stadium         105
26               Nationals Park         103
7                 Rogers Centre         102
20                  Truist Park         102
0                 Angel Stadium         101
15                Busch Stadium         101
22           Citizens Bank Park         101
9                 Wrigley Field         101
11               Dodger Stadium         100
17             Minute Maid Park         100
12                     PNC Park         100
27                 Target Field         100
16               loanDepot park         100
8                   Chase Field          99
18                Comerica Park          99
2         Guaranteed Rate Field          99
1   Oriole Park at Camden Yards          99
28               Yankee Stadium          99
13        American Family Field          98
5              Oakland Coliseum          97
6               Tropicana Field          97
25                   Citi Field          96
19                  Oracle Park          96
3             Progressive Field          96
21                   Petco Park          95
23                T-Mobile Park          93
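
If the regex above ever fails (for example because more JavaScript follows on the same line, or the variable is not found at all), a slightly more defensive variant may help. The sketch below is only an illustration: `extract_embedded_json` is a hypothetical helper name, and `sample_html` is a made-up stand-in for the real page, which is assumed to contain an assignment of the form `data = [...];`. It locates the assignment with `re` and lets `json.JSONDecoder.raw_decode` stop at the end of the JSON value rather than relying on the position of the last semicolon.

```python
import json
import re


def extract_embedded_json(html, var_name="data"):
    """Return the JSON value assigned to `var_name` in the page source.

    Hypothetical helper: assumes the page contains `<var_name> = <json>`.
    """
    # Find the start of the assignment, e.g. "data = ".
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found in the page")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (semicolons, further JavaScript, ...).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Made-up page fragment for illustration; the real page is much larger.
sample_html = """
<html><body>
<script>
var data = [{"venue_name": "Coors Field", "index_woba": "113"},
            {"venue_name": "T-Mobile Park", "index_woba": "93"}]; var other = 1;
</script>
</body></html>
"""

rows = extract_embedded_json(sample_html)
for row in rows:
    print(row["venue_name"], int(row["index_woba"]))
```

With the real page, `response.text` would be passed in place of `sample_html`, and the returned list could be handed to `pd.DataFrame` as in the answer.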