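Python newbie here. I've been learning how to scrape data from various baseball sites (Fangraphs, Statcast, Rotowire). I've had success with a few different methods, but the park factors table on Statcast is giving me trouble. I've tried Selenium, and I've tried saving a local HTML copy of the page on my computer so I can practice without repeatedly sending requests to the Statcast server. The script below does scrape a table from the URL, but I think it's the first table on the page, which is just a game score.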
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from io import StringIO
# Set your user agent information
headers = {
"User-Agent": "FirstName LastName <[email protected]>"
}
# Set the URL of the webpage containing the park factors
url = "https://baseballsavant.mlb.com/leaderboard/statcast-park-factors?type=year&year=2024&batSide=L&stat=index_wOBA&condition=All&rolling="
# Initialize the WebDriver (e.g., for Firefox)
driver = webdriver.Firefox()
# Navigate to the URL
driver.get(url)
# Wait for the table to be loaded (adjust the timeout as needed)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "table")))
# Extract the table data
table = driver.find_element(By.TAG_NAME, "table")
table_html = table.get_attribute("outerHTML")
# Use pandas to read the HTML table
df = pd.read_html(StringIO(table_html))[0]
# Close the WebDriver
driver.quit()
# Display the DataFrame
print(df)
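I'd like to scrape the larger "Park Factors" table, which lists every team's stadium and includes stats such as "wOBACon" and "BACON". I tried doing this by referencing the "table" tag, but it never seems to recognize that table. I tried indexing into the tables, but that raised an error saying my index was out of range. I also tried using ID instead of TAG_NAME and referencing "parkFactors", to no avail: it just says it can't find an object with that attribute (it doesn't recognize that the table exists). I tried to fix this by increasing the implicit wait so the dynamically loaded table had more time to load, without success. I also tried referencing the "article-template" class and the "table-savant" tag, again without success. Any help is much appreciated!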
The data is stored inside a <script> element, so to get it you can use the re/json modules:
import json
import re
import pandas as pd
import requests
# Leaderboard URL from the question
url = "https://baseballsavant.mlb.com/leaderboard/statcast-park-factors?type=year&year=2024&batSide=L&stat=index_wOBA&condition=All&rolling="
response = requests.get(url)
# The rows are embedded in the page source as a JSON literal assigned to `data` in a <script> tag
data = re.search(r"data = (.*);", response.text).group(1)
data = json.loads(data)
df = pd.DataFrame(data)
# print(df)  # uncomment to inspect every available column
df["index_woba"] = df["index_woba"].astype(int)
out = df[["venue_name", "index_woba"]].sort_values(
by=["index_woba", "venue_name"], ascending=[False, True]
)
print(out)
Prints:
                     venue_name  index_woba
10                  Coors Field         113
24             Globe Life Field         105
14     Great American Ball Park         105
 4             Kauffman Stadium         105
26               Nationals Park         103
 7                Rogers Centre         102
20                  Truist Park         102
 0                Angel Stadium         101
15                Busch Stadium         101
22           Citizens Bank Park         101
 9                Wrigley Field         101
11               Dodger Stadium         100
17             Minute Maid Park         100
12                     PNC Park         100
27                 Target Field         100
16               loanDepot park         100
 8                  Chase Field          99
18                Comerica Park          99
 2        Guaranteed Rate Field          99
 1  Oriole Park at Camden Yards          99
28               Yankee Stadium          99
13        American Family Field          98
 5             Oakland Coliseum          97
 6              Tropicana Field          97
25                   Citi Field          96
19                  Oracle Park          96
 3            Progressive Field          96
21                   Petco Park          95
23                T-Mobile Park          93
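If you want to practice the extraction step offline (as you mentioned doing with a saved copy of the page), here is a minimal sketch. It assumes the page embeds the rows as a JSON array assigned to a `data` variable inside a <script> tag. The `SAMPLE_HTML` string and the `extract_data` helper are made up for illustration, and the real page's markup may differ. Instead of a greedy regex it uses `json.JSONDecoder().raw_decode`, which parses one JSON value and ignores whatever follows it, so it might hold up better if the script has more statements on the same line.

```python
import json
import re

# Hypothetical stand-in for the saved page source; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div id="parkFactors"></div>
<script>
var data = [{"venue_name": "Coors Field", "index_woba": "113"}, {"venue_name": "T-Mobile Park", "index_woba": "93"}]; var other = {"a": 1};
</script>
</body></html>
"""


def extract_data(html):
    """Return the JSON value assigned to `data` in an inline script (illustrative helper)."""
    # Find where the assignment starts; the match ends right before the JSON text.
    match = re.search(r"\bdata\s*=\s*", html)
    if match is None:
        raise ValueError("no `data = ...` assignment found in the page source")
    # raw_decode parses a single JSON value starting at the given index
    # and returns it together with the index where parsing stopped.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


rows = extract_data(SAMPLE_HTML)
for row in rows:
    print(row["venue_name"], int(row["index_woba"]))
```

With a locally saved file you would read its contents into a string and pass that to `extract_data` in place of `SAMPLE_HTML`; the resulting list of dicts can then go straight into `pd.DataFrame(...)` as in the code above.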