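I want to scrape https://www.rotowire.com/baseball/news.php, which contains news about MLB players, and save the data in a tabular format like this:
Date | Player | Headline | News
---|---|---|---
4/17 | Abner Uribe | Earns second win | Uribe (2-1) earned the win Wednesday against the Padres after allowing a hit and no walks in a scoreless eighth inning. He had one strikeout.
4/17 | Richie Palacios | Day off vs. lefty | Palacios is out of the lineup for Wednesday's game against the Angels.
I'm having trouble understanding how to isolate each news item into its own row in a dataframe. Any help achieving this would be appreciated. Ideally, I'd scrape every 5 minutes and keep the table growing.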
To get all the information from that page into a dataframe, you can use the following example:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/news.php"

# Download the page and parse the HTML
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
# Each news item sits in its own element with the "news-update" class
for n in soup.select(".news-update"):
    name = n.a.text  # text of the first link inside the item
    h = n.select_one(".news-update__headline").text
    dt = n.select_one(".news-update__timestamp").text
    news = n.select_one(".news-update__news").text
    all_data.append({"Name": name, "Headline": h, "Date": dt, "News": news})

# One dict per news item becomes one row in the dataframe
df = pd.DataFrame(all_data)
print(df.head())
Prints:
Name Headline Date News
0 Joe Jacques Recalled from Triple-A April 17, 2024 Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.
1 Cedric Mullins Walks off Twins April 17, 2024 Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.
2 Garrett Whitlock Lands on injured list April 17, 2024 Whitlock was placed on the 15-day injured list by the Red Sox on Wednesday with a left oblique strain, Mac Cerullo of the Boston Herald reports.
3 Eli Morgan Shelved with shoulder inflammation April 17, 2024 The Guardians placed Morgan on the 15-day injured list Wednesday with right shoulder inflammation, Joe Noga of The Cleveland Plain Dealer reports.
4 Craig Kimbrel Earns third win April 17, 2024 Kimbel (3-0) earned the win Wednesday against the Twins after he retired all three batters he faced in the ninth inning. He had one strikeout.
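Note: I'd recommend storing all of this in an SQL database (for example SQLite, which is included with Python), skipping rows that are already stored so that no duplicates are inserted, and setting up a cron job to run this script every 5 minutes.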
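As a rough sketch of that idea (not a definitive implementation), the rows collected in `all_data` could be written to SQLite as shown here. The table name `news`, the helper names `init_db` and `save_rows`, and the choice of (name, headline, date) as the key that identifies a duplicate are all assumptions made for illustration. SQLite only rejects duplicates if the table declares a `UNIQUE` constraint, which is why the sketch pairs one with `INSERT OR IGNORE`.

```python
import sqlite3


def init_db(conn):
    """Create the (hypothetical) news table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS news (
            name     TEXT NOT NULL,
            headline TEXT NOT NULL,
            date     TEXT NOT NULL,
            news     TEXT NOT NULL,
            -- Assumption: these three columns together identify one news item
            UNIQUE (name, headline, date)
        )
        """
    )
    conn.commit()


def save_rows(conn, rows):
    """Insert rows shaped like the dicts in all_data; return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            # INSERT OR IGNORE skips rows that violate the UNIQUE constraint
            "INSERT OR IGNORE INTO news (name, headline, date, news) "
            "VALUES (:Name, :Headline, :Date, :News)",
            rows,
        )
    after = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
    return after - before
```

At the end of the scraping script this might be called as `conn = sqlite3.connect("rotowire_news.db")`, then `init_db(conn)` and `save_rows(conn, all_data)`; the file name is a placeholder. A crontab entry along the lines of `*/5 * * * * python3 /path/to/scrape.py` (the path is hypothetical) would then run it every 5 minutes, and `pd.read_sql("SELECT * FROM news", conn)` can load the accumulated table back into a dataframe.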