Scraping rotowire MLB player news into a table with Python

Question (votes: 0, answers: 1)

I want to scrape https://www.rotowire.com/baseball/news.php, which contains news about MLB players, and save the data in a table format like this:

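Date | Player | Headline | News
4/17 | Abner Uribe | Picks up second win | Uribe (2-1) earned the win Wednesday against the Padres, allowing one hit and no walks in a scoreless eighth inning. He had one strikeout.
4/17 | Richie Palacios | Resting vs. lefty | Palacios is out of the lineup for Wednesday's game against the Angels.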

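I'm having trouble understanding how to isolate each news item into its own row of a dataframe. Any help with this would be appreciated. Ideally, I would scrape the page every 5 minutes and keep the table growing.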

python beautifulsoup scrape
1 Answer
0 votes

To get all the information from that page into a dataframe, you can use the following example:

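import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/news.php"

# Download the page and fail early on an HTTP error instead of parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

all_data = []
# Each news item sits in its own element with the class "news-update"
for n in soup.select(".news-update"):
    name = n.a.text  # the first link inside the item holds the player name
    h = n.select_one(".news-update__headline").text
    dt = n.select_one(".news-update__timestamp").text
    news = n.select_one(".news-update__news").text
    all_data.append({"Name": name, "Headline": h, "Date": dt, "News": news})

# One dictionary per news item becomes one row of the dataframe
df = pd.DataFrame(all_data)
print(df.head())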

Prints:

               Name                            Headline            Date                                                                                                                                                News
0       Joe Jacques              Recalled from Triple-A  April 17, 2024                                 Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.
1    Cedric Mullins                     Walks off Twins  April 17, 2024                                                Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.
2  Garrett Whitlock               Lands on injured list  April 17, 2024    Whitlock was placed on the 15-day injured list by the Red Sox on Wednesday with a left oblique strain, Mac Cerullo of the Boston Herald reports.
3        Eli Morgan  Shelved with shoulder inflammation  April 17, 2024  The Guardians placed Morgan on the 15-day injured list Wednesday with right shoulder inflammation, Joe Noga of The Cleveland Plain Dealer reports.
4     Craig Kimbrel                     Earns third win  April 17, 2024      Kimbel (3-0) earned the win Wednesday against the Twins after he retired all three batters he faced in the ninth inning. He had one strikeout.
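
The question asks for the columns Date, Player, Headline, News with short dates such as 4/17, while the dataframe above uses Name and a full date. A minimal sketch of that conversion follows. The helper name `to_question_layout` is made up for illustration, and it assumes every timestamp has the form "April 17, 2024" as in the printed output; other timestamp formats would need different handling.

```python
import pandas as pd

# Two sample rows copied from the printed output above
df = pd.DataFrame(
    [
        {
            "Name": "Joe Jacques",
            "Headline": "Recalled from Triple-A",
            "Date": "April 17, 2024",
            "News": "Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.",
        },
        {
            "Name": "Cedric Mullins",
            "Headline": "Walks off Twins",
            "Date": "April 17, 2024",
            "News": "Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.",
        },
    ]
)


def to_question_layout(frame):
    """Hypothetical helper: rename and reorder columns, shorten the date to M/D."""
    # Assumption: dates look like "April 17, 2024"
    parsed = pd.to_datetime(frame["Date"].str.strip(), format="%B %d, %Y")
    return pd.DataFrame(
        {
            "Date": parsed.dt.month.astype(str) + "/" + parsed.dt.day.astype(str),
            "Player": frame["Name"].str.strip(),
            "Headline": frame["Headline"].str.strip(),
            "News": frame["News"].str.strip(),
        }
    )


table = to_question_layout(df)
print(table[["Date", "Player", "Headline"]])
```

Dropping the year makes rows from different seasons indistinguishable, so it may be safer to store the full date and shorten it only for display.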

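Note: I recommend putting all of this information into a SQL database (for example SQLite, which ships with Python), skipping any rows that are already stored so that no duplicates are inserted, and setting up a cron job to run this script every 5 minutes.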
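
The SQLite suggestion could be sketched as follows. This is only one possible layout, not part of the original answer: the table name `news`, the function `save_news`, and the choice of (name, headline, date) as the uniqueness key are all assumptions. The rows are dictionaries shaped like the entries of `all_data` above, and an in-memory database stands in for a real file.

```python
import sqlite3


def save_news(conn, rows):
    """Hypothetical helper: insert rows, silently skipping ones already stored.

    Returns the number of rows that were actually added.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS news (
            name     TEXT NOT NULL,
            headline TEXT NOT NULL,
            date     TEXT NOT NULL,
            news     TEXT NOT NULL,
            UNIQUE (name, headline, date)  -- assumed key for spotting duplicates
        )
        """
    )
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO news (name, headline, date, news) "
            "VALUES (:Name, :Headline, :Date, :News)",
            rows,
        )
    return conn.total_changes - before


# Sample rows taken from the printed output above
rows = [
    {
        "Name": "Joe Jacques",
        "Headline": "Recalled from Triple-A",
        "Date": "April 17, 2024",
        "News": "Jacques was recalled from Triple-A Worcester by the Red Sox on Wednesday, Mac Cerullo of the Boston Herald reports.",
    },
    {
        "Name": "Cedric Mullins",
        "Headline": "Walks off Twins",
        "Date": "April 17, 2024",
        "News": "Mullins went 1-for-4 with a walk-off, two-run home run during Wednesday's 4-2 win against the Twins.",
    },
]

conn = sqlite3.connect(":memory:")  # use a file path to keep data between runs
first_run = save_news(conn, rows)   # both rows are new
second_run = save_news(conn, rows)  # same rows again, nothing is added
stored = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(first_run, second_run, stored)
```

In a real run, `rows` would be the `all_data` list from the scraper and the connection would point at a file on disk. A crontab entry along the lines of `*/5 * * * * python3 /path/to/scraper.py` (the path is a placeholder) would then run the script every 5 minutes.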
