How to get JSON data inside a script tag using Cheerio

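I'm trying to scrape metadata from a bunch of websites. For most of them, using Cheerio to get something like `$('meta[property="article:published_time"]').attr('content')` works fine. For others, however, this meta property isn't explicitly defined, but the data is present in the HTML in some form.

For example, if I scrape this page, there is no `published_time` meta property, but this text is present in the file...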

{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":"https://news.yahoo.com/venezuela-deploys-soldiers-face-guyana-175722970.html","headline":"Venezuela Deploys Troops to East Caribbean Coast, Citing Guyana Threat","datePublished":"2023-12-28T19:53:10.000Z","dateModified":"2023-12-28T19:53:10.000Z","keywords":["Nicolas Maduro","Venezuela","Bloomberg","Guyana","Essequibo","Exxon Mobil Corp"],"description":"(Bloomberg) -- Venezuela has decided to deploy more than 5,000 soldiers on its eastern Caribbean coast after neighboring Guyana received a warship from the...","publisher":{"@type":"Organization","name":"Yahoo News","logo":{"@type":"ImageObject","url":"https://s.yimg.com/rz/p/yahoo_news_en-US_h_p_news_2.png","width":310,"height":50},"url":"https://news.yahoo.com/"},"author":{"@type":"Person","name":"Andreina Itriago Acosta","url":"","jobTitle":""},"creator":{"@type":"Person","name":"Andreina Itriago Acosta","url":"","jobTitle":""},"provider":{"@type":"Organization","name":"Bloomberg","url":"https://www.bloomberg.com/","logo":{"@type":"ImageObject","width":339,"height":100,"url":"https://s.yimg.com/cv/apiv2/hlogos/bloomberg_Light.png"}},"image":{"@type":"ImageObject","url":"https://s.yimg.com/ny/api/res/1.2/hs3Vjof2BqloeagLdsvfDw--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD0xMjAy/https://media.zenfs.com/en/bloomberg_politics_602/2db14d66c52bec70cb0ec6d0553968c6","width":1200,"height":1202},"thumbnailUrl":"https://s.yimg.com/ny/api/res/1.2/hs3Vjof2BqloeagLdsvfDw--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD0xMjAy/https://media.zenfs.com/en/bloomberg_politics_602/2db14d66c52bec70cb0ec6d0553968c6"}

This object has a `"datePublished"` field. How can I get this property with Cheerio?

Tags: javascript, json, web-scraping, cheerio

1 Answer

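The data you want is JSON inside a `<script>` tag. To find it, I'd select all `<script>` tags, loop over them to find the one containing the substring `'"datePublished":'`, extract its text, run it through `JSON.parse()`, and finally access the `.datePublished` property: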

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const url = "<Your URL>";

fetch(url)
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    const el = [...$("script")].find(e =>
      $(e).text().includes('"datePublished":')
    );
    const meta = JSON.parse($(el).text());
    console.log(meta.datePublished); // => 2023-12-28T19:53:10.000Z
  })
  .catch(err => console.error(err));
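
Note that the snippet above throws if no `<script>` contains the substring, or if the first match isn't valid JSON. A slightly more defensive sketch follows. It is not part of the original approach: `findInJsonLd` is a hypothetical helper name, and the `scriptTexts` array is a stand-in for the text of the page's `<script>` tags.

```javascript
// Hypothetical helper: given the text of each candidate <script> tag,
// return the first parsed object (or array item) that has `key`.
function findInJsonLd(scriptTexts, key) {
  for (const text of scriptTexts) {
    let data;
    try {
      data = JSON.parse(text);
    } catch {
      continue; // not valid JSON, skip this tag
    }
    // A JSON-LD block may hold a single object or an array of objects
    const items = Array.isArray(data) ? data : [data];
    const hit = items.find(
      item => item !== null && typeof item === "object" && key in item
    );
    if (hit) {
      return hit;
    }
  }
  return undefined;
}

// Stand-in for the text of the page's <script> tags (made-up sample data)
const scriptTexts = [
  "window.dataLayer = [];",
  '{"@type":"NewsArticle","datePublished":"2023-12-28T19:53:10.000Z"}',
];
const article = findInJsonLd(scriptTexts, "datePublished");
console.log(article?.datePublished);
```

With Cheerio, the texts could be collected with something like `$('script[type="application/ld+json"]').map((_, e) => $(e).text()).get()`, assuming the page marks its structured data with that `type` attribute; if it doesn't, selecting plain `script` works too, since non-JSON tags are skipped.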

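See this post for a general tutorial on this specific technique. It's written in Python, but the same concepts apply in Node. Sometimes the JSON in a `<script>` is a JS object, or is assigned to a variable, which makes parsing trickier than the simple scenario here and usually calls for some regex or JSON5. See this answer for a more complex example of parsing data from a `<script>` tag with Cheerio.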
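
As a rough illustration of the variable-assignment case, the sketch below pulls an object out of script text with a regular expression. The variable name `window.__DATA__`, the sample script, and the helper `extractAssignedJson` are all made up for the example. It assumes the assigned value is strict JSON, that the statement ends with `};`, and that no string inside the data contains `};`. A JS object literal with unquoted keys or trailing commas would need JSON5 or similar instead of `JSON.parse()`.

```javascript
// Hypothetical helper: find `<name> = {...};` in script text and parse the object.
// Assumes the value is valid JSON and the statement is terminated by "};".
function extractAssignedJson(scriptText, name) {
  // Escape regex metacharacters in the variable name (e.g. the dot in "window.x")
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Lazily capture from the first "{" up to the first "}" that is followed by ";"
  const pattern = new RegExp(escaped + "\\s*=\\s*(\\{[\\s\\S]*?\\})\\s*;");
  const match = scriptText.match(pattern);
  return match ? JSON.parse(match[1]) : undefined;
}

// Made-up script content for illustration
const scriptText =
  'window.__DATA__ = {"article":{"datePublished":"2023-12-28T19:53:10.000Z"}};';
const data = extractAssignedJson(scriptText, "window.__DATA__");
console.log(data.article.datePublished);
```

The script text itself would come from Cheerio the same way as in the answer's code, by finding the `<script>` whose text includes the variable name and calling `.text()` on it.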
