Unable to scrape data inside a "div" class on a WSJ page

Question

I am trying to scrape the text content of articles on the WSJ website. For example, consider the following HTML source:

<div class="article-content ">
       <p>BEIRUT—
      Carlos Ghosn, 
       who is seeking to clear his name in Lebanon, would face a very different path to vindication here, where endemic corruption and the former auto executive’s widespread popularity could influence the outcome of a potential trial. </p> <p>Mr. Ghosn, the former chief of auto makers

I am using the following code:

import requests
from bs4 import BeautifulSoup

res = requests.get(url)  # url: address of the WSJ article page
html = BeautifulSoup(res.text, "lxml")
classid = "article-content "
item = html.find_all("div", {"class": classid})

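This returns an empty result. I have seen other posts where people suggested adding delays and other workarounds, but none of them work in my case. I plan to use the scraped text in some ML projects.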

I have a WSJ subscription and was logged in when running the above script.

Any help would be greatly appreciated! Thanks.

Answer 1

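Your code works fine for me. Just make sure you are searching for the correct `classid`. I don't think it will make a difference, but you could try this as an alternative: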
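One thing that may be worth checking is the trailing space in `classid`. BeautifulSoup treats `class` as a list of whitespace-separated tokens, so whether the string "article-content " (with the space) matches can depend on the installed version; searching for the bare token sidesteps that. The sketch below only illustrates this idea and is not necessarily the alternative meant above: the helper name `find_article_divs` and the inline sample markup are hypothetical, and it uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the markup in the question (shortened).
SAMPLE_HTML = """
<div class="article-content ">
  <p>BEIRUT</p> <p>Mr. Ghosn, the former chief of auto makers</p>
</div>
"""


def find_article_divs(html_text, class_name="article-content "):
    """Return <div> tags whose class list contains class_name (whitespace stripped)."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Strip the name so a stray leading/trailing space cannot affect matching.
    return soup.find_all("div", class_=class_name.strip())


divs = find_article_divs(SAMPLE_HTML)
# Collect the text of every paragraph inside the matched divs.
paragraphs = [p.get_text(strip=True) for div in divs for p in div.find_all("p")]
print(len(divs), paragraphs)
```

If this still returns nothing for the real page, it may help to check whether `res.text` contains the string `article-content` at all. `requests.get` does not share a browser's login session, so the response may differ from the page seen when logged in.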
