Updating a Python scraper that stopped working because of a JavaScript block


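I wrote a scraper (deployed on a schedule) that worked well until a recent change by the site (The New York Times) broke it.

Essentially, the scraper visits an article URL and uses XPath to extract the full article content, which I pass to an LLM to summarize.

Here is the code: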

--

import html2text
import requests
from scrapy.selector import Selector

# Works with any nytimes.com article URL
url = 'https://www.nytimes.com/2024/02/06/us/politics/border-ukraine-israel-aid-congress.html'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}
response = requests.get(url, headers=headers)

sel = Selector(text=response.text)
# Article container, then the body paragraphs inside it
b = sel.xpath('//section[@class="meteredContent css-1r7ky0e"]')
art = b.xpath('.//p[@class="css-at9mc1 evys1bk0"]').extract()  # article body
art2 = '\n '.join(art)  # join the paragraph HTML fragments with newlines
art2 = html2text.html2text(art2)  # convert from HTML to human/LLM-readable text
print(art2)
# pass art2 to an LLM via API

--

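Previously, the code above returned the whole article. Now it returns only part of the article, because the site throws up a JavaScript screen that asks for human verification before rendering the full article.

I have two questions:

  1. This is a fairly high-frequency call, so is there something clever I can do to get around this restriction without adding a heavy stack that renders JavaScript in a browser for every call, for example using a hidden API endpoint or adding header values that indicate the call comes from a human?
  2. If the answer to 1 is "no", what is the simplest, most lightweight library, package, or method for rendering the JavaScript on a site like this and scraping it? I run this script on a very lightweight server, so I really want to avoid increasing the memory/infrastructure needed to run this code.

Thanks very much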

python web-scraping python-requests
1 Answer
0 votes

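Try reading the article from the JSON that the page embeds in window.__preloadedData instead of the rendered HTML: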

import json
import re

import requests

url = "https://www.nytimes.com/2024/02/06/us/politics/border-ukraine-israel-aid-congress.html"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0"
}

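html_text = requests.get(url, headers=headers).text
# The article data is embedded in the page as a JavaScript object literal
data = re.search(r"window\.__preloadedData = (.*}});", html_text).group(1)
# "undefined" is not valid JSON, so map it to null before parsing
data = data.replace(":undefined", ":null")
data = json.loads(data)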

# print(json.dumps(data, indent=4))

headline = None
summary = None
content = []

# Note: match/case requires Python 3.10 or newer
for c in data["initialData"]["data"]["article"]["sprinkledBody"]["content"]:
    match c["__typename"]:
        case "HeaderBasicBlock":
            headline = c["headline"]["content"][0]["text"]
            summary = c["summary"]["content"][0]["text"]
        case "ParagraphBlock":
            t = "".join(cc["text"] for cc in c["content"])
            content.append(t)

print(headline)
print()
print(summary)
print("-" * 80)
print("\n\n".join(content))

Prints:

With Demise of Border Deal, No Clear Path for Ukraine and Israel Aid in Congress

The House was set to consider an Israel-only aid bill that faced bipartisan resistance and a veto threat from President Biden. The Senate’s broader package with Ukraine aid appeared dead.
--------------------------------------------------------------------------------
The decision by Republicans in Congress to torpedo a bipartisan border deal they demanded has left the fate of aid to Ukraine and Israel in peril, closing off what had been seen as the best remaining avenue on Capitol Hill for approval of critical military aid to American allies.

The political paralysis in the face of pleas from President Biden, lawmakers in both parties and leaders around the world for quick action raised immediate questions about whether Congress would be able to salvage the emergency aid package — and if so, how.

...
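
For a scheduled job it may help to wrap the same idea in functions that fail softly. The following is only a sketch under assumptions: the helper names (parse_preloaded, extract_article, _first_text) and the SAMPLE_HTML string are made up for illustration, and the JSON layout is simply what the code above relies on, so it may change whenever the site changes. It uses if/elif instead of match, so it also runs on Python versions older than 3.10.

```python
import json
import re

# Same pattern as in the answer above; the page layout is an assumption.
PRELOADED_RE = re.compile(r"window\.__preloadedData = (.*}});")


def parse_preloaded(html_text):
    """Return the embedded page state as a dict, or None if it is missing."""
    m = PRELOADED_RE.search(html_text)
    if m is None:
        return None
    # "undefined" is not valid JSON, so map it to null before parsing
    raw = m.group(1).replace(":undefined", ":null")
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def _first_text(node):
    """Return the text of the first content item of a block field, if any."""
    content = (node or {}).get("content") or []
    return content[0].get("text") if content else None


def extract_article(data):
    """Collect headline, summary and paragraph texts from the parsed state."""
    article = {"headline": None, "summary": None, "paragraphs": []}
    try:
        blocks = data["initialData"]["data"]["article"]["sprinkledBody"]["content"]
    except (KeyError, TypeError):
        # Layout changed or nothing was parsed: return the empty result
        return article
    for block in blocks:
        kind = block.get("__typename")
        if kind == "HeaderBasicBlock":
            article["headline"] = _first_text(block.get("headline"))
            article["summary"] = _first_text(block.get("summary"))
        elif kind == "ParagraphBlock":
            parts = block.get("content", [])
            article["paragraphs"].append("".join(p.get("text", "") for p in parts))
    return article


# Hypothetical, hand-written page fragment used only to exercise the helpers
SAMPLE_HTML = (
    '<script>window.__preloadedData = {"initialData":{"data":{"article":'
    '{"sprinkledBody":{"content":['
    '{"__typename":"HeaderBasicBlock",'
    '"headline":{"content":[{"text":"Sample headline"}]},'
    '"summary":{"content":[{"text":"Sample summary"}]}},'
    '{"__typename":"ParagraphBlock","extra":undefined,'
    '"content":[{"text":"First "},{"text":"paragraph."}]}'
    "]" + "}" * 5 + ";</script>"
)

article = extract_article(parse_preloaded(SAMPLE_HTML))
print(article["headline"])
print("\n\n".join(article["paragraphs"]))
```

With a real page, pass the html_text from the requests call to parse_preloaded. A None result, or an empty paragraphs list, is a cheap signal that the embedded data was not found and the job should log it instead of sending a partial article to the LLM.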