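So I wrote a scraper (deployed on a schedule) that worked well until recent changes made by the site (The New York Times) broke it.
Essentially, the scraper works by visiting an article URL and using XPath to extract the full article content, which I then pass to an LLM to summarize.
Here is the code: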
--
import html2text  # was missing: html2text.html2text() is called below
import requests
from scrapy.selector import Selector

url = 'https://www.nytimes.com/2024/02/06/us/politics/border-ukraine-israel-aid-congress.html'  # works with any nytimes article url
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}
response = requests.get(url, headers=headers)
sel = Selector(text=response.text)
# the class names below are copied from the page markup
b = sel.xpath('//section[@class="meteredContent css-1r7ky0e"]')
art = b.xpath('.//p[@class="css-at9mc1 evys1bk0"]').extract()  # article body paragraphs (HTML)
art2 = '\n '.join(art)  # join the paragraphs with newlines
art2 = html2text.html2text(art2)  # convert from HTML to human/LLM-readable text
print(art2)
# pass to an LLM via API
--
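Previously, the code above returned the whole article. Now it returns only part of the article, because the site throws up a JavaScript screen that requires human verification before the full article is rendered.
I have two questions:
Many thanks.
Try: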
import json
import re
import requests
url = "https://www.nytimes.com/2024/02/06/us/politics/border-ukraine-israel-aid-congress.html"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0"
}
html_text = requests.get(url, headers=headers).text
data = re.search(r"window\.__preloadedData = (.*}});", html_text).group(1)
data = data.replace(":undefined", ":null")
data = json.loads(data)
# print(json.dumps(data, indent=4))
headline = None
summary = None
content = []
for c in data["initialData"]["data"]["article"]["sprinkledBody"]["content"]:
    match c["__typename"]:
        case "HeaderBasicBlock":
            headline = c["headline"]["content"][0]["text"]
            summary = c["summary"]["content"][0]["text"]
        case "ParagraphBlock":
            t = "".join(cc["text"] for cc in c["content"])
            content.append(t)
print(headline)
print()
print(summary)
print("-" * 80)
print("\n\n".join(content))
Prints:
With Demise of Border Deal, No Clear Path for Ukraine and Israel Aid in Congress
The House was set to consider an Israel-only aid bill that faced bipartisan resistance and a veto threat from President Biden. The Senate’s broader package with Ukraine aid appeared dead.
--------------------------------------------------------------------------------
The decision by Republicans in Congress to torpedo a bipartisan border deal they demanded has left the fate of aid to Ukraine and Israel in peril, closing off what had been seen as the best remaining avenue on Capitol Hill for approval of critical military aid to American allies.
The political paralysis in the face of pleas from President Biden, lawmakers in both parties and leaders around the world for quick action raised immediate questions about whether Congress would be able to salvage the emergency aid package — and if so, how.
...
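Since the script runs on a schedule, it may help to wrap the extraction above in helpers that fail softly when the page layout changes again. The following is only a sketch under assumptions: the function names `extract_preloaded_data` and `article_paragraphs` and the `sample_html` string are made up for illustration, and the JSON shape is the one seen in the code above, which the site could change at any time. It uses plain `if` checks instead of `match`, so it also runs on Python versions older than 3.10.

```python
import json
import re


def extract_preloaded_data(html_text):
    """Return the parsed window.__preloadedData object, or None if it is absent."""
    m = re.search(r"window\.__preloadedData = (.*}});", html_text)
    if m is None:
        # Layout changed or a verification page was served instead of the article.
        return None
    # The embedded object is JavaScript, not strict JSON: map undefined to null.
    raw = m.group(1).replace(":undefined", ":null")
    return json.loads(raw)


def article_paragraphs(data):
    """Collect the text of every ParagraphBlock in the article body."""
    blocks = data["initialData"]["data"]["article"]["sprinkledBody"]["content"]
    paragraphs = []
    for block in blocks:
        if block.get("__typename") == "ParagraphBlock":
            paragraphs.append("".join(part.get("text", "") for part in block["content"]))
    return paragraphs


# Hypothetical miniature page, used only to exercise the helpers offline.
sample_html = (
    '<script>window.__preloadedData = {"initialData":{"data":{"article":'
    '{"sprinkledBody":{"content":[{"__typename":"ParagraphBlock",'
    '"content":[{"text":"Hello, "},{"text":"world."}],"extra":undefined}'
    ']}}}}};</script>'
)

data = extract_preloaded_data(sample_html)
if data is None:
    print("preloaded data not found")
else:
    print("\n\n".join(article_paragraphs(data)))
```

With a real page you would pass `requests.get(url, headers=headers).text` instead of `sample_html`, and treat a `None` result as a signal to skip the summary step rather than send a partial article to the LLM.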