A recurring problem for me is that, due to attention issues, I can't get through an entire Reuters article. So I adapted this script to read news articles and pipe them into an audio file with AWS Polly. As of a few days ago, the HTML can no longer be fetched; I get an HTTP 401 with the following log:
2023-12-15 10:15:44,621 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): www.reuters.com:80
2023-12-15 10:15:44,632 - urllib3.connectionpool - DEBUG - http://www.reuters.com:80 "GET /technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13 HTTP/1.1" 301 167
2023-12-15 10:15:44,634 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.reuters.com:443
2023-12-15 10:15:44,709 - urllib3.connectionpool - DEBUG - https://www.reuters.com:443 "GET /technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13 HTTP/1.1" 401 582
2023-12-15 10:15:44,712 - root - DEBUG - Error code 401 http://www.reuters.com/technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13
2023-12-15 10:15:44,712 - root - DEBUG - Failed to retrieve the web page at URL: http://www.reuters.com/technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13
2023-12-15 10:15:44,712 - root - DEBUG - response text: <html><head><title>reuters.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMAY5rduiRiM_sABT8_8g==','hsh':'2013457ADA70C67D6A4123E0A76873','t':'fe','s':43909,'e':'4154644b4e799a7c95206619e5f860d485718881475c9617fbab1958edc297fb','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>
Honestly, I don't know how to get around this. Thanks for any help. Here is my current code:
from requests_html import HTMLSession  # Import HTMLSession
import requests
from bs4 import BeautifulSoup
import os
import subprocess
import logging

# Configure logging before any logging calls are made
logging.basicConfig(filename='blankocleaner.log', filemode='a',
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.DEBUG)

# Function to remove special characters and whitespace
def remove_special_characters(text):
    return ''.join(e for e in text if e.isalnum())

# Function to process a single URL
def process_url(uri):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/96.0.4664.45 Safari/537.36',
        'Referer': 'https://www.google.com/'
    }
    # Send a GET request to the URL using the standard requests module
    response = requests.get(uri, headers=headers)
    # Check if the request was successful
    if response.status_code == 200:
        # Create an HTML session
        session = HTMLSession()
        # Send a GET request to the URL using the HTML session
        response = session.get(uri, headers=headers)
        # Render JavaScript
        response.html.render()
        # Parse the HTML content of the page
        soup = BeautifulSoup(response.html.html, 'html.parser')  # Use response.html.html for parsed content
        # Find the h1 tag
        h1_tag = soup.find('h1')
        if h1_tag:
            # Get the text inside the h1 tag
            h1_text = h1_tag.get_text()
            # Remove special characters and whitespace from h1 text
            h1_cleaned = remove_special_characters(h1_text)
            # Create a directory for storing the files if it doesn't exist
            if not os.path.exists("output"):
                os.makedirs("output")
            # Create a file with the h1 tag as the filename
            filename = os.path.join("output", f"{h1_cleaned}.log")
            # Find all paragraph tags
            paragraphs = soup.find_all('p')
            # Exclude any <p> tag that contains "Acquire Licensing Rights"
            paragraphs = [p for p in paragraphs if "Acquire Licensing Rights" not in p.get_text()]
            # Combine the h1 text and paragraphs into one string
            content = h1_text + '\n\n'
            for paragraph in paragraphs:
                content += paragraph.get_text() + '\n\n'
            # Remove content after "Reporting"
            index = content.find("Reporting")
            if index != -1:
                content = content[:index]
            # Write the content to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(content)
            print(f"Content saved to {filename}")
            # Close the HTML session
            session.close()
        else:
            print("No h1 tag found on the page.")
            logging.debug('No H1 Tag could be found on ' + uri)
    else:
        logging.debug('Error code ' + str(response.status_code) + " " + uri)
        print(f"Failed to retrieve the web page at URL: {uri} Error code {response.status_code}")
        logging.debug('Failed to retrieve the web page at URL: ' + uri)
        logging.debug('response text: ' + response.text)

# Read URLs from a file (e.g., urls.txt)
with open("urls.txt", "r") as url_file:
    urls = url_file.read().splitlines()

# Process each URL
for url in urls:
    process_url(url)

confirmed = False
while not confirmed:
    confirmed = input("Can you confirm you read the logs? (yes/no): ").strip().lower() == 'yes'
This worked for the past six months. I did manage to add headers, but they don't change anything. So I'm not really sure what to do.
The problem is the captcha. You can't bypass it, but you can fetch the cached content instead. Try prepending the following URL to each line read from urls.txt:

cache = 'http://webcache.googleusercontent.com/search?q=cache:'
urls = [cache + line for line in url_file.read().splitlines()]
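A minimal sketch of that prefixing step as a standalone function (the name build_cached_urls is mine, not from your script; splitlines() returns a list, so the prefix has to be applied per element, not to the list itself):

```python
# Google web-cache endpoint; appending an article URL after "cache:"
# requests Google's cached copy rather than reuters.com directly.
CACHE_PREFIX = 'http://webcache.googleusercontent.com/search?q=cache:'

def build_cached_urls(lines):
    """Return cache-prefixed URLs, skipping blank lines from urls.txt."""
    return [CACHE_PREFIX + line.strip() for line in lines if line.strip()]

urls = build_cached_urls([
    'http://www.reuters.com/technology/some-article',
    '',  # blank lines are dropped
])
print(urls)
```

With this, the rest of your script (process_url and the loop) stays unchanged; only the list of URLs it iterates over is rewritten to point at the cache.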