Is there a way to fetch Reuters articles with a Python script?

Problem description · Votes: 0 · Answers: 1

The problem I keep running into is that, because of a short attention span, I can't get through an entire Reuters article. So I took this script, adapted it to read news articles, and pipe the text into an audio file with AWS Polly. Since a few days ago, the HTML can no longer be retrieved; I get an HTTP 401 and the following log:

2023-12-15 10:15:44,621 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): www.reuters.com:80
2023-12-15 10:15:44,632 - urllib3.connectionpool - DEBUG - http://www.reuters.com:80 "GET /technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13 HTTP/1.1" 301 167
2023-12-15 10:15:44,634 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.reuters.com:443
2023-12-15 10:15:44,709 - urllib3.connectionpool - DEBUG - https://www.reuters.com:443 "GET /technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13 HTTP/1.1" 401 582
2023-12-15 10:15:44,712 - root - DEBUG - Error code 401 http://www.reuters.com/technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13
2023-12-15 10:15:44,712 - root - DEBUG - Failed to retrieve the web page at URL: http://www.reuters.com/technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13
2023-12-15 10:15:44,712 - root - DEBUG - response text: <html><head><title>reuters.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMAY5rduiRiM_sABT8_8g==','hsh':'2013457ADA70C67D6A4123E0A76873','t':'fe','s':43909,'e':'4154644b4e799a7c95206619e5f860d485718881475c9617fbab1958edc297fb','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>

Honestly, I don't know how to get around this. Thanks for your help. Here is my current code.


from requests_html import HTMLSession  # Import HTMLSession
import requests
from bs4 import BeautifulSoup
import os
import subprocess
import logging


# Function to remove special characters and whitespace
def remove_special_characters(text):
    return ''.join(e for e in text if e.isalnum())


# Function to process a single URL
def process_url(uri):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/96.0.4664.45 Safari/537.36',
        'Referer': 'https://www.google.com/'
    }

    # Send a GET request to the URL using the standard requests module
    response = requests.get(uri, headers=headers)

    # Check if the request was successful
    if response.status_code == 200:
        # Create an HTML session
        session = HTMLSession()

        # Send a GET request to the URL using the HTML session
        response = session.get(uri, headers=headers)

        # Render JavaScript
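        # (requests_html drives a headless Chromium via pyppeteer under the hood;
        # the first call to render() downloads a Chromium build, which can take a while)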
        response.html.render()

        # Parse the HTML content of the page
        soup = BeautifulSoup(response.html.html, 'html.parser')  # Use response.html.html for parsed content

        # Find the h1 tag
        h1_tag = soup.find('h1')

        if h1_tag:
            # Get the text inside the h1 tag
            h1_text = h1_tag.get_text()

            # Remove special characters and whitespace from h1 text
            h1_cleaned = remove_special_characters(h1_text)

            # Create a directory for storing the files if it doesn't exist
            if not os.path.exists("output"):
                os.makedirs("output")

            # Create a file with the h1 tag as the filename
            filename = os.path.join("output", f"{h1_cleaned}.log")

            # Find all paragraph tags
            paragraphs = soup.find_all('p')

            # Check if any <p> tag contains "Acquire Licensing Rights" and exclude it
            paragraphs = [p for p in paragraphs if "Acquire Licensing Rights" not in p.get_text()]

            # Combine the h1 text and paragraphs into one string
            content = h1_text + '\n\n'
            for paragraph in paragraphs:
                content += paragraph.get_text() + '\n\n'

            # Remove content after "Reporting"
            index = content.find("Reporting")
            if index != -1:
                content = content[:index]

            # Write the content to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(content)

            print(f"Content saved to {filename}")

            # Close the HTML session
            session.close()
        else:
            print("No h1 tag found on the page.")
            logging.debug('No H1 Tag could be found on ' + uri)
    else:
        logging.debug('Error code %s %s', response.status_code, uri)
        print(f"Failed to retrieve the web page at URL: {uri} (error code {response.status_code})")
        logging.debug('Failed to retrieve the web page at URL: %s', uri)
        logging.debug('response text: %s', response.text)


# Read URLs from a file (e.g., urls.txt)
with open("urls.txt", "r") as url_file:
    urls = url_file.read().splitlines()

logging.basicConfig(filename='blankocleaner.log', filemode='a', format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                    level=logging.DEBUG)
# Process each URL
for url in urls:
    process_url(url)

z = False
while not z:
    # bool() of a non-empty string is True, so typing anything counts as confirmation
    z = bool(input("Can you confirm you read the logs? "))

This worked for the past six months. I did manage to add headers, but they still don't change anything, so I'm not really sure what else to try.
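For reference, the AWS Polly step I mentioned (not shown above) is roughly the following — a minimal sketch using boto3; the voice ID and output filename here are placeholders, not what my real script hardcodes:

import boto3

def synthesize_to_mp3(text, out_path):
    # Polly's synthesize_speech call is limited to 3,000 billed characters,
    # so long articles need to be split into chunks before this step
    polly = boto3.client('polly')
    result = polly.synthesize_speech(Text=text, OutputFormat='mp3', VoiceId='Joanna')
    with open(out_path, 'wb') as f:
        f.write(result['AudioStream'].read())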

python http web-scraping
1 Answer

0 votes

The problem is a captcha: the response text in your log is a challenge page served from geo.captcha-delivery.com (a DataDome challenge). There is no way to bypass it from a plain script, but you can fetch Google's cached copy of the content instead.

Try prepending the following URL to each line read from urls.txt:

cache = 'http://webcache.googleusercontent.com/search?q=cache:'

urls = [cache + url for url in url_file.read().splitlines()]

(splitlines() returns a list, so each line has to be prefixed individually rather than concatenating the string onto the list.)
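To sanity-check a single cached URL first — a minimal sketch, reusing the User-Agent idea from the question; the article URL is the one from your log:

import requests

cache = 'http://webcache.googleusercontent.com/search?q=cache:'
url = 'http://www.reuters.com/technology/cybersecurity/ukraine-says-russian-intelligence-linked-hackers-claim-cyberattack-mobile-2023-12-13'
response = requests.get(cache + url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)  # 200 means a cached copy is available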
