Web scraping a "load more" button with Scrapy and Selenium

Problem description

I am currently trying to scrape articles from the Nepali Times website. The challenge I am facing is that the site uses a "Load More" button that has to be clicked to load additional articles. My crawl successfully retrieves the initial page with the first six articles, but it fails to click the "Load More" button to load the rest. As a result, I cannot scrape anything beyond those first six articles.

In addition, during the crawl it keeps fetching URLs, but instead of the desired content it gets back an "oops" page, which suggests there is a problem with Selenium and the button-click handling.

If anyone could explain how I should approach this, I would really appreciate it!

import scrapy
import json
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http.request import Request

class NepaliSpider(CrawlSpider):
    name = "nepalitimes"
    allowed_domains = ["nepalitimes.com"]
    # Start URL for the spider
    start_urls = ['https://www.nepalitimes.com/news']

    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'nepali_times.csv'
    }

    # Rule to follow links to individual article pages
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    # Handling the "load more" button (work in progress)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response, **kwargs):
        # Parse the articles from the initial page
        for result in response.xpath(".//div[contains(@class,'main--left')]/a"):
            relative_url = result.xpath("@href").extract_first()
            absolute_url = response.urljoin(relative_url)
            yield scrapy.Request(url=absolute_url, callback=self.parse_item)

        # Check if there is a "Load More" button
        load_more_button = response.xpath(".//button[contains(@class, 'btn btn--load center') and contains(., 'load more')]")
        if load_more_button:
            print("Load more button detected")
            tenant_code = "epz639"
            routeId = 8
            limit = 10
            offset = 10  

            # Prepare the data payload for the POST request
            data = {
                "query": "query getMoreArticles($tenant_code: String, $routeId: Int, $limit: Int, $offset: Int) { articles: getPublicContent(tenant_code: $tenant_code, routeId: $routeId, limit: $limit, offset: $offset) { id } }",
                "variables": {
                    "tenant_code": tenant_code,
                    "routeId": routeId,
                    "limit": limit,
                    "offset": offset
                }
            }

            # Send a POST request to the endpoint using scrapy.FormRequest
            yield scrapy.FormRequest(url="https://nepalitimes-hasura.superdesk.org/v1/graphql",
                                     formdata={"query": json.dumps(data["query"]), "variables": json.dumps(data["variables"])},
                                     headers={"Content-Type": "application/json"},
                                     callback=self.parse_ajax_response)
            print("Post resquest sent")

    def parse_ajax_response(self, response):
        # Parse the JSON body returned by the GraphQL endpoint
        json_response = json.loads(response.text)
        if 'data' in json_response and 'articles' in json_response['data']:
            articles = json_response['data']['articles']
            print("Articles :", articles)
            for article in articles:
                # Assuming there's an 'id' field in the response representing the article ID
                article_id = article['id']
                article_url = f"https://www.nepalitimes.com/news/{article_id}"  # Adjust this based on the actual URL structure
                yield scrapy.Request(url=article_url, callback=self.parse_item)

    def parse_item(self, response):
        # This function should extract the article information from the provided response
        # and yield the scraped data as a dictionary

        # Extract article information using XPath selectors
        title = response.xpath('.//article[contains(@class,"article__full")]/h1/text()').get()
        subtitle = response.xpath('.//span[contains(@class,"article__subhead")]/text()').get()
        date = response.xpath(".//div/time[contains(@class,'article__time')]/text()").get()
        author = response.xpath('.//div/span[contains(@class,"article__author")]/span/text()').get()
        category = response.xpath(".//a[contains(@class,'active')]/text()").get()
        url = response.xpath(".//meta[contains(@property, 'og:url')]/@content").get()

        # Parse the HTML content
        content_elements = response.xpath('.//div[contains(@class,"article__text")]/p')
        text_content = [element.xpath("string(.)").get().strip() for element in content_elements]
        cleaned_content = ' '.join(text_content)

        yield {
            'title': title,
            'subtitle': subtitle,
            'author': author,
            'date': date,
            'content': cleaned_content,
            'category': category,
            'URL': url
        }

OK, so I tried @Leandro's suggestion, that is, using Chrome DevTools instead of Selenium, but it does not seem to trigger the parse_ajax_response function... It still runs without giving the result I want (only 9 items scraped). I need some help.
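One possible explanation, offered as an assumption rather than something confirmed in the thread: scrapy.FormRequest with formdata= sends an URL-encoded form body, while the Hasura GraphQL endpoint above expects a single JSON document in the request body, which would fit the callback never firing and the "oops" page. A minimal sketch of the same call sent as raw JSON (the endpoint, query and variable values are copied from the code above; the spider and callback names are made up):

import json

import scrapy


class JsonPostSketchSpider(scrapy.Spider):
    # Sketch only: replay the "load more" GraphQL call with a raw JSON body
    # instead of FormRequest, which URL-encodes its formdata fields.
    name = "json_post_sketch"
    start_urls = ["https://www.nepalitimes.com/news"]

    def parse(self, response):
        payload = {
            # Query and variables copied from the question.
            "query": "query getMoreArticles($tenant_code: String, $routeId: Int, "
                     "$limit: Int, $offset: Int) { articles: getPublicContent("
                     "tenant_code: $tenant_code, routeId: $routeId, "
                     "limit: $limit, offset: $offset) { id } }",
            "variables": {"tenant_code": "epz639", "routeId": 8,
                          "limit": 10, "offset": 10},
        }
        yield scrapy.Request(
            url="https://nepalitimes-hasura.superdesk.org/v1/graphql",
            method="POST",
            body=json.dumps(payload),  # one JSON document, as the browser sends it
            headers={"Content-Type": "application/json"},
            callback=self.parse_graphql,
        )

    def parse_graphql(self, response):
        data = json.loads(response.text)
        for article in data.get("data", {}).get("articles", []):
            yield {"id": article["id"]}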

Here is what I get when I click the "load" button:

python selenium-webdriver web-scraping scrapy
1 Answer

I think the best approach is to inspect what request is being made when you click the "load more" button. You can do this, for example, with the Network tab in Chrome DevTools. Then you can schedule that same request in Scrapy after the first page has loaded. Most likely this request returns some JSON-like structure that you can handle in a separate method (see the callback argument of the Request object).

That way you can get rid of Selenium entirely and make your scraper much lighter. I hope this helps :)
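A minimal sketch of that approach, assuming the endpoint and query from the question: after the first listing page is parsed, keep POSTing the GraphQL request with an increasing offset, follow the returned article ids, and stop once the endpoint returns an empty list. The article URL pattern and all the names below are illustrative rather than taken from the site:

import json

import scrapy

GRAPHQL_URL = "https://nepalitimes-hasura.superdesk.org/v1/graphql"
# Query string as it appears in the question; only article ids are requested.
GRAPHQL_QUERY = (
    "query getMoreArticles($tenant_code: String, $routeId: Int, $limit: Int, "
    "$offset: Int) { articles: getPublicContent(tenant_code: $tenant_code, "
    "routeId: $routeId, limit: $limit, offset: $offset) { id } }"
)


def more_articles_request(offset, callback, limit=10):
    # Build one "load more" request for the given offset.
    payload = {
        "query": GRAPHQL_QUERY,
        "variables": {"tenant_code": "epz639", "routeId": 8,
                      "limit": limit, "offset": offset},
    }
    return scrapy.Request(
        GRAPHQL_URL,
        method="POST",
        body=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        callback=callback,
        cb_kwargs={"offset": offset, "limit": limit},
    )


class PaginationSketchSpider(scrapy.Spider):
    name = "pagination_sketch"
    start_urls = ["https://www.nepalitimes.com/news"]

    def parse(self, response):
        # Parse the first six articles from the HTML here, then start paginating.
        yield more_articles_request(offset=0, callback=self.parse_page)

    def parse_page(self, response, offset, limit):
        articles = json.loads(response.text).get("data", {}).get("articles", [])
        if not articles:
            return  # no more pages
        for article in articles:
            # URL pattern is an assumption; adjust it to the real article URLs.
            yield scrapy.Request(
                f"https://www.nepalitimes.com/news/{article['id']}",
                callback=self.parse_article,
            )
        # Ask for the next slice.
        yield more_articles_request(offset=offset + limit, callback=self.parse_page)

    def parse_article(self, response):
        yield {"url": response.url, "title": response.xpath("//h1/text()").get()}

The offset step here simply mirrors the limit used in the question; the real values should match whatever the site's own "load more" request sends.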
