Pagination with Scrapy: following the Next page

Question (votes: 0, answers: 1)

Clicking the Next page button does not change the URL, so I am having trouble paginating with Scrapy.

import scrapy

class LegonSpider(scrapy.Spider):
    name = "legon"

    def start_requests(self):
        yield scrapy.Request(
            url="https://mylegion.org/PersonifyEbusiness/Find-a-Post",
            callback=self.parse
        )

    def parse(self, response):
        # Select distance and country
        yield scrapy.FormRequest.from_response(
            response,
            formid='aspnetForm',
            formdata={'dnn$ctr2802$DNNWebControlContainer$ctl00$DistanceList': '100',
                      '@IP_COUNTRY': 'USA',
                      '@IP_DEPARTMENT': '00000000001L'},
            callback=self.parse_post_page
        )
    def parse_post_page(self, response):
        # Extract and yield requests for post detail pages
        post_elements = response.xpath("//div[@class='membership-dir-result-item']")
        for post_element in post_elements:
            post_num = post_element.xpath(".//div[contains(@class,'POST_NAME')]/text()").get().strip()
            post_link = post_element.xpath("./a/@href").get()
            yield response.follow(post_link, callback=self.parse_post_detail, meta={'post_num': post_num})


        next_page_button = response.xpath("/input[@id='dnn_ctr2802_DNNWebControlContainer_ctl00_Next']")
        if next_page_button:
            # Extract form data for next page submission
            formdata = {
                '__EVENTTARGET': 'dnn$ctr2802$DNNWebControlContainer$ctl00$Next',
                '__EVENTARGUMENT': ''
            }
            yield scrapy.FormRequest.from_response(response, formdata=formdata, callback=self.parse_post_page)
        
    def parse_post_detail(self,response):
        leader1 = response.xpath("(//div[contains(@class,'Leadership')]/div[2]/text())[1]").get()
        leader2 = response.xpath("(//div[contains(@class,'Leadership')]/div[2]/text())[2]").get()
        address = response.xpath("//div[contains(@class,'Address')]/div[2]/text()").get()
        typ = response.xpath("//div[contains(@class,'Type')]/div[2]/text()").get()

        yield {
            "post_num": response.meta['post_num'],
            "leader1": leader1,
            "leader2": leader2,
            "address": address,
            "type" : typ

        }
        

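I think Scrapy never actually reaches the next page; it just requests the base URL again. The base URL does not change at all when I press Next or when I run a new search.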

python html scrapy request screen-scraping
1 Answer (votes: 0)

When I inspected the responses, I saw that I was getting the same page over and over.
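One way to confirm this from inside the spider, rather than by eye, is to fingerprint what each results page yields. This is a minimal sketch, assuming the post names extracted in parse_post_page are enough to identify a page. The helpers page_fingerprint and is_repeat are hypothetical and not part of Scrapy. Hashing the extracted names rather than the raw body is a choice made on the assumption that the page embeds per-response state that could differ even when the listed results do not.

```python
import hashlib


def page_fingerprint(post_names):
    """Hash the extracted post names so two result pages can be compared cheaply."""
    joined = "\n".join(name.strip() for name in post_names)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def is_repeat(post_names, seen):
    """Return True if a page with exactly these post names was already recorded in `seen`."""
    fingerprint = page_fingerprint(post_names)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False


# Hypothetical usage: `seen` would live on the spider instance.
seen = set()
first = is_repeat(["POST 1", "POST 2"], seen)   # first time this page is seen
second = is_repeat(["POST 1", "POST 2"], seen)  # identical page again
```

In a spider, a True result could be logged as a warning so that a pagination loop shows up in the crawl log.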

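If we inspect the requests with BurpSuite and compare them, the difference is in the form data: the browser's request contains the value "Next" for the button field, but the form data built from the response is missing it. We just need to add it: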

import scrapy


class LegonSpider(scrapy.Spider):
    name = "legon"

    def start_requests(self):
        yield scrapy.Request(
            url="https://mylegion.org/PersonifyEbusiness/Find-a-Post",
            callback=self.parse
        )

    def parse(self, response):
        # Select distance and country
        yield scrapy.FormRequest.from_response(
            response,
            formid='aspnetForm',
            formdata={'dnn$ctr2802$DNNWebControlContainer$ctl00$DistanceList': '100',
                      '@IP_COUNTRY': 'USA',
                      '@IP_DEPARTMENT': '00000000001L'},
            callback=self.parse_post_page
        )

    def parse_post_page(self, response):
        post_elements = response.xpath("//div[@class='membership-dir-result-item']")
        for post_element in post_elements:
            post_num = post_element.xpath(".//div[contains(@class,'POST_NAME')]/text()").get().strip()
            post_link = post_element.xpath("./a/@href").get()
            yield response.follow(post_link, callback=self.parse_post_detail, meta={'post_num': post_num})

        next_page_button = response.xpath("//input[@id='dnn_ctr2802_DNNWebControlContainer_ctl00_Next']")
        if next_page_button:
            # Send the Next button's own name/value pair, as a browser does when it is clicked.
            form_data = {'dnn$ctr2802$DNNWebControlContainer$ctl00$Next': 'Next'}
            yield scrapy.FormRequest.from_response(response, formdata=form_data, callback=self.parse_post_page)

    def parse_post_detail(self, response):
        leader1 = response.xpath("(//div[contains(@class,'Leadership')]/div[2]/text())[1]").get()
        leader2 = response.xpath("(//div[contains(@class,'Leadership')]/div[2]/text())[2]").get()
        address = response.xpath("//div[contains(@class,'Address')]/div[2]/text()").get()
        typ = response.xpath("//div[contains(@class,'Type')]/div[2]/text()").get()

        yield {
            "post_num": response.meta['post_num'],
            "leader1": leader1,
            "leader2": leader2,
            "address": address,
            "type": typ
        }

Compare my form data with yours to see the difference.
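To illustrate why the button's name/value pair matters, here is a minimal standard-library sketch of how an HTML form submission is assembled: ordinary fields are always sent, while a submit button contributes its name and value only if it is the one that was clicked. The markup, the field names ctl00$Search and ctl00$Next, and the helpers FormFieldCollector and build_payload are made up for the example; this is not Scrapy's implementation and not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect <input> fields, keeping submit buttons separate from the rest."""

    def __init__(self):
        super().__init__()
        self.fields = {}   # hidden/text inputs: always part of the submission
        self.buttons = {}  # submit inputs: part of the submission only when clicked

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if not name:
            return
        value = attr.get("value") or ""
        if (attr.get("type") or "text").lower() == "submit":
            self.buttons[name] = value
        else:
            self.fields[name] = value


def build_payload(html, clicked=None):
    """Build the form payload, adding only the clicked button (if any)."""
    collector = FormFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    if clicked is not None:
        payload[clicked] = collector.buttons[clicked]
    return payload


# Made-up stand-in for the real page's form.
SAMPLE_FORM = """
<form id="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="submit" name="ctl00$Search" value="Search">
  <input type="submit" name="ctl00$Next" value="Next">
</form>
"""

without_click = build_payload(SAMPLE_FORM)
with_click = build_payload(SAMPLE_FORM, clicked="ctl00$Next")
print(urlencode(without_click))
print(urlencode(with_click))
```

The second payload is the one that corresponds to pressing Next; the server has no other way to tell which button triggered the post.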

By the way, you missed a / in the selector for next_page_button: it should start with //input, not /input.
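The single slash matters because /input only matches an input element at the very top of the document, where an HTML page has html instead, so the selector returns nothing and the if branch never runs. The sketch below shows the same one-level versus any-depth distinction with the standard library. It is an approximation: ElementTree supports only a subset of XPath and rejects absolute paths, so ./input (direct child of the root) stands in for /input, and the markup is a made-up stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the real page.
PAGE = """
<html>
  <body>
    <form id="aspnetForm">
      <input id="Next" type="submit" value="Next" />
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# One level only: looks for <input> directly under <html>, finds none.
direct_children = root.findall("./input[@id='Next']")

# Any depth: finds the button nested inside <body>/<form>.
any_depth = root.findall(".//input[@id='Next']")

print(len(direct_children), len(any_depth))
```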
