How to extract links from a webpage using Scrapy?


I am trying to extract links from webpages that follow a specific pattern. I tried doing this with Scrapy, using the following code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class MagazineCrawler(CrawlSpider):
    name = "MagazineCrawler"
    allowed_domains = ["eu-startups.com"]
    start_urls = ["https://www.eu-startups.com"]

    rules = (
        Rule(LinkExtractor(allow=["category/interviews"]), callback="parse_category"),
    )

    def parse_category(self, response):
        xpath_links = "//div[@class='td_block_inner tdb-block-inner td-fix-index']//a[@class='td-image-wrap ']/@href"
        subpage_links = response.xpath(xpath_links).extract()

        # Follow each subpage link and yield requests to crawl them
        for link in subpage_links:
            yield Request(link)

The problem is that it only extracts links from the first page matched by the pattern and then stops. If I remove the parse_category callback option, it happily crawls through all the pages whose URLs contain "category/interviews". Why does this happen?

python scrapy
1 Answer

This happens because you need to set the follow argument on the rule if you plan to use it together with a callback.

From the Scrapy documentation for the Rule class:

class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
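
In other words, adding a callback silently switches link following off. A minimal illustration of the two defaults, using the same LinkExtractor as in the question:

# callback is None -> follow defaults to True: matched pages are
# crawled and links keep being extracted from them
Rule(LinkExtractor(allow=["category/interviews"]))

# callback is set -> follow defaults to False: matched pages are passed
# to the callback once, but no further links are followed from them
Rule(LinkExtractor(allow=["category/interviews"]), callback="parse_category")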

So if you want the spider to keep following links while also running a callback for each response, you simply need to set follow=True in your spider's rule.

For example:

class MagazineCrawler(CrawlSpider):
    name = "MagazineCrawler"
    allowed_domains = ["eu-startups.com"]
    start_urls = ["https://www.eu-startups.com"]

    rules = (
        Rule(LinkExtractor(allow=["category/interviews"]),
             callback="parse_category", 
             follow=True),
    )

    def parse_category(self, response):
        xpath_links = "//div[@class='td_block_inner tdb-block-inner td-fix-index']//a[@class='td-image-wrap ']/@href"
        subpage_links = response.xpath(xpath_links).extract()

        # Follow each subpage link and yield requests to crawl them
        for link in subpage_links:
            yield Request(link)
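
As a side note, if any of the extracted hrefs ever turn out to be relative, response.follow resolves them against the current page, whereas Request needs an absolute URL. Below is a minimal sketch of that variant; the parse_article callback and its title XPath are illustrative assumptions, not part of the original answer:

    def parse_category(self, response):
        xpath_links = "//div[@class='td_block_inner tdb-block-inner td-fix-index']//a[@class='td-image-wrap ']/@href"
        # response.follow resolves relative URLs against response.url,
        # unlike scrapy.Request, which requires an absolute URL
        for link in response.xpath(xpath_links).getall():
            yield response.follow(link, callback=self.parse_article)

    def parse_article(self, response):
        # Hypothetical extraction step; the original answer does not
        # parse the article pages themselves
        yield {
            "url": response.url,
            "title": response.xpath("//h1/text()").get(),
        }

With the spider saved inside a Scrapy project, running scrapy crawl MagazineCrawler -O interviews.json (the -O flag is available in Scrapy 2.x) executes it and writes the yielded items to a JSON file.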