Scrapy Request - my own callback function is not being called


I want to request a page at regular intervals to check whether its content has been updated, but my own callback function is never triggered. My allowed_domains and request URL are:

allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'

The parsing code is:

    # Crawl all data first on each start
    def parse(self, response):
        Total_records = int(re.findall(r"\d+", response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
        dict = {}
        is_Latest = True
        global Latest_info
        global previous_hash

        for i in range(1, Total_records + 1):
            content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()

            # Use the group function to group the list by key
            result = list(group(content, self.keys))
            Time = dict['Time'] = result[0].get(self.keys[0])
            Code = dict['Code'] = result[1].get(self.keys[1])
            dict['Name'] = result[2].get(self.keys[2])
            if is_Latest:
                Latest_info = str(Time) + " | " + str(Code)
                is_Latest = False

            yield dict

        previous_hash = get_hash(Latest_info.encode('utf-8'))
        #Monitor data updates and crawl for new data
        while True:
            time.sleep(10)
            # Request website content and calculate hash values
            yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)

My callback function is:

    def parse_check(self, response):
        global previous_hash
        global Latest_info
        dict = {}
        content = response.xpath("//table/tbody/tr[1]//text()").extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time = result[0].get(self.keys[0])
        Code = result[1].get(self.keys[1])

        current_info = str(Time) + " | " + str(Code)
        current_hash = get_hash(current_info.encode('utf-8'))

        # Compare hash values to determine if website content is updated
        if current_hash != previous_hash:

            dict['Time'] = Time
            dict['Code'] = Code
            dict['Name'] = result[2].get(self.keys[2])

            previous_hash = current_hash
            Latest_info = current_info
            # Yield only when the content has actually changed
            yield dict

I tried adding an errback, but it printed nothing. I then requested the page with requests.get instead of yielding a scrapy.Request and that worked, but I still don't understand why my callback is never called.

python-3.x callback scrapy web-crawler
1 Answer

I figured out why, or at least this fix worked for me: avoid using time.sleep in Scrapy. It blocks the Twisted reactor (the framework underlying Scrapy), which stalls the entire spider and disables all of Scrapy's concurrency. In the code above, the requests yielded inside the while True loop are queued, but parse never hands control back to the reactor, so they are never downloaded and parse_check is never called. Use the DOWNLOAD_DELAY setting or the AutoThrottle extension instead.
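A minimal sketch of the non-blocking pattern (the spider name "check" and the re-yield structure are illustrative, not from the original question): the polling interval comes from DOWNLOAD_DELAY or AutoThrottle, and the next check is scheduled by yielding a new request from the callback itself instead of sleeping:

# settings.py -- illustrative values
DOWNLOAD_DELAY = 10               # fixed ~10 s gap between requests
# AUTOTHROTTLE_ENABLED = True     # or let Scrapy adapt the delay itself
# AUTOTHROTTLE_START_DELAY = 10

import scrapy

class CheckSpider(scrapy.Spider):
    name = "check"  # placeholder name
    allowed_domains = ['www1.hkexnews.hk']
    start_url = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'

    def start_requests(self):
        yield scrapy.Request(self.start_url, callback=self.parse_check, dont_filter=True)

    def parse_check(self, response):
        # ... compare hashes and yield the item exactly as in the question ...
        # Re-schedule the next poll instead of sleeping: the generator returns
        # control to the reactor, and DOWNLOAD_DELAY spaces out the requests.
        yield scrapy.Request(self.start_url, callback=self.parse_check, dont_filter=True)

Because the generator returns after each yield, the reactor stays free to download the scheduled request, so parse_check actually fires on every poll.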
