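I want to request a page at regular intervals to check whether its content has been updated, but my own callback function is never triggered. My allowed_domains and request URL are: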
allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'
The parsing code is:
# Crawl all data first at each start
def parse(self, response):
    Total_records = int(re.findall("\d+",response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
    dict = {}
    is_Latest = True
    global Latest_info
    global previous_hash
    for i in range(1, Total_records + 1):
        content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time = dict['Time'] = result[0].get(self.keys[0])
        Code = dict['Code'] = result[1].get(self.keys[1])
        dict['Name'] = result[2].get(self.keys[2])
        if is_Latest:
            Latest_info = str(Time) + " | " + str(Code)
            is_Latest = False
        yield dict
    previous_hash = get_hash(Latest_info.encode('utf-8'))
    # Monitor data updates and crawl for new data
    while True:
        time.sleep(10)
        # Request website content and calculate hash values
        yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)
My own callback function is:
def parse_check(self, response):
    global previous_hash
    global Latest_info
    dict = {}
    content = response.xpath("//table/tbody/tr[1]//text()").extract()
    # Use the group function to group the list by key
    result = list(group(content, self.keys))
    Time = result[0].get(self.keys[0])
    Code = result[1].get(self.keys[1])
    current_info = str(Time) + " | " + str(Code)
    current_hash = get_hash(current_info.encode('utf-8'))
    # Compare hash values to determine if website content is updated
    if current_hash != previous_hash:
        dict['Time'] = Time
        dict['Code'] = Code
        dict['Name'] = result[2].get(self.keys[2])
        previous_hash = current_hash
        Latest_info = current_info
        yield dict
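I tried adding an errback and printing from it, but it produced no output. I then tried fetching the page with requests.get instead of yield scrapy.Request, and that worked, but I still don't understand why my callback isn't being called.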
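I think I know why; at least this is what fixed it for me: avoid using time.sleep in Scrapy. It blocks the Twisted reactor (Twisted is the framework Scrapy is built on), which stalls the whole spider and stops all of Scrapy's concurrency, so the scheduled request is never downloaded and its callback never runs. Use the DOWNLOAD_DELAY setting or the AutoThrottle extension instead.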
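As a rough sketch of what that could look like, the fragment below sets these options in a project's settings.py. The setting names are Scrapy's own; the values are assumptions to tune (10 seconds to mirror the sleep in the question, a 60-second ceiling, one request at a time).

```python
# settings.py (fragment) -- values are illustrative assumptions, not recommendations

# Minimum wait, in seconds, between consecutive requests to the same site.
# This takes over the role of time.sleep(10) without blocking the reactor.
DOWNLOAD_DELAY = 10

# Keep polling strictly sequential for this domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Optional: let AutoThrottle adapt the delay to the server's response times.
# It never goes below DOWNLOAD_DELAY and never above AUTOTHROTTLE_MAX_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With something like this in place, the while True / time.sleep(10) loop in parse can be dropped: parse_check can itself yield the next scrapy.Request(..., callback=self.parse_check, dont_filter=True), and Scrapy's downloader spaces the requests out. Note that by default Scrapy randomizes the wait to between 0.5 and 1.5 times DOWNLOAD_DELAY. The same keys can also be placed in the spider's custom_settings dict if they should apply to this spider only.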