I'm trying to scrape content from a website using Scrapy's CrawlSpider class, but I'm being blocked with the response below. I guessed the error was related to my crawler's User-Agent, so I added a custom user-agent middleware, but the response is still the same. I'd appreciate your help and any suggestions on how to solve this.
I haven't considered using Splash, because the content and links to be scraped don't depend on JavaScript.
My Scrapy spider class:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import pandas as pd


class GreyhoundSpider(CrawlSpider):
    name = 'greyhound'
    allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/']
    start_urls = ['https://thegreyhoundrecorder.com.au/form-guides//']
    base_url = 'https://thegreyhoundrecorder.com.au'

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tbody/tr/td[2]/a"), callback='parse_item', follow=True),
    )

    def clean_date(self, dm):
        year = pd.to_datetime('now').year  # current year
        race_date = pd.to_datetime(dm + ' ' + str(year)).strftime('%d/%m/%Y')
        return race_date

    def parse_item(self, response):
        for race in response.xpath("//div[@class= 'fieldsSingleRace']"):
            title = ''.join(race.xpath(".//div/h1[@class='title']/text()").extract_first())
            Track = title.split('-')[0].strip()
            date = title.split('-')[1].strip()
            final_date = self.clean_date(date)
            race_number = ''.join(race.xpath(".//tr[@id = 'tableHeader']/td[1]/text()").extract())
            num = list(race_number)
            final_race_number = "".join(num[::len(num) - 1])  # first and last characters
            Distance = race.xpath("//tr[@id = 'tableHeader']/td[3]/text()").extract()
            TGR_Grade = race.xpath("//tr[@id = 'tableHeader']/td[4]/text()").extract()
            TGR1 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[1]/text()").extract()
            TGR2 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[2]/text()").extract()
            TGR3 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[3]/text()").extract()
            TGR4 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[4]/text()").extract()
            yield {
                'Track': Track,
                'Date': final_date,
                '#': final_race_number,
                'Distance': Distance,
                'TGR_Grade': TGR_Grade,
                'TGR1': TGR1,
                'TGR2': TGR2,
                'TGR3': TGR3,
                'TGR4': TGR4,
                'user-agent': response.request.headers.get('User-Agent').decode('utf-8'),
            }
My custom middleware class:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
import logging


class UserAgentRotatorMiddleware(UserAgentMiddleware):
    user_agents_list = [
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0',
        'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Safari/601.3.9',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            # Pick a random user agent for each outgoing request.
            self.user_agent = random.choice(self.user_agents_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
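The rotation logic itself can be sanity-checked outside Scrapy. This standalone sketch (a plain dict standing in for `request.headers`, not Scrapy's actual `Headers` object) shows that `setdefault` assigns a User-Agent only when none is present yet, which is why the rotator works only if no earlier middleware has already set the header:

```python
import random

user_agents_list = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
]

headers = {}  # stands in for request.headers
headers.setdefault('User-Agent', random.choice(user_agents_list))
assert headers['User-Agent'] in user_agents_list

# setdefault never overwrites an existing header, so a second call is a no-op:
headers.setdefault('User-Agent', 'something-else')
assert headers['User-Agent'] != 'something-else'
```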
I also changed DOWNLOADER_MIDDLEWARES to point to my custom middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'greyhound_recorder_website.middlewares.UserAgentRotatorMiddleware': 400,
}
And enabled AutoThrottle:
AUTOTHROTTLE_ENABLED = True
Here is the website's robots.txt:
User-agent: bingbot
Crawl-delay: 10
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: *
Disallow: /wp-admin/
The spider's output in the terminal:
2021-09-24 11:52:06 [scrapy.core.engine] INFO: Spider opened
2021-09-24 11:52:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-24 11:52:06 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\8470p\Desktop\web-scraping\greyhound_recorder_website\.scrapy\httpcache
2021-09-24 11:52:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-09-24 11:52:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thegreyhoundrecorder.com.au/robots.txt> (referer: None) ['cached']
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 3 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
The main obstacle is allowed_domains. You have to be careful with it, otherwise the CrawlSpider can't yield the desired output. The trailing // at the end of your start_url may be another cause, so you should use a single / there. And instead of allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/'], you may only use the domain name, like this:
allowed_domains = ['thegreyhoundrecorder.com.au']
Finally, you can add a real user agent in your settings.py file, and it's always better practice to set
ROBOTSTXT_OBEY = False
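To see why the path component breaks the crawl, here is a rough standalone sketch of the offsite check (pure stdlib; it mimics Scrapy's hostname comparison for illustration, it is not the actual OffsiteMiddleware code, and the race URL is a made-up example):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Simplified stand-in for Scrapy's offsite filter: a link passes only
    if its hostname equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

url = 'https://thegreyhoundrecorder.com.au/form-guides/melbourne/'  # hypothetical race page

# A hostname can never equal an entry that contains a path,
# so every extracted link is treated as offsite and dropped.
print(is_offsite(url, ['thegreyhoundrecorder.com.au/form-guides/']))  # True -> filtered out

# With the bare domain, the same link passes the filter.
print(is_offsite(url, ['thegreyhoundrecorder.com.au']))  # False -> crawled
```

Once the filter passes, the LinkExtractor's links are actually requested and parse_item gets called.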
I had a similar problem. I was able to fix it by adding these lines to my settings.py file:
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.888'
ROBOTSTXT_OBEY = False