DEBUG: Rule at line 3 without any user agent to enforce it on (Python Scrapy)

Question (0 votes, 2 answers)

I am trying to scrape content from a website using Scrapy's CrawlSpider class, but I am blocked by the response below. I assume the error above is related to my crawler's User-Agent, so I added a custom user-agent middleware, but the response is still the same. I would appreciate any suggestions on how to resolve this.

I have not considered using Splash, because the content and links to be scraped are not rendered with JavaScript.

My Scrapy spider class:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from datetime import datetime
import arrow
import re
import pandas as pd

class GreyhoundSpider(CrawlSpider):
    name = 'greyhound'
    allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/']
    start_urls = ['https://thegreyhoundrecorder.com.au/form-guides//']
    base_url =  'https://thegreyhoundrecorder.com.au'

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tbody/tr/td[2]/a"), callback='parse_item', follow=True), #//tbody/tr/td[2]/a
    )

    def clean_date(self, dm):
        year = pd.to_datetime('now').year     # Get current year
        race_date = pd.to_datetime(dm + ' ' + str(year)).strftime('%d/%m/%Y')
        return race_date

    def parse_item(self, response):
        #Field =  response.xpath ("//ul/li[1][@class='nav-item']/a/text()").extract_first() #all fileds
        for race in response.xpath("//div[@class= 'fieldsSingleRace']"):
            title = ''.join(race.xpath(".//div/h1[@class='title']/text()").extract_first())
            Track = title.split('-')[0].strip()
            date = title.split('-')[1].strip()
            final_date = self.clean_date(date)
            race_number = ''.join(race.xpath(".//tr[@id = 'tableHeader']/td[1]/text()").extract())
            num = list(race_number)
            final_race_number = "".join(num[::len(num)-1] )
            Distance = race.xpath("//tr[@id = 'tableHeader']/td[3]/text()").extract()
            TGR_Grade = race.xpath("//tr[@id = 'tableHeader']/td[4]/text()").extract()
        TGR1 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[1]/text()").extract()
        TGR2 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[2]/text()").extract()
        TGR3 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[3]/text()").extract()
        TGR4 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[4]/text()").extract()
        
        yield {
                'Track': Track,
                'Date': final_date,
                '#': final_race_number,
                'Distance': Distance,
                'TGR_Grade': TGR_Grade,
                'TGR1': TGR1,
                'TGR2': TGR2,
                'TGR3': TGR3,
                'TGR4': TGR4,
                'user-agent': response.request.headers.get('User-Agent').decode('utf-8')
              }

My custom middleware class:

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    user_agents_list = [
    
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko)',
    'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'

    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agents_list)
            request.headers.setdefault('User-Agent', self.user_agent)
            
        except IndexError:
            logging.error("Couldn't fetch the user agent")

I also updated DOWNLOADER_MIDDLEWARES to point to my custom middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'greyhound_recorder_website.middlewares.UserAgentRotatorMiddleware': 400,
    
}

AutoThrottle setting:

AUTOTHROTTLE_ENABLED = True
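
AutoThrottle can be tuned further in settings.py; the following is a minimal sketch with illustrative values (these were not part of the original question):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # maximum delay when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests Scrapy aims for
DOWNLOAD_DELAY = 1                      # baseline delay between requests to the same site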

Here is the website's robots.txt:

User-agent: bingbot
Crawl-delay: 10

User-agent: SemrushBot
Disallow: /

User-agent: SemrushBot-SA
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: *
Disallow: /wp-admin/

Spider output in the terminal:

2021-09-24 11:52:06 [scrapy.core.engine] INFO: Spider opened
2021-09-24 11:52:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-24 11:52:06 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\8470p\Desktop\web-scraping\greyhound_recorder_website\.scrapy\httpcache
2021-09-24 11:52:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-09-24 11:52:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thegreyhoundrecorder.com.au/robots.txt> (referer: None) ['cached']
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 3 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
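
The protego DEBUG lines above are logged while Scrapy's RobotsTxtMiddleware parses robots.txt; on their own they do not indicate that a request was refused. As a hypothetical check (not part of the original project), the file can be parsed with protego directly to confirm that the form-guides URL is not disallowed for a generic user agent:

from protego import Protego

robots_txt = """
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /wp-admin/
"""

rp = Protego.parse(robots_txt)
# Only /wp-admin/ is disallowed for the wildcard agent, so this should print True.
print(rp.can_fetch("https://thegreyhoundrecorder.com.au/form-guides/", "Mozilla/5.0"))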
Tags: python, python-requests, scrapy
2 Answers

Answer 1 (2 votes):

The main obstacle is allowed_domains. You have to be careful with it, otherwise the CrawlSpider will not produce the desired output. Another likely cause is the double slash // at the end of your start_urls entry; use a single / instead.

Rather than

allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/']

you should use only the domain name:

allowed_domains = ['thegreyhoundrecorder.com.au']

Finally, it is always better practice to add a real user agent in your settings.py file and to set

ROBOTSTXT_OBEY = False
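
Putting those suggestions together, a minimal sketch of the corrected spider attributes and settings could look like this (the user agent string is only an example):

# spider
from scrapy.spiders import CrawlSpider

class GreyhoundSpider(CrawlSpider):
    name = 'greyhound'
    allowed_domains = ['thegreyhoundrecorder.com.au']                  # domain only, no path
    start_urls = ['https://thegreyhoundrecorder.com.au/form-guides/']  # single trailing slash

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
ROBOTSTXT_OBEY = False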


Answer 2 (0 votes):

I had a similar issue. I was able to fix it by adding these lines to my settings.py file:

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.888'

ROBOTSTXT_OBEY = False
