I'm scraping data from Yelp with Scrapy. I don't see any errors, but no data is being scraped from the start URLs listed in the spider

Votes: 0 · Answers: 1

The code for items.py and the other project files is shown below, and the crawl log is included at the end. I don't get any errors, but according to the log Scrapy isn't scraping any pages.

```
import scrapy


class YelpItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    name_url = scrapy.Field()
    rating = scrapy.Field()
    date = scrapy.Field()
    review_text = scrapy.Field()
    user_pic = scrapy.Field()
    city = scrapy.Field()
    is_true = scrapy.Field()
```
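For reference, a scrapy.Item behaves like a dict restricted to the declared fields: assigning a key that was not declared raises a KeyError, which is a quick sanity check when populating items in the spider below (a minimal sketch):

```
item = YelpItem()
item['name'] = 'Some Business'  # fine: 'name' is declared above
item['city'] = 'Seattle, WA'

item['extra'] = 1  # raises KeyError: 'YelpItem does not support field: extra'
```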

settings.py code:

```
import pathlib
BOT_NAME = 'yelp-scrapy-dev'

SPIDER_MODULES = ['yelp-scrapy-dev.spiders']
NEWSPIDER_MODULE = 'yelp-scrapy-dev.spiders'

# note: this dict is a bare expression and is never assigned to any setting
{
    pathlib.Path('output1.csv'): {
        'format': 'csv',
    },
}
ROBOTSTXT_OBEY = False
```
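The unassigned dict above is a no-op, which matches the log: the `Overridden settings` entry lists only BOT_NAME, NEWSPIDER_MODULE, and SPIDER_MODULES. If the intent was Scrapy's feed exports (the FEEDS setting, available since Scrapy 2.1, which the log shows is in use), the dict would need to be bound to the setting name -- a sketch:

```
# settings.py -- assuming the dict was meant to be the FEEDS feed-export setting
FEEDS = {
    pathlib.Path('output1.csv'): {
        'format': 'csv',
    },
}
```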

pipelines.py code:

```
class YelpPipeline:
    def open_spider(self, spider):
        self.file = open('output1.csv', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        return item
```
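Note that the log below reports `Enabled item pipelines: []`, so this pipeline is never activated. Enabling it would require an ITEM_PIPELINES entry in settings.py -- a sketch, with the dotted path assumed from the project name above:

```
# settings.py -- the exact dotted path depends on the project package name
ITEM_PIPELINES = {
    'yelp-scrapy-dev.pipelines.YelpPipeline': 300,
}
```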

middlewares.py code:

```
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class YelpSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class YelpDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```

Code for the city spider, which collects reviews from the specified URLs:

```
import scrapy
from ..items import YelpItem

# currently will grab the first 100 reviews from the first 10 businesses from start url
class CitySpider(scrapy.Spider):

    name = 'city'
    start_urls = [
        'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA',
        'https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA',
        'https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY',
        'https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX',
        'https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA',
    ]

    # gets the first 10 businesses from the start url
    def parse(self, response):
        
        business_pages = response.css('.text-weight--bold__373c0__1elNz a')
        yield from response.follow_all(business_pages, self.parse_business)

    # extracts the first 100 reviews from the yelp-scrapy-dev business
    def parse_business(self, response):

        # note: a single YelpItem instance is reused for every review yielded below
        items = YelpItem()
        all_reviews = response.css('.sidebarActionsHoverTarget__373c0__2kfhE')
        
        address = response.request.url.split('?')
        src = address[0].split('/')
        biz = src[-1].split('-')
        loc = biz[-1] if not biz[-1].isdigit() else biz[-2]
        if loc == 'seattle':
            city = 'Seattle, WA'
        elif loc == 'dallas':
            city = 'Dallas, TX'
        elif loc == 'francisco':
            city = 'San Francisco, CA'
        elif loc == 'york':
            city = 'New York, NY'
        elif loc == 'atlanta':
            city = 'Atlanta, GA'
        else:
            city = 'outofrange'

        for review in all_reviews:
            name = review.css('.link-size--inherit__373c0__1VFlE::text').extract_first()
            name_url = review.css('.link-size--inherit__373c0__1VFlE::attr(href)').extract_first().split('=')
            rating = review.css('.overflow--hidden__373c0__2y4YK::attr(aria-label)').extract()
            date = review.css('.arrange-unit-fill__373c0__3Sfw1 .text-color--mid__373c0__jCeOG::text').extract()
            review_text = review.css('.raw__373c0__3rKqk::text').extract()
            user_pic = review.css('.gutter-1__373c0__2l5bx .photo-box-img__373c0__35y5v::attr(src)').extract()

            if city != 'outofrange':
                # making sure data is stored as a str
                items['name'] = name
                items['name_url'] = name_url[1]
                items['rating'] = rating[0]
                items['date'] = date[0]
                items['review_text'] = review_text[0]
                items['user_pic'] = user_pic[0] != 'https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_styleguide/514f6997a318/assets/img/default_avatars/user_60_square.png'
                items['city'] = city
                items['is_true'] = True

                yield items

        source = response.request.url

        # prevent duplicate secondary pages from being recrawled
        if '?start=' not in source:
            # gets 20th-100th reviews, pages are every 20 reviews
            for i in range(1, 5):
                next_page = source + '?start=' + str(i*20)
                yield response.follow(next_page, callback=self.parse_business)
```
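The log below shows the five search pages being crawled (HTTP 200) but no business pages being followed, which suggests the first selector in parse() matches nothing -- Yelp's hashed class suffixes such as `__373c0__1elNz` change frequently, and the search results may also be rendered by JavaScript. A quick way to check, using scrapy shell (a sketch; the selector is the one from parse() above):

```
$ scrapy shell 'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA'
>>> business_pages = response.css('.text-weight--bold__373c0__1elNz a')
>>> len(business_pages)  # if this prints 0, parse() yields nothing and the crawl ends quietly
```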

The log lines are below:

```
(venv) C:\Users\somar\yelp-scrapy\yelp>scrapy crawl city
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: yelp-scrapy-dev)
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-09 22:34:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-09 22:34:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'yelp-scrapy-dev',
 'NEWSPIDER_MODULE': 'yelp-scrapy-dev.spiders',
 'SPIDER_MODULES': ['yelp-scrapy-dev.spiders']}
2020-10-09 22:34:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1f95c571b9245c42
2020-10-09 22:34:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-09 22:34:54 [scrapy.core.engine] INFO: Spider opened
2020-10-09 22:34:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-09 22:34:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-09 22:34:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1264,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 278234,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 5,
 'elapsed_time_seconds': 2.159687,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 10, 10, 5, 34, 56, 173193),
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'response_received_count': 5,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2020, 10, 10, 5, 34, 54, 13506)}
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Spider closed (finished)
```
Tags: web-scraping, scrapy, scrapy-pipeline, scrapinghub
1 Answer

0 votes

This is happening because Yelp's robots.txt file disallows web crawlers from accessing the specific URLs you are trying to scrape.

Like many websites, Yelp uses robots.txt to tell web crawlers which parts of the site are off-limits to indexing and scraping. When a site's robots.txt disallows a particular URL or directory, crawlers such as Scrapy will usually honor those rules and not fetch the disallowed content.

In your case, Yelp appears to disallow access to the pages you are trying to scrape. You can inspect the robots.txt file at the following link:

https://www.yelp.com/robots.txt
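A quick way to verify whether a given URL is disallowed, using Python's standard-library robotparser (a sketch; the '*' user agent is an assumption, since Yelp's rules may differ per agent):

```
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.yelp.com/robots.txt')
rp.read()  # fetch and parse Yelp's robots.txt

url = 'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA'
print(rp.can_fetch('*', url))  # False means the URL is disallowed for generic crawlers
```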
