Scrapy spider invoked dynamically does not save any output to the desired folder

Problem description (votes: 0, answers: 1)

I want to run "___SPIDER_RUNNER.py" by pressing F5 in Visual Studio Code. Everything appears to crawl fine and the logging shows that items are being retrieved, but the output JSON file is never saved to the folder C:\scrapy\JSON_output. The folder exists and I have write permission.

I am completely stuck because no errors are logged at all.

I tried different paths in the file _singlepage_nonAJAX.py:

    'FEED_URI': 'C:/scrapy/JSON_output/test.json'
    'FEED_URI': r'C:\scrapy\JSON_output\test.json'
    'FEED_URI': f'C:\\scrapy\\JSON_output\\{self.name}.json'

I also tried removing the ITEM_PIPELINES and FEED_EXPORT_FIELDS settings from settings.py.

My folder structure is as follows:

- C:\scrapy\my_spiders\___SPIDER_RUNNER.py
- C:\scrapy\my_spiders\__init__.py
- C:\scrapy\my_spiders\spiders\__init__.py
- C:\scrapy\my_spiders\spiders\_singlepage_nonAJAX.py

None of the __init__.py files contain any code.

___SPIDER_RUNNER.py

import sys
sys.path.append('C:\\scrapy')
from scrapy.crawler import CrawlerProcess
from my_spiders.spiders._singlepage_nonAJAX import SinglePageNonAJAXSpider
import logging
logging.basicConfig(level=logging.DEBUG)

def run_spider(myname, start_urls, SERP_item, url, itemstatus, okstatus, title):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(SinglePageNonAJAXSpider, 
                  myname=myname, 
                  start_urls=start_urls, 
                  SERP_item=SERP_item, 
                  url=url, 
                  itemstatus=itemstatus, 
                  okstatus=okstatus, 
                  title=title)
    process.start()

run_spider("toscrape", 
           "https://quotes.toscrape.com",
            "//div[@class='quote']/span/a[starts-with(@href, '/author/')]",
            "./@href",
            ""''"",
            "", 
            '//span[contains(@class, "author-born-date")]/text()')

_singlepage_nonAJAX.py

import json
import re
import os
import scrapy
import time
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from lxml import html

class RentalItem(scrapy.Item):
    city = scrapy.Field()
    url = scrapy.Field() 

class SinglePageNonAJAXSpider(scrapy.Spider):
    name = 'whatever'

    def __init__(self, myname=None, start_urls=None, SERP_item=None, url=None, itemstatus=None, okstatus=None, title=None, *args, **kwargs):
        super(SinglePageNonAJAXSpider, self).__init__(*args, **kwargs)
        if myname:
            self.name = myname
        if start_urls:
            self.start_urls = [start_urls]  # Assuming only one URL
        self.SERP_item = SERP_item
        self.url = url
        self.itemstatus = itemstatus
        self.okstatus = okstatus
        self.title = title
        self.custom_settings = {
            'FEED_FORMAT': 'json',
            'FEED_URI': f'C:\\scrapy\\JSON_output\\test.json'
        }

    def parse(self, response):
        for listing in response.xpath(self.SERP_item):
            listing_url = listing.xpath(self.url).get()
            yield scrapy.Request(
                url=response.urljoin(listing_url), 
                callback=self.parse_object,
            )

    def parse_object(self, response):
        item = RentalItem()
        item['url'] = response.url # get url
        item['city'] = 'mycity'
        yield item

pipelines.py

import json

class MyCustomPipeline(object):

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):            
        self.items.append(dict(item))
        return item

middlewares.py

from scrapy import signals
from itemadapter import is_item, ItemAdapter

class MySpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class MyDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

settings.py

BOT_NAME = 'my_spiders'

SPIDER_MODULES = ['my_spiders.spiders']
NEWSPIDER_MODULE = 'my_spiders.spiders'

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
    'scrapy_selenium.SeleniumMiddleware': 800
} 

from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['--headless']


#Configure item pipelines. See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'my_spiders.pipelines.MyCustomPipeline': 300,
}

FEED_EXPORT_FIELDS = [
'id', 'url', 'city', 'title'
]

SPLASH_URL = 'http://localhost:8050/' 
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Tags: python, scrapy

1 Answer (0 votes)
  1. Remove all of your previous FEEDS-related settings.
  2. Configure the feed either in settings.py or in custom_settings.

For settings.py:

FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
            'export_empty_fields': True,
        },
    },
}
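A side note on the feed location: the keys of the FEEDS dictionary are feed URIs, so once the relative items.json works, an absolute Windows path should work there too. A minimal sketch, assuming a recent Scrapy version (which treats drive-letter paths as local files) and the output folder from the question; the %(name)s placeholder is standard Scrapy feed-URI syntax that expands to the spider's name at export time:

FEEDS = {
    # Illustrative path; adapt to your environment.
    r'C:\scrapy\JSON_output\%(name)s.json': {
        'format': 'json',
        'encoding': 'utf8',
        'indent': 4,
    },
}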

For custom_settings:

class SinglePageNonAJAXSpider(scrapy.Spider):
    name = 'whatever'
    custom_settings = {
        'FEEDS': {
            'items.json': {
                'format': 'json',
                'encoding': 'utf8',
                'fields': None,
                'indent': 4,
                'item_export_kwargs': {
                    'export_empty_fields': True,
                },
            },
        },
    }

    def __init__...
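The class-level placement matters: Scrapy merges custom_settings from the spider class before the spider instance is created, so assigning self.custom_settings inside __init__, as the question does, happens too late to take effect. If the feed location must stay dynamic per run, an alternative sketch, assuming the runner from the question (the path below is illustrative), is to pass the feed settings to CrawlerProcess instead:

from scrapy.crawler import CrawlerProcess

# Feed settings supplied at process level are merged into the crawler's
# settings, so no custom_settings is needed on the spider itself.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEEDS': {
        r'C:\scrapy\JSON_output\test.json': {'format': 'json', 'encoding': 'utf8'},
    },
})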
  1. Run the spider; it should return some items.
  2. Let it finish, or wait for it to complete.
  3. Check whether your data was written to the items.json file in the root scrapy folder.
  4. If that works, try substituting your path ('C:\scrapy\JSON_output\test.json') for 'items.json'.

That should help. If it does not, check the settings the spider actually received via self.crawler.settings (note that self.crawler is attached to the spider right after __init__ runs, so inspect it from start_requests or a signal handler rather than inside __init__ itself).
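A minimal sketch of that check, assuming you only want to log the resolved feed settings:

import scrapy

class SinglePageNonAJAXSpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        # self.crawler.settings holds the fully resolved settings,
        # including anything merged in from custom_settings
        self.logger.info('FEEDS: %s', self.crawler.settings.get('FEEDS'))
        yield from super().start_requests()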

More about these settings: https://docs.scrapy.org/en/latest/topics/feed-exports.html?highlight=FEED#settings
