The pipeline's process_item is not called


I am new to Python and Scrapy.

After the crawl finishes, I am trying to save the scraped items to an SQLite database, following this source: https://github.com/sunshineatnoon/Scrapy-Amazon-Sqlite

My problem is that the database is created successfully, but no items get inserted into it, because process_item is never called.

EDIT

Below I have pasted the source code from the GitHub link above.

settings.py (the ITEM_PIPELINES setting):

ITEM_PIPELINES = {
    'amazon.pipelines.AmazonPipeline': 300,
}

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3
import os
con = None


class AmazonPipeline(object):

    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def process_item(self, item, spider):
        print('---------------process item----')
        self.storeInDb(item)
        return item

    def setupDBCon(self):
        self.con = sqlite3.connect(os.getcwd() + '/test.db')
        self.cur = self.con.cursor()

    def createTables(self):
        self.dropAmazonTable()
        self.createAmazonTable()

    def dropAmazonTable(self):
        #drop amazon table if it exists
        self.cur.execute("DROP TABLE IF EXISTS Amazon")

    def closeDB(self):
        self.con.close()

    def __del__(self):
        self.closeDB()

    def createAmazonTable(self):
        self.cur.execute("CREATE TABLE IF NOT EXISTS Amazon(id INTEGER PRIMARY KEY NOT NULL, \
            name TEXT, \
            path TEXT, \
            source TEXT \
            )")

        self.cur.execute("INSERT INTO Amazon(name, path, source ) VALUES( 'Name1', 'Path1', 'Source1')")
        print ('------------------------')
        self.con.commit()      


    def storeInDb(self,item):
        # self.cur.execute("INSERT INTO Amazon(\
        #     name, \
        #     path, \
        #     source \
        #     ) \
        # VALUES( ?, ?, ?)", \
        # ( \
        #     item.get('Name',''),
        #     item.get('Path',''),
        #     item.get('Source','')
        # ))

        self.cur.execute("INSERT INTO Amazon(name, path, source ) VALUES( 'Name1', 'Path1', 'Source1')")

        print ('------------------------')
        print ('Data Stored in Database')
        print ('------------------------')
        self.con.commit()  

spiders/amazonspider.py

import scrapy
import urllib.request  # on Python 3, urlretrieve lives in urllib.request
from amazon.items import AmazonItem
import os


class amazonSpider(scrapy.Spider):
    imgcount = 1
    name = "amazon"
    allowed_domains = ["amazon.com"]
    '''
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=backpack",
                  "http://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Abackpack&page=2&keywords=backpack&ie=UTF8&qid=1442907452&spIA=B00YCRMZXW,B010HWLMMA"
                  ]
    '''
    def start_requests(self):
        yield scrapy.Request("http://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0",self.parse)

        for i in range(2,3):
            yield scrapy.Request("http://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page="+str(i)+"&bbn=10445813011&ie=UTF8&qid=1442910987",self.parse)



    def parse(self,response):
        #namelist = response.xpath('//a[@class="a-link-normal s-access-detail-page  a-text-normal"]/@title').extract()
        #htmllist = response.xpath('//a[@class="a-link-normal s-access-detail-page  a-text-normal"]/@href').extract()
        #imglist = response.xpath('//a[@class="a-link-normal a-text-normal"]/img/@src').extract()
        namelist = response.xpath('//a[@class="a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal"]/@title').extract()
        htmllist = response.xpath('//a[@class="a-link-normal s-access-detail-page s-overflow-ellipsis a-text-normal"]/@href').extract()
        imglist = response.xpath('//img[@class="s-access-image cfMarker"]/@src').extract()
        listlength = len(namelist)

        pwd = os.getcwd()+'/'

        if not os.path.isdir(pwd+'crawlImages/'):
            os.mkdir(pwd+'crawlImages/')

        for i in range(0,listlength):
            item = AmazonItem()
            item['Name'] = namelist[i]
            item['Source'] = htmllist[i]

            urllib.request.urlretrieve(imglist[i], pwd+"crawlImages/"+str(amazonSpider.imgcount)+".jpg")
            item['Path'] = pwd+"crawlImages/"+str(amazonSpider.imgcount)+".jpg"
            amazonSpider.imgcount = amazonSpider.imgcount + 1
            yield item

Result after running scrapy crawl amazon: test.db gets created, but no items are inserted (I checked my SQLite test.db), which means process_item never runs.
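A quick way to verify this (a minimal check against the Amazon table created in pipelines.py above):

import sqlite3

# Count the rows in the Amazon table; only the hardcoded 'Name1' row inserted
# by createAmazonTable() is ever there, never any scraped items.
con = sqlite3.connect('test.db')
print(con.execute("SELECT COUNT(*) FROM Amazon").fetchone()[0])
con.close()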

Crawl output:

2018-09-18 16:38:38 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: amazon)
2018-09-18 16:38:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-18 16:38:38 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'amazon', 'NEWSPIDER_MODULE': 'amazon.spiders', 'SPIDER_MODULES': ['amazon.spiders']}
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
------------------------
2018-09-18 16:38:38 [scrapy.middleware] INFO: Enabled item pipelines:
['amazon.pipelines.AmazonPipeline']
2018-09-18 16:38:38 [scrapy.core.engine] INFO: Spider opened
2018-09-18 16:38:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-18 16:38:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-18 16:38:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987> from <GET http://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987>
2018-09-18 16:38:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0> from <GET http://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0>
2018-09-18 16:38:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011> from <GET https://www.amazon.com/s/ref=sr_ex_n_3?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&bbn=10445813011&ie=UTF8&qid=1442910853&ajr=0>
2018-09-18 16:38:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Backpacks-Luggage-Travel-Gear/s?ie=UTF8&page=2&rh=n%3A360832011> from <GET https://www.amazon.com/s/ref=lp_360832011_pg_2?rh=n%3A7141123011%2Cn%3A10445813011%2Cn%3A9479199011%2Cn%3A360832011&page=2&bbn=10445813011&ie=UTF8&qid=1442910987>
2018-09-18 16:38:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Backpacks-Luggage-Travel-Gear/s?ie=UTF8&page=2&rh=n%3A360832011> (referer: None)
2018-09-18 16:38:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011> (referer: None)
2018-09-18 16:38:41 [scrapy.core.engine] INFO: Closing spider (finished)
2018-09-18 16:38:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1909,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 140740,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 4,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 9, 18, 9, 38, 41, 53948),
 'log_count/DEBUG': 7,
 'log_count/INFO': 7,
 'memusage/max': 52600832,
 'memusage/startup': 52600832,
 'response_received_count': 2,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2018, 9, 18, 9, 38, 38, 677280)}
2018-09-18 16:38:41 [scrapy.core.engine] INFO: Spider closed (finished)

I have been searching around, but with no luck.

Thanks

python sqlite scrapy pipeline
1 Answer

If I crawl

https://www.amazon.com/backpacks/b?ie=UTF8&node=360832011

I do not get any results in namelist or htmllist; only imglist gets filled. Check the HTML code:

... <a class="a-link-normal s-access-detail-page s-overflow-ellipsis s-color-twister-title-link a-text-normal" ...

There is an extra class, "s-color-twister-title-link", so your exact class match in the XPath is incorrect. Because namelist stays empty, the loop in parse() never yields an item, and Scrapy only calls process_item for items the spider yields. You can add s-color-twister-title-link to the class string:

In [9]: response.xpath('//a[@class="a-link-normal s-access-detail-page s-overflow-ellipsis s-
   ...: color-twister-title-link a-text-normal"]/@title').extract()
Out[9]: 
['Anime Anti-theft Backpack, Luminous School Bag, Waterproof Laptop Backpack with USB Charging Port, Unisex 15.6 Inch College Daypack, Starry',
 'Anime Luminous Backpack Noctilucent School Bags Daypack USB chargeing Port Laptop Bag Handbag for Boys Girls Men Women',

Or you can use a shorter, more robust one, such as:

response.xpath('//a[contains(@class,"s-access-detail-page")]/@title').extract()
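Applied to your spider, the same contains() approach works for all three lists. A minimal sketch of the extraction in parse() under that assumption (Amazon's class names may of course change again):

# match on one stable class instead of the full class string
namelist = response.xpath('//a[contains(@class, "s-access-detail-page")]/@title').extract()
htmllist = response.xpath('//a[contains(@class, "s-access-detail-page")]/@href').extract()
imglist = response.xpath('//img[contains(@class, "s-access-image")]/@src').extract()

As soon as these selectors match, parse() starts yielding items and process_item is called once per item.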