Exporting scraped data to a CSV file

Problem description

I'm trying to get data from a website that requires me to follow 2 URLs before the data can be scraped.

The goal is to get an export file that looks like this:

My code is as follows:

import scrapy
from scrapy.item import Item, Field
from scrapy import Request

class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        #Extracts first link
        items = []

        list1 = response.css("").extract() #extract all info from here

        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parseInfo1, dont_filter=True)
            request.meta['item'] = items
            yield request

        yield items

    def parseInfo1(self, response):
        #Extracts second link
        item = myItems()
        items = response.meta['item']

        list1 = response.css("").extract()
        for i in list1:
            link1 = '' + str(i)
            request = Request(link1, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            items.append(item)
            return request

    def parseInfo2(self, response):
        #Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items

I ran the spider from the terminal with the command:

scrapy crawl techbot
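
For the CSV file itself, no extra spider code should be needed: Scrapy's built-in feed exporter writes every scraped item to the file named on the command line, e.g.

scrapy crawl techbot -o output.csv

The column order of the export can be controlled with the FEED_EXPORT_FIELDS setting.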

The data I get comes out wrong, with gaps like this:

For example, it scrapes the first set of data multiple times, and the rest of the data is out of order.

If anyone could point me in the right direction to get the results in a cleaner format, as shown at the beginning, it would be much appreciated.

Thanks

python csv web-scraping scrapy
1 Answer

Solved it by consolidating the following of both links into one function instead of two. My spider is now working as follows:

import scrapy
from scrapy import Request

# myItems is the Item class defined in the question.

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        # Extracts both sets of links from the same callback
        items = []

        list1 = response.css("").extract()
        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parse, dont_filter=True)
            request.meta['item'] = items
            yield request

        list2 = response.css("").extract()
        for i in list2:
            link2 = '' + str(i)
            request = Request(link2, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            yield request

        yield items

    def parseInfo2(self, response):
        # Extracts all data from the final page
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items
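
That said, sharing one items list through meta is fragile under Scrapy's concurrent scheduling (which is what produced the duplicated, out-of-order rows in the first attempt), and yielding a plain list from the parse generator is rejected by Scrapy's engine. The more common pattern is to yield each item from the final callback and let the feed exporter assemble the CSV. Here is a minimal sketch of that pattern; it reuses the myItems class from the question, and the empty selectors and URL prefixes are the question's redactions, not working values:

import scrapy
from scrapy import Request

# myItems is the Item class defined in the question.

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']  # redacted in the question

    def parse(self, response):
        # Follow the first set of links; no shared list is needed.
        for href in response.css("").extract():
            yield Request('https:...' + str(href), callback=self.parseInfo1)

    def parseInfo1(self, response):
        # Follow the second set of links from each intermediate page.
        for href in response.css("").extract():
            yield Request('' + str(href), callback=self.parseInfo2)

    def parseInfo2(self, response):
        # One item per final page; each yielded item becomes one CSV row
        # when the spider is run with -o output.csv.
        item = myItems()
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        yield item

Because every request here leads to a distinct page, dont_filter=True can also be dropped and Scrapy's duplicate filter left on, which prevents the same first-level page from being scraped more than once.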