How do I stop Scrapy from running the same spider twice?

Question (votes: 0, answers: 2)

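I am running a spider from code, following the docs, but for some reason the spider runs a second time after it has finished crawling. I tried adding stop_after_crawl and calling stop(), with no luck. When it tries to run again, it also gives me the error below.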

twisted.internet.error.ReactorNotRestartable

Any help is appreciated, thanks!

The code

import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class DocSpider(scrapy.Spider):
    """
    This is the broad scraper. Its name is doc_spider and it can be invoked by creating
    a CrawlerProcess() object and passing it the Spider class. It scrapes the websites
    listed in a csv file for their content and returns the results as a .json file.
    """

    # Name of the spider
    name = 'doc_spider'

    # File with the URL list (raw string so the backslashes are kept literally)
    urlsList = pd.read_csv(r'B:\docubot\DocuBots\Model\Data\linksToScrape.csv')
    urls = []
    # Take the urls and insert them into a url list
    for url in urlsList['urls']:
        urls.append(url)

    # Scrape through all the websites in the urls list
    start_urls = urls

    # This method parses the results and is called automatically
    def parse(self, response):
        data = {}
        # Iterates through all <p> tags
        for content in response.xpath('/html//body//div[@class]//div[@class]//p'):
            if content:
                # Store the current url
                data['links'] = response.request.url
                # Store the texts within the <p> tags
                data['texts'] = " ".join(content.xpath('//p/text()').extract())

        yield data

    def run_crawler(self):
        settings = get_project_settings()
        settings.set('FEED_FORMAT', 'json')
        settings.set('FEED_URI', 'scrape_results.json')
        c = CrawlerProcess(settings)
        c.crawl(DocSpider)
        c.start(stop_after_crawl=True)


D = DocSpider()
D.run_crawler()

Terminal error output

Traceback (most recent call last):
  File "web_scraper.py", line 52, in <module>
    D.run_crawler()
  File "web_scraper.py", line 48, in run_crawler
    c.start(stop_after_crawl=True)
  File "B:\Python\lib\site-packages\scrapy\crawler.py", line 312, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
python scrapy web-crawler data-collection
2 answers
0 votes

You need to move run_crawler outside the DocSpider class:

class DocSpider(scrapy.Spider):
    .....

def run_crawler():
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')
    settings.set('FEED_URI', 'scrape_results.json')
    c = CrawlerProcess(settings)
    c.crawl(DocSpider)
    c.start(stop_after_crawl=True)


run_crawler()

0 votes

SOLUTION

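Found the solution. Apparently the spider was run again every time my code was imported, because the call at the bottom of the file executes on import as well. The second run then calls start() on a Twisted reactor that has already run and stopped, which is presumably what raises ReactorNotRestartable. So I had to add an if statement so that the spider only runs when the file is executed directly.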

    def run_crawler(self):
        if __name__ == "__main__":
            settings = get_project_settings()
            settings.set('FEED_FORMAT', 'json')
            settings.set('FEED_URI', 'scrape_results.json')
            c = CrawlerProcess(settings)
            c.crawl(DocSpider)
            c.start(stop_after_crawl=True)

newProc = DocSpider()
newProc.run_crawler()
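
The effect of the `__name__` check can be reproduced without Scrapy. The following is only a minimal sketch under stated assumptions: `demo_spider`, `RUNS` and `count_runs` are hypothetical names, and the stand-in `run_crawler()` appends to a list instead of starting a `CrawlerProcess`. It also places the guard around the module-level call, which is the more common placement, rather than inside the method as above.

```python
import importlib.util
import os
import runpy
import tempfile

# Hypothetical stand-in for the spider script: the call at the bottom has no guard.
UNGUARDED = '''
RUNS = []

def run_crawler():
    RUNS.append("crawl")  # stand-in for CrawlerProcess(...).start()

run_crawler()
'''

# Same script, but the call only happens when the file is executed directly.
GUARDED = '''
RUNS = []

def run_crawler():
    RUNS.append("crawl")  # stand-in for CrawlerProcess(...).start()

if __name__ == "__main__":
    run_crawler()
'''


def count_runs(source, as_script):
    """Execute `source` as a script or as an import; return how often it 'crawled'."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_spider.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(source)
        if as_script:
            # Comparable to `python demo_spider.py`: __name__ is "__main__".
            namespace = runpy.run_path(path, run_name="__main__")
            return len(namespace["RUNS"])
        # Comparable to `import demo_spider`: __name__ is "demo_spider".
        spec = importlib.util.spec_from_file_location("demo_spider", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return len(module.RUNS)


for label, source in (("unguarded", UNGUARDED), ("guarded", GUARDED)):
    print(label, "imported:", count_runs(source, as_script=False))
    print(label, "as script:", count_runs(source, as_script=True))
```

In this sketch the unguarded file crawls once whether it is imported or executed, while the guarded file crawls only when executed as a script. That matches the behaviour described above: anything that imports the spider module no longer triggers a second crawl.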