Using the Scrapy sitemap spider, how do I scrape article titles?

Question

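I am trying to crawl the Washington Post sitemap and find the articles that have "trump" in the title. I did my research here https://scrapy.readthedocs.io/en/latest/topics/spiders.html#sitemapspider, but I am struggling to recreate the example.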

My code

from scrapy.spiders import SitemapSpider
class SiteSpider(SitemapSpider):
    name = 'SiteSpider'
    sitemap_urls = ['http://www.washingtonpost.com/news-politics-sitemap.xml']
    sitemap_rule = [
        ('/trump/', 'parse_article'),
    ]

    def parse_article(self, response):
        print "<---- HERE ----->\n\n"
        with open("url.txt", "a") as myfile:
            myfile.write("\n"+response.url)

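As you can see from the stack trace below, my code raises a NotImplementedError. Even when an article has the word trump in its URL, it still raises NotImplementedError. What is wrong?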

2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/video/politics/white-house-doubles-down-on-crediting-trump-for-zero-commercial-airline-deaths/2018/01/02/40d0a4a8-effa-11e7-95e3-eff284e71c8d_video.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/video/politics/sanders-president-to-keep-options-open-on-iran-sanctions/2018/01/02/51ee9a00-f000-11e7-95e3-eff284e71c8d_video.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/politics/hatch-announces-he-will-not-seek-re-election/2018/01/02/8f475468-eff2-11e7-95e3-eff284e71c8d_story.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/politics/how-far-is-trump-willing-to-go-on-iran-amid-widespread-protests/2018/01/02/66c0e4a0-efcf-11e7-b390-a36dc3fa2842_story.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/politics/science-says-why-theres-a-big-chill-in-a-warmer-world/2018/01/02/0915cdf6-f016-11e7-95e3-eff284e71c8d_story.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/video/politics/with-hatchs-retirement-trump-is-losing-and-ally--and-might-be-gaining-a-foe/2018/01/02/abaa60dc-f015-11e7-95e3-eff284e71c8d_video.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/local/virginia-politics/in-a-young-county-a-millennial-takes-the-helm-as-board-chairman/2018/01/02/70b13d40-ec17-11e7-b698-91d4e35920a3_story.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.washingtonpost.com/local/md-politics/democrats-slam-hogan-over-rga-donation-from-poultry-company/2018/01/02/db8e6172-ef61-11e7-b3bf-ab90a706e175_story.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
2018-01-02 19:07:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.washingtonpost.com/local/md-politics/democrats-slam-hogan-over-rga-donation-from-poultry-company/2018/01/02/db8e6172-ef61-11e7-b3bf-ab90a706e175_story.html> (referer: https://www.washingtonpost.com/news-politics-sitemap.xml)
Traceback (most recent call last):
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rahmi/Documents/Honor_Project/scrapyenv/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError
2018-01-02 19:07:12 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-02 19:07:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 36013,
 'downloader/request_count': 83,
 'downloader/request_method_count/GET': 83,
 'downloader/response_bytes': 2127377,
 'downloader/response_count': 83,
 'downloader/response_status_count/200': 57,
 'downloader/response_status_count/301': 26,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 1, 3, 0, 7, 12, 651303),
 'log_count/DEBUG': 84,
 'log_count/ERROR': 55,
 'log_count/INFO': 7,
 'memusage/max': 52187136,
 'memusage/startup': 52187136,
 'request_depth_max': 1,
 'response_received_count': 57,
 'scheduler/dequeued': 81,
 'scheduler/dequeued/memory': 81,
 'scheduler/enqueued': 81,
 'scheduler/enqueued/memory': 81,
 'spider_exceptions/NotImplementedError': 55,
 'start_time': datetime.datetime(2018, 1, 3, 0, 7, 10, 174415)}

Answer 1

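You forgot the char `s` at the end of `sitemap_rules`, and that causes the problem. Because the attribute is named `sitemap_rule`, the spider never sees your rule and falls back to the default one, which sends every URL to `parse`. You didn't define `parse`, so the base class raises NotImplementedError.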
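To see why the misspelling ends in `parse`, here is a minimal sketch of the rule lookup, using only the standard library. `pick_callback` and `DEFAULT_RULES` are hypothetical names made up for this illustration, not Scrapy's API; the sketch assumes the documented behavior that each rule is a (regex, callback) pair, that the first matching rule wins, and that the default rule is `('', 'parse')`.

```python
import re

# Hypothetical stand-in for the default value of sitemap_rules:
# the empty pattern matches every URL and points at `parse`.
DEFAULT_RULES = [('', 'parse')]


def pick_callback(url, rules):
    """Return the callback name of the first rule whose regex matches url."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None  # no rule matched, so the URL would be skipped


url = ('https://www.washingtonpost.com/politics/'
       'how-far-is-trump-willing-to-go-on-iran-amid-widespread-protests/'
       '2018/01/02/66c0e4a0-efcf-11e7-b390-a36dc3fa2842_story.html')

# Misspelled attribute: the custom rule is ignored, so the default applies.
with_typo = pick_callback(url, DEFAULT_RULES)
# Correctly named attribute with a pattern that matches this URL.
with_fix = pick_callback(url, [('trump', 'parse_article')])

print(with_typo)  # parse
print(with_fix)   # parse_article
```

With the default rule every URL is handed to `parse`, which matches the traceback in the question: the error is raised for all articles, not only the ones about trump.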

You don't have to write to a file manually, because Scrapy can save the results as CSV, XML or JSON, i.e. (inside a project):

scrapy crawl SiteSpider -o output.csv

You only have to yield a dictionary with the data for one row.
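As a rough picture of "one dictionary per row", the sketch below writes a few yielded dictionaries to CSV with the standard `csv` module. It is not Scrapy's exporter; the plain function `parse_article(url)` is a hypothetical stand-in for the spider callback, and the two URLs are taken from the log in the question.

```python
import csv
import io


def parse_article(url):
    """Hypothetical stand-in for the spider callback: one dict per row."""
    yield {'url': url}


sample_urls = [
    'https://www.washingtonpost.com/politics/hatch-announces-he-will-not-seek-re-election/2018/01/02/8f475468-eff2-11e7-95e3-eff284e71c8d_story.html',
    'https://www.washingtonpost.com/politics/how-far-is-trump-willing-to-go-on-iran-amid-widespread-protests/2018/01/02/66c0e4a0-efcf-11e7-b390-a36dc3fa2842_story.html',
]

# Collect every yielded dictionary; each one becomes a CSV row.
rows = []
for u in sample_urls:
    rows.extend(parse_article(u))

# The dictionary keys become the header line, the values become the cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['url'], lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

Adding another key to the yielded dictionary (for example a title) would simply add another column.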

Below is working code that runs without a project. It saves the results in output.csv. Because no URL contains `/trump/`, I used `trump` instead.

from scrapy.spiders import SitemapSpider

class SiteSpider(SitemapSpider):

    name = 'SiteSpider'

    sitemap_urls = ['http://www.washingtonpost.com/news-politics-sitemap.xml']
    sitemap_rules = [('trump', 'parse_article')]

    def parse_article(self, response):
        print('parse_article url:', response.url)

        yield {'url': response.url}

# --- it runs without project and saves in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in file as CSV, JSON or XML
    # (Scrapy 2.1+ deprecates these two settings in favor of FEEDS)
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv', # output file path
})
c.crawl(SiteSpider)
c.start()

Result - output.csv

url
https://www.washingtonpost.com/politics/trumps-irish-golf-course-lost-23-million-in-2016/2018/01/02/b410a14c-ef5b-11e7-b390-a36dc3fa2842_story.html
https://www.washingtonpost.com/video/politics/with-hatchs-retirement-trump-is-losing-and-ally--and-might-be-gaining-a-foe/2018/01/02/abaa60dc-f015-11e7-95e3-eff284e71c8d_video.html
https://www.washingtonpost.com/politics/trump-administration-calls-on-iran-to-unblock-instagram-other-social-media-amid-protests/2018/01/02/06374624-efe3-11e7-95e3-eff284e71c8d_story.html
https://www.washingtonpost.com/politics/federal_government/ap-fact-check-trump-claims-credit-for-aviation-death-trend/2018/01/02/7755c9b8-eff5-11e7-95e3-eff284e71c8d_story.html
https://www.washingtonpost.com/politics/the-latest-trump-says-his-nuclear-button-is-bigger/2018/01/02/a32d350c-f023-11e7-95e3-eff284e71c8d_story.html
https://www.washingtonpost.com/politics/trump-takes-hard-line-on-dreamers-but-remains-interested-in-a-deal/2018/01/02/45a47e20-efdf-11e7-b390-a36dc3fa2842_story.html
https://www.washingtonpost.com/politics/the-latest-white-house-says-trump-is-sad-hatch-is-retiring/2018/01/02/f41ad89a-eff9-11e7-95e3-eff284e71c8d_story.html
https://www.washingtonpost.com/politics/how-far-is-trump-willing-to-go-on-iran-amid-widespread-protests/2018/01/02/66c0e4a0-efcf-11e7-b390-a36dc3fa2842_story.html
https://www.washingtonpost.com/news/politics/wp/2018/01/02/trumps-claim-that-he-prevented-air-traffic-deaths-is-his-most-questionable-yet/
https://www.washingtonpost.com/video/politics/white-house-doubles-down-on-crediting-trump-for-zero-commercial-airline-deaths/2018/01/02/40d0a4a8-effa-11e7-95e3-eff284e71c8d_video.html
https://www.washingtonpost.com/news/fact-checker/wp/2018/01/02/president-trump-has-made-1949-false-or-misleading-claims-over-347-days/
https://www.washingtonpost.com/politics/trump-sounds-open-to-korea-dialogue-says-kim-feels-pressure/2018/01/02/c55f702e-efe0-11e7-95e3-eff284e71c8d_story.html
https://www.washingtonpost.com/video/politics/sanders-wont-say-if-trump-is-open-to-supporting-romney/2018/01/02/52975c8c-eff9-11e7-95e3-eff284e71c8d_video.html
https://www.washingtonpost.com/video/politics/trump-we-are-going-to-have-a-tremendous-year/2017/12/31/d20c23a8-ee9b-11e7-95e3-eff284e71c8d_video.html
https://www.washingtonpost.com/video/politics/see-trumps-new-years-eve-party-at-mar-a-lago/2017/12/31/7f710988-eea5-11e7-95e3-eff284e71c8d_video.html
https://www.washingtonpost.com/news/powerpost/paloma/the-energy-202/2018/01/02/the-energy-202-trump-took-a-long-break-this-december-his-environmental-deputies-did-not/5a4ac00e30fb0469e883fe4f/
https://www.washingtonpost.com/news/the-fix/wp/2018/01/02/with-orrin-hatch-retiring-trump-will-lose-a-major-ally-in-the-senate/
https://www.washingtonpost.com/news/the-fix/wp/2018/01/02/huma-abedin-and-14-other-people-trump-thinks-should-maybe-be-in-jail/
https://www.washingtonpost.com/politics/federal_government/perils-abroad-full-plate-at-home-as-trump-opens-2nd-year/2018/01/01/a580cb84-ef51-11e7-95e3-eff284e71c8d_story.html
https://www.washingtonpost.com/news/the-fix/wp/2018/01/02/democrats-arent-just-running-against-trump-why-do-people-think-they-are/
https://www.washingtonpost.com/news/post-politics/wp/2018/01/02/trump-urges-justice-department-to-act-on-comey-suggests-huma-abedin-should-face-jail-time/
https://www.washingtonpost.com/news/powerpost/paloma/the-finance-202/2018/01/02/the-finance-202-congress-has-hefty-to-do-list-to-kick-off-trump-s-second-year/5a4abf1630fb0469e883fe4e/
https://www.washingtonpost.com/news/powerpost/paloma/daily-202/2018/01/02/daily-202-trump-s-true-priorities-revealed-in-holiday-news-dumps/5a4af37830fb0469e883fe50/
https://www.washingtonpost.com/news/post-politics/wp/2018/01/02/trump-threatens-to-cut-off-u-s-aid-to-palestinians-over-jerusalem-row/
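A quick check of why the pattern had to change from `/trump/` to `trump`: in these URLs the word appears inside a hyphenated slug, never as a path segment of its own. The sketch below assumes the rule patterns are applied with a regex search, and `matching` is a hypothetical helper written for this illustration; the URLs come from the log and the result list.

```python
import re

urls = [
    'https://www.washingtonpost.com/politics/hatch-announces-he-will-not-seek-re-election/2018/01/02/8f475468-eff2-11e7-95e3-eff284e71c8d_story.html',
    'https://www.washingtonpost.com/politics/how-far-is-trump-willing-to-go-on-iran-amid-widespread-protests/2018/01/02/66c0e4a0-efcf-11e7-b390-a36dc3fa2842_story.html',
    'https://www.washingtonpost.com/news/the-fix/wp/2018/01/02/with-orrin-hatch-retiring-trump-will-lose-a-major-ally-in-the-senate/',
]


def matching(pattern, candidates):
    """Return the URLs in which the regex pattern is found anywhere."""
    return [u for u in candidates if re.search(pattern, u)]


with_slashes = matching('/trump/', urls)  # needs "trump" as a whole path segment
bare_word = matching('trump', urls)       # finds "trump" inside the slug

print(len(with_slashes), len(bare_word))  # 0 2
```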
