Finding and downloading multiple PDF files from a website with Scrapy

I need to download multiple PDF files from a website using Scrapy. I am new to Python, and Scrapy is also new to me. I have been experimenting with the shell and some basic spiders.

Following another thread on this topic, I found and adapted the following code:

import urllib.parse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pdfspider"

    allowed_domains = ["www.has-sante.fr"]
    start_urls = ["https://www.has-sante.fr/jcms/fc_2875208/fr/rechercher-une-recommandation-un-avis?histstate=1"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

I run it from the command line:

scrapy crawl pdfspider

The result I get is not what I expected:

/usr/local/lib/python3.6/site-packages/OpenSSL/_util.py:6: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography will remove support for Python 3.6.
  from cryptography.hazmat.bindings.openssl.binding import Binding
2023-09-12 18:13:25 [scrapy.utils.log] INFO: Scrapy 2.6.3 started (bot: mypdfscraper)
2023-09-12 18:13:25 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 21.2.0, Python 3.6.5 (default, Mar 30 2018, 06:41:53) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 23.2.0 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Darwin-22.6.0-x86_64-i386-64bit
2023-09-12 18:13:25 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'mypdfscraper',
 'NEWSPIDER_MODULE': 'mypdfscraper.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['mypdfscraper.spiders']}
2023-09-12 18:13:25 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-09-12 18:13:25 [scrapy.extensions.telnet] INFO: Telnet Password: e919a38844a0a4a1
2023-09-12 18:13:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2023-09-12 18:13:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-09-12 18:13:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-09-12 18:13:25 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-09-12 18:13:25 [scrapy.core.engine] INFO: Spider opened
2023-09-12 18:13:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-12 18:13:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-12 18:13:25 [filelock] DEBUG: Attempting to acquire lock 4408289880 on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Lock 4408289880 acquired on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Attempting to acquire lock 4408290944 on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Lock 4408290944 acquired on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Attempting to release lock 4408290944 on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Lock 4408290944 released on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Attempting to release lock 4408289880 on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-09-12 18:13:25 [filelock] DEBUG: Lock 4408289880 released on /Users/alexismathieu/.cache/python-tldextract/3.6.5.final__3.6__a9d949__tldextract-3.1.2/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2023-09-12 18:13:25 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.has-sante.fr/robots.txt> (referer: None)
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 7 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 16 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 74 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 92 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 130 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 134 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 141 without any user agent to enforce it on.
2023-09-12 18:13:25 [protego] DEBUG: Rule at line 147 without any user agent to enforce it on.
2023-09-12 18:13:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.has-sante.fr/jcms/fc_2875208/fr/rechercher-une-recommandation-un-avis?histstate=1> (referer: None)
2023-09-12 18:13:26 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-12 18:13:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 563,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 66290,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.392696,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 9, 12, 16, 13, 26, 840182),
 'httpcompression/response_bytes': 103403,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 29,
 'log_count/INFO': 10,
 'memusage/max': 54800384,
 'memusage/startup': 54800384,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 9, 12, 16, 13, 25, 447486)}
2023-09-12 18:13:26 [scrapy.core.engine] INFO: Spider closed (finished)

I want to crawl the site and download the files themselves, not just metadata. Any help would be greatly appreciated.

python pdf scrapy

1 Answer

To download the files you can use Scrapy's built-in media pipelines: https://docs.scrapy.org/en/latest/topics/media-pipeline.html

It would look something like this (though the code could probably be improved):

import scrapy
from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pdfspider"

    allowed_domains = ["www.has-sante.fr"]
    start_urls = ["https://www.has-sante.fr/jcms/fc_2875208/fr/rechercher-une-recommandation-un-avis?histstate=1"]

    custom_settings = {
        'FILES_URLS_FIELD': 'file_urls',
        'FILES_RESULT_FIELD': 'files',
        # Directory (relative to where the crawl runs) where the PDFs will be stored
        'FILES_STORE': 'media/pdf_files',
        'MEDIA_ALLOW_REDIRECTS': True,
        'ITEM_PIPELINES': {
            # FilesPipeline downloads every URL listed in the item's 'file_urls' field
            'scrapy.pipelines.files.FilesPipeline': 1,
        }
    }

    def parse(self, response):
        # Follow each search result to its article page
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # FilesPipeline expects absolute URLs, so join relative hrefs with the page URL
        item = {}
        item['file_urls'] = [
            response.urljoin(href)
            for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract()
        ]
        yield item
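
By default, FilesPipeline stores each download under FILES_STORE/full/ with the SHA-1 hash of its URL as the filename. If you would rather keep the original PDF filenames, a minimal sketch of a FilesPipeline subclass overriding file_path could look like the following (the class name PdfFilesPipeline and its module location are placeholders assumed here, not part of the original answer):

import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class PdfFilesPipeline(FilesPipeline):
    # Save each PDF under FILES_STORE using the last segment of its URL,
    # e.g. https://www.has-sante.fr/.../guide.pdf -> 'guide.pdf'
    def file_path(self, request, response=None, info=None, *, item=None):
        return os.path.basename(urlparse(request.url).path)

You would then reference this class in ITEM_PIPELINES instead of the stock pipeline (for example 'mypdfscraper.pipelines.PdfFilesPipeline': 1, assuming it lives in the project's pipelines module). Keep in mind that reusing original filenames can silently overwrite a file if two documents share the same name, which is why the default hash-based naming exists.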