How to get the PDF filename with Python requests?

Question

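I'm using the Python requests library to fetch a PDF file from the web. This works fine, but now I also want the original filename. If I open the PDF in Firefox and click download, a filename for saving the PDF is already defined. How do I get this filename?

For example: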

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'

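I checked r.headers for anything interesting, but there is no filename in there. I was actually hoping for something like r.filename.

Does anyone know how to get the filename of a downloaded PDF file with the requests library?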

Answer 1

It is specified in the HTTP header content-disposition. So to extract the name, you would do the following:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

The name is extracted from the string with a regular expression (the re module).
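Note that the pattern above keeps any surrounding quotes and any parameters that follow the filename. A minimal sketch of a slightly stricter pattern is shown below; the function name extract_filename is hypothetical, and the pattern is an assumption that covers only the plain filename= form, not the encoded filename*= form.

```python
import re


def extract_filename(content_disposition):
    """Return the filename parameter of a Content-Disposition value, or None."""
    # Optional opening quote, then everything up to the next quote or semicolon.
    match = re.search(r'filename="?([^";]+)"?', content_disposition)
    return match.group(1).strip() if match else None


print(extract_filename('attachment; filename="my paper.pdf"'))
```

Returning None instead of indexing into an empty result avoids an IndexError when the header carries no filename parameter.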

Answer 2

Building on some of the other answers, here is how I did it. If there is no Content-Disposition header, I parse the name from the download URL:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:

        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

Arguably there are better ways to parse the URL string, but for simplicity I didn't want to involve any more libraries.
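For what it's worth, the standard library can do the URL parsing without extra dependencies. The sketch below is one possible approach, not the answerer's method; the function name filename_from_url is hypothetical. It drops the query string and decodes percent-escapes, both of which a plain split on "/" leaves in place.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, without query string or escapes."""
    path = urlparse(url).path            # strips ?query and #fragment
    return unquote(posixpath.basename(path))


print(filename_from_url('http://www.example.com/downloads/sample.pdf?token=abc'))
```

If the URL path ends with a slash, the result is an empty string, so a caller may want to supply its own default name in that case.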

Answer 3

Apparently, for this particular resource, it is in:

r.headers['content-disposition']

But I don't know whether that is always the case.

Answer 4

A simple Python 3 implementation to get the filename from Content-Disposition:

import requests
response = requests.get(<your-url>)
print(response.headers.get("Content-Disposition").split("filename=")[1])

Answer 5

You can use werkzeug to parse options headers: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> import werkzeug
>>> werkzeug.http.parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})

Answer 6

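According to the documentation, neither Content-Disposition nor its filename attribute is required. I also checked dozens of links on the internet and did not find a single response with a Content-Disposition header. So in most cases I wouldn't rely on it too much and would just take this information from the request URL (note: I take it from req.url because there may be redirects and we want the real filename). I use werkzeug because it looks more robust and handles both quoted and unquoted filenames. In the end I came up with this solution (works since Python 3.8):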

from urllib.parse import urlparse

import requests
import werkzeug


def get_filename(url: str):
    try:
        with requests.get(url) as req:
            if content_disposition := req.headers.get("Content-Disposition"):
                param, options = werkzeug.http.parse_options_header(content_disposition)
                if param == 'attachment' and (filename := options.get('filename')):
                    return filename

            path = urlparse(req.url).path
            name = path[path.rfind('/') + 1:]
            return name
    except requests.exceptions.RequestException as e:
        raise e

I wrote some tests using pytest and requests_mock:

import pytest
import requests
import requests_mock

from main import get_filename

TEST_URL = 'https://pwrk.us/report.pdf'


@pytest.mark.parametrize(
    'headers,expected_filename',
    [
        (
                {'Content-Disposition': 'attachment; filename="filename.pdf"'},
                "filename.pdf"
        ),
        (
                # The string following filename should always be put into quotes;
                # but, for compatibility reasons, many browsers try to parse unquoted names that contain spaces.
                # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#directives
                {'Content-Disposition': 'attachment; filename=filename with spaces.pdf'},
                "filename with spaces.pdf"
        ),
        (
                {'Content-Disposition': 'attachment;'},
                "report.pdf"
        ),
        (
                {'Content-Disposition': 'inline;'},
                "report.pdf"
        ),
        (
                {},
                "report.pdf"
        )
    ]
)
def test_get_filename(headers, expected_filename):
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, text='resp', headers=headers)
        assert get_filename(TEST_URL) == expected_filename


def test_get_filename_exception():
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, exc=requests.exceptions.RequestException)
        with pytest.raises(requests.exceptions.RequestException):
            get_filename(TEST_URL)

Answer 7

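Use urllib.request instead of requests, because then you can call urllib.request.urlopen(...).headers.get_filename(), which is safer than some of the other answers for the following reason:

If the [Content-Disposition] header does not have a filename parameter, this method falls back to looking for the name parameter on the Content-Type header.

After that, it is safer still to additionally fall back to the filename in the URL, as in another answer.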
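The headers object returned by urlopen is an email.message.Message, so the same get_filename() logic can probably be reused with requests by copying the relevant headers into a Message. The following is a minimal sketch under that assumption; the function name filename_from_headers is hypothetical, and it accepts any mapping with a get method, such as r.headers.

```python
from email.message import Message


def filename_from_headers(headers):
    """Apply email.message.Message.get_filename() to a mapping of HTTP headers."""
    msg = Message()
    # get_filename() reads filename from Content-Disposition and falls back
    # to the name parameter of Content-Type.
    for name in ("Content-Disposition", "Content-Type"):
        value = headers.get(name)
        if value:
            msg[name] = value
    return msg.get_filename()


print(filename_from_headers({'Content-Disposition': 'attachment; filename="report.pdf"'}))
```

When neither header supplies a name the function returns None, at which point a caller can fall back to the URL.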
