How do I safely get the file extension from a URL?

Question

Consider the following URLs:

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:8024

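What is the proper way to determine the file extension (.m3u/.asx/.pls)? Obviously the last one has no file extension.

Edit: I forgot to mention that m3u/asx/pls are playlists (text files) for audio streams, and each must be parsed differently. The goal is to determine the extension and then send the URL to the proper parsing function. For example: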


import sys

url = sys.argv[1]
ext = GetExtension(url)
if ext == "pls":
  realurl = ParsePLS(url)
elif ext == "asx":
  realurl = ParseASX(url)
# (etc.)
else:
  realurl = url
Play(realurl)

GetExtension() should return the file extension (if any), preferably without connecting to the URL.

Answer 1

Use urlparse to parse the path out of the URL, then use os.path.splitext to get the extension.

import os
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url = 'http://www.plssomeotherurl.com/station.pls?id=111'
path = urlparse(url).path  # '/station.pls' (the query string is dropped)
ext = os.path.splitext(path)[1]  # '.pls'

Note that the extension may not be a reliable indicator of the file type. The HTTP Content-Type header may be a better one.
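With the question's second URL, urlparse puts the trailing ":8024" in the path, so os.path.splitext alone would return '.asx:8024'. The following is a minimal sketch of the GetExtension() helper the question asks for, written under a few assumptions: the name get_extension is hypothetical, a trailing colon followed by digits is treated as a misplaced port and stripped, and the result is lowercased with no leading dot so that it matches the question's ext == "pls" comparisons.

```python
import os
import re
from urllib.parse import urlparse


def get_extension(url):
    """Return the lowercased extension of the URL path without the dot, or ''."""
    path = urlparse(url).path
    # Assumption: a trailing ":<digits>" in the path is a misplaced port.
    path = re.sub(r":\d+$", "", path)
    ext = os.path.splitext(path)[1]
    return ext.lower().lstrip(".")


if __name__ == "__main__":
    for candidate in (
        "http://m3u.com/tunein.m3u",
        "http://asxsomeurl.com/listen.asx:8024",
        "http://www.plssomeotherurl.com/station.pls?id=111",
        "http://22.198.133.16:8024",
    ):
        print(candidate, "->", repr(get_extension(candidate)))
```

This never touches the network, which matches the question's preference, but it only inspects the URL string and says nothing about what the server actually returns.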

Answer 2

Using requests and mimetypes is the simplest way:

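import requests
import mimetypes

url = 'http://www.plssomeotherurl.com/station.pls?id=111'  # example URL from the question
response = requests.get(url)
# Drop parameters such as "; charset=utf-8" before guessing the extension
content_type = response.headers['content-type'].split(';')[0].strip()
extension = mimetypes.guess_extension(content_type)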

The extension includes the leading dot. For example, for the content type 'image/png', extension is '.png'.

Answer 3

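The truly correct way is not to use the file extension at all. Make a GET (or HEAD) request to the URL in question and use the returned "Content-Type" HTTP header to determine the content type. File extensions are unreliable. See MIME types (IANA media types) for more information and a list of useful MIME types.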
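As a rough illustration of dispatching on the Content-Type header rather than the extension, here is a sketch using only the standard library. The names playlist_kind, fetch_content_type, and PLAYLIST_TYPES are hypothetical, and the MIME types in the table are commonly seen values for these playlist formats, not a guaranteed or complete list; a real server may send something else entirely.

```python
import urllib.request

# Assumption: commonly used (not exhaustive) MIME types for each playlist format.
PLAYLIST_TYPES = {
    "audio/x-scpls": "pls",
    "audio/x-mpegurl": "m3u",
    "audio/mpegurl": "m3u",
    "video/x-ms-asf": "asx",
}


def playlist_kind(content_type):
    """Map a raw Content-Type value to 'pls', 'm3u', 'asx', or None."""
    # Strip parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";")[0].strip().lower()
    return PLAYLIST_TYPES.get(media_type)


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

A caller would then write something like kind = playlist_kind(fetch_content_type(url)) and fall back to playing the URL directly when kind is None. Unlike the extension-based approach, this does connect to the server, and some servers handle HEAD poorly, so a GET fallback may be needed.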

Answer 4

http://code.google.com/p/unladen-swallow/source/browse/branches/release-2009Q1-maint/Lib/psyco/support.py?r=292

Would you still expect the extension here to be ".py", even though the page is HTML rather than Python? Use the Content-Type header to determine the "type" of a URL.

Answer 5

import urllib2  # Python 2 only

def getContentType(pageUrl):
    page = urllib2.urlopen(pageUrl)
    pageHeaders = page.headers
    contentType = pageHeaders.getheader('content-type')
    return contentType

Answer 6

http://docs.python.org/library/urlparse.html

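Then split up the "path". You may be able to split the path with os.path.split, but your example 2, with :8024 at the end, needs manual handling. Are your file extensions always three letters? Or always letters and digits? Use a regular expression.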

Answer 7

import re

def fileExt( url ):
    # compile regular expressions
    reQuery = re.compile( r'\?.*$', re.IGNORECASE )
    rePort = re.compile( r':[0-9]+', re.IGNORECASE )
    reExt = re.compile( r'(\.[A-Za-z0-9]+$)', re.IGNORECASE )

    # remove query string
    url = reQuery.sub( "", url )

    # remove port
    url = rePort.sub( "", url )

    # extract extension
    matches = reExt.search( url )
    if matches is not None:
        return matches.group( 1 )
    return None

Edit: added handling for an explicit port such as :1234.

Answer 8

Use the rfc6266 module, for example:

import requests
import rfc6266

req = requests.head(downloadLink)
headersContent = req.headers['Content-Disposition']
rfcFilename = rfc6266.parse_headers(headersContent, relaxed=True).filename_unsafe
filename = requests.utils.unquote(rfcFilename)
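If installing a third-party module is not an option, a similar result can be sketched with the standard library's email.message.Message, which can parse a Content-Disposition value. This swaps in a different technique from the rfc6266 module above and is not claimed to cover every case that RFC 6266 describes; both function names are hypothetical.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def extension_from_content_disposition(header_value):
    """Return the extension (with leading dot) of the advertised filename, or ''."""
    filename = filename_from_content_disposition(header_value)
    if not filename:
        return ""
    return os.path.splitext(filename)[1]
```

Many responses carry no Content-Disposition header at all, so a caller should be ready for the empty result and fall back to another method.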

Answer 9

file_ext = "." + url.split("/")[-1].split(".")[-1]

This assumes the URL actually has a file extension.
