How to follow meta refreshes in Python

0 votes, 6 answers

Python's urllib2 follows 3xx redirects to get the final content. Is there a way to make urllib2 (or some other library such as httplib2) also follow meta refreshes? Or do I need to parse the HTML manually for the refresh meta tag?
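For reference, the tag being asked about looks like this; the delay and the target URL are packed into a single content attribute, which is why it has to be parsed out of the page body rather than read from an HTTP header (the URL here is just an example):

<meta http-equiv="refresh" content="5; url=http://example.com/next/">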

python redirect refresh urllib2 httplib2
6 Answers
12 votes

Here is a solution using BeautifulSoup and httplib2 (and certificate-based authentication):

import BeautifulSoup
import httplib2

def meta_redirect(content):
    # Parse the page and look for a <meta http-equiv="Refresh"> tag.
    soup = BeautifulSoup.BeautifulSoup(content)

    result = soup.find("meta", attrs={"http-equiv": "Refresh"})
    if result:
        # The content attribute looks like "5; url=http://...".
        wait, text = result["content"].split(";")
        if text.strip().lower().startswith("url="):
            url = text.strip()[4:]
            return url
    return None

def get_content(url, key, cert):
    h = httplib2.Http(".cache")
    h.add_certificate(key, cert, "")

    resp, content = h.request(url, "GET")

    # Follow the chain of meta refreshes, re-parsing each new page once.
    redirect_url = meta_redirect(content)
    while redirect_url:
        resp, content = h.request(redirect_url, "GET")
        redirect_url = meta_redirect(content)

    return content
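A minimal usage sketch; the URL and the key/cert file paths below are placeholders for your own:

html = get_content("https://example.com/page", "client.key", "client.crt")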

5 votes

A similar solution using the requests and lxml libraries. It also does a simple check that what is being tested really is HTML (a requirement in my implementation), and it can capture and reuse cookies by using the requests library's sessions (sometimes necessary when redirects plus cookies are used as an anti-scraping mechanism).

import magic
import mimetypes
import requests
from lxml import html 
from urlparse import urljoin

def test_for_meta_redirections(r):
    # Only bother parsing if the response body really is HTML.
    mime = magic.from_buffer(r.content, mime=True)
    extension = mimetypes.guess_extension(mime)
    if extension == '.html':
        html_tree = html.fromstring(r.text)
        # Case-insensitive match on the http-equiv attribute.
        attrs = html_tree.xpath("//meta[translate(@http-equiv, 'REFSH', 'refsh') = 'refresh']/@content")
        if attrs:
            wait, text = attrs[0].split(";")
            text = text.strip()
            if text.lower().startswith("url="):
                url = text[4:]
                if not url.startswith('http'):
                    # Relative URL, resolve it against the current page
                    url = urljoin(r.url, url)
                return True, url
    return False, None


def follow_redirections(r, s):
    """
    Recursive function that follows meta refresh redirections if they exist.
    """
    redirected, url = test_for_meta_redirections(r)
    if redirected:
        r = follow_redirections(s.get(url), s)
    return r

Usage:

s = requests.session()
r = s.get(url)
# test for and follow meta redirects
r = follow_redirections(r, s)

1 vote

Well, it seems no library supports it, so I have been using this code:

import urllib2
import urlparse
import re

def get_hops(url):
    # Matches the url= part of a meta refresh tag, e.g.
    # <meta http-equiv="refresh" content="5; url=http://...">
    redirect_re = re.compile('<meta[^>]*?url=(.*?)["\']', re.IGNORECASE)
    hops = []
    while url:
        if url in hops:
            # We have been here before: redirect loop, stop following.
            url = None
        else:
            hops.insert(0, url)
            response = urllib2.urlopen(url)
            if response.geturl() != url:
                # urllib2 followed an HTTP (3xx) redirect; record it too.
                hops.insert(0, response.geturl())
            # check for a redirect meta tag
            match = redirect_re.search(response.read())
            if match:
                # Resolve a possibly relative URL against the current one.
                url = urlparse.urljoin(url, match.groups()[0].strip())
            else:
                url = None
    return hops
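A quick usage sketch (the URL is a placeholder; get_hops returns the chain in reverse order, with the final destination first and the URL you started from last):

hops = get_hops('http://example.com/start')
print hops[0]   # final destination
print hops[-1]  # the original URL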

1 vote

If you don't want to use bs4, you can use lxml like this:

from lxml.html import soupparser

def meta_redirect(content):
    root = soupparser.fromstring(content)
    result_url = root.xpath('//meta[@http-equiv="refresh"]/@content')
    if not result_url:
        return None
    result_url = str(result_url[0])
    # The attribute looks like "0; url=http://...", with "url" in either case,
    # so try the lowercase split first and fall back to the uppercase one.
    urls = result_url.split('url=') if len(result_url.split('url=')) >= 2 else result_url.split('URL=')
    return urls[1] if len(urls) >= 2 else None
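For example, with a made-up page (the URL is a placeholder):

page = '<html><head><meta http-equiv="refresh" content="0; url=http://example.com/next"></head></html>'
print(meta_redirect(page))  # prints: http://example.com/next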

0 votes

I would like to offer an updated version of the code written here. Given that urlparse does not work on Python 3 and has been replaced by urllib.parse, I have made my own adaptation for Python 3.10+, and also dropped httplib2/urllib2 in favor of requests. Please understand that I merged code from existing posts here and modified it to create an updated answer; this is just a collaborative update, not my own work.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def meta_redirect(response):
    soup = BeautifulSoup(response.text, features='html.parser')

    # Match the http-equiv attribute case-insensitively instead of
    # lowercasing the whole document, which would also mangle the URL.
    result = soup.find("meta", attrs={"http-equiv": lambda v: v and v.lower() == "refresh"})

    if result:
        # The content attribute looks like "5; url=http://...".
        wait, text = result["content"].split(";", 1)

        if text.strip().lower().startswith("url="):

            url = text.strip()[4:].replace("'", "")

            if not url.startswith('http'):
                # Relative URL, resolve it against the current page.
                url = urljoin(response.url, url)

            return url
    return None

def get_content(url):
    # verify=False skips TLS certificate checks, as in the original post.
    response = requests.get(url, verify=False)

    # follow the chain of meta refreshes
    redirect_url = meta_redirect(response)
    while redirect_url:
        response = requests.get(redirect_url, verify=False)
        redirect_url = meta_redirect(response)

    return response

def main():
    url = 'put your url here that has meta redirects'

    source = get_content(url)

    # This will print the source of the final page
    print(source.text)

if __name__ == '__main__':
    main()

-1 votes

Use BeautifulSoup or lxml to parse the HTML.
