How do I extract a link from an HTML page with Python?

Question

With this Python code,

...
resp = logout_session.get(logout_url, headers=headers, verify=False, allow_redirects=False)
soup = BeautifulSoup(resp.content, "html.parser")
print(soup.prettify())

I am able to make the API call, and the response content is as follows:

<!DOCTYPE html>
<html>
 <head>...</head>
 <body>
  <div class="container">
   <div class="title logo" id="header">
    <img alt="" id="business-logo-login" src="/customviews/image/business_logo:f0a067275aba3c71c62cffa2f50ac69c/"/>
   </div>
   <div class="input-group alert alert-success text-center" id="title" role="alert">
    Successfully signed out
   </div>
   <div class="input-group alert text-center">
    <a href="/saml-idp/portal/">
     Login again
    </a>
   </div>
   <div>
    <p>
     You will be redirected to https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/ after 5 seconds ...
    </p>
    <script language="javascript" nonce="">
     window.onload = window.setTimeout(function() {
    window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
    </script>
   </div>
  </div>
 </body>
</html>

Now I want to extract this link:

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU

Does anyone know how to get it out of this content with Python?

Answer 1

Try something like this:

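from bs4 import BeautifulSoup as bs

api = """your response above"""  # the HTML text of the response
soup = bs(api, "html.parser")
# Take the text of the first <script> tag in the document
scr = soup.select_one('script').string
# The URL is the first double-quoted string inside that script
print(scr.split('"')[1])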

The output should be the URL.

Answer 2

Try:

import re

# resp = requests.get(...)

url = re.search(r'window\.location\.replace\("([^"]+)', resp.text).group(1)
print(url)

Prints:

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU
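The regex approach can be wrapped in a small helper that returns None instead of raising AttributeError when the page contains no redirect. The following is only a sketch under some assumptions: the function name extract_redirect_url, the sample_html string, and the idp.example.com URL with its abc123 token are made up for illustration, and the pattern is loosened slightly to accept single quotes and whitespace after the parenthesis. In real use you would pass resp.text in place of sample_html.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Same idea as the pattern above, but tolerating whitespace and either quote style.
REDIRECT_RE = re.compile(r"""window\.location\.replace\(\s*["']([^"']+)["']""")


def extract_redirect_url(html):
    """Return the URL passed to window.location.replace(), or None if absent."""
    match = REDIRECT_RE.search(html)
    return match.group(1) if match else None


# Trimmed stand-in for resp.text; the URL and token are placeholders.
sample_html = '''
<script language="javascript" nonce="">
 window.onload = window.setTimeout(function() {
 window.location.replace("https://idp.example.com/proxy_logout/?SAMLResponse=abc123");}, 5000);
</script>
'''

url = extract_redirect_url(sample_html)
print(url)

if url is not None:
    # Split the query string if only the SAMLResponse value is needed.
    query = parse_qs(urlsplit(url).query)
    print(query.get("SAMLResponse"))
```

Checking for None before using the result keeps the script from crashing if the logout page layout ever changes.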
