How can I extract an HTML link from an HTML page with Python?

Question  Votes: 0  Answers: 2

With this Python code,

...
resp = logout_session.get(logout_url, headers=headers, verify=False, allow_redirects=False)
soup = BeautifulSoup(resp.content, "html.parser")
print(soup.prettify())

I am able to make the API call, and the response content is as follows:

<!DOCTYPE html>
<html>
 <head>...</head>
 <body>
  <div class="container">
   <div class="title logo" id="header">
    <img alt="" id="business-logo-login" src="/customviews/image/business_logo:f0a067275aba3c71c62cffa2f50ac69c/"/>
   </div>
   <div class="input-group alert alert-success text-center" id="title" role="alert">
    Successfully signed out
   </div>
   <div class="input-group alert text-center">
    <a href="/saml-idp/portal/">
     Login again
    </a>
   </div>
   <div>
    <p>
     You will be redirected to https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/ after 5 seconds ...
    </p>
    <script language="javascript" nonce="">
     window.onload = window.setTimeout(function() {
    window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
    </script>
   </div>
  </div>
 </body>
</html>

Now I want to extract the link:

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU 

from this content. Does anyone know how to do this in Python?

python beautifulsoup urlparse
2 Answers
0
votes

Try something like this:

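from bs4 import BeautifulSoup

api = """your response above"""  # placeholder: the HTML text, e.g. resp.text
soup = BeautifulSoup(api, "html.parser")
# Take the first <script> element on the page and read its inline source
scr = soup.select_one("script").string
# The URL is the first double-quoted string inside that script
print(scr.split('"')[1])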

The output should be the URL.
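
Taking the first script on the page only works if the elided <head> has no <script> tags of its own. The following is a minimal sketch of a stricter variant that scans every inline script for the redirect call. It swaps in the standard library's html.parser so that it runs without bs4. The names ScriptCollector and find_redirect_url, the /static/app.js script, and the trimmed sample page are illustrative assumptions, not part of the original post.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the inline source of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it
        if self.in_script:
            self.scripts[-1] += data


def find_redirect_url(html, marker="window.location.replace"):
    """Return the first double-quoted string after `marker`, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for source in collector.scripts:
        if marker in source:
            parts = source.split(marker, 1)[1].split('"')
            if len(parts) >= 3:
                return parts[1]
    return None


# Trimmed stand-in for the response in the question; the first script is assumed
sample_html = """<html><head><script src="/static/app.js"></script></head>
<body><script language="javascript" nonce="">
window.onload = window.setTimeout(function() {
window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
</script></body></html>"""

redirect_url = find_redirect_url(sample_html)
print(redirect_url)
```

Returning None instead of raising lets the caller decide what to do when the page contains no redirect script.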


0
votes

Try:

import re

# resp = requests.get(...)

url = re.search(r'window\.location\.replace\("([^"]+)', resp.text).group(1)
print(url)

Prints:

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU
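
Since the question is tagged urlparse, here is a short sketch of how the extracted URL could be split further with urllib.parse, for example to read the SAMLResponse value on its own. The variable names and the one-line sample script are assumptions for illustration. In real code the input would be resp.text.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Stand-in for resp.text: only the relevant script line from the question
html = (
    'window.location.replace("https://idpftc.business.com/saml/'
    'Gy736KPK3v1aWDPECRZKAn/proxy_logout/'
    '?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");'
)

# Same pattern as above, guarded so a missing match does not raise
match = re.search(r'window\.location\.replace\("([^"]+)', html)
url = match.group(1) if match else None

if url is not None:
    parts = urlsplit(url)          # scheme, netloc, path, query, fragment
    query = parse_qs(parts.query)  # maps each parameter name to a list of values
    print(parts.netloc)
    print(parts.path)
    print(query["SAMLResponse"][0])
```

Note that parse_qs percent-decodes values and turns "+" into a space, so use the raw parts.query string if the value has to be forwarded unchanged.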