Get all links in HTML, including links inside conditional comments

Question

Suppose I have this simple HTML:

<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>

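Is there a way to get both links using lxml.html or BeautifulSoup? At the moment I only get one. In other words, I want the parser to look inside HTML conditional comments as well (I'm not sure what the technical term is).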

lxml.html

>>> from lxml import html
>>> doc = html.fromstring(s)
>>> list(doc.iterlinks())

<<< [(<Element a at 0x10f7f7bf0>, 'href', 'http://link1.com', 0)]

BeautifulSoup

>>> from BeautifulSoup import BeautifulSoup
>>> b = BeautifulSoup(s)
>>> b.findAll('a')

<<< [<a href="http://link1.com">Link 1</a>]

Answer 1

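The parser treats the contents of a conditional comment as plain comment text, so you need to extract the comments and then parse them as HTML in their own right.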

html = '''<html>
  <body>

    <!--[if !mso]><!-->
    <a href="http://link1.com">Link 1</a>
    <!--<![endif]-->

    <!--[if mso]>
      <a href="http://link2.com">Link 2</a>
    <![endif]-->

  </body>
</html>'''



from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(html, 'html.parser')

# Links in the regular markup.
links = soup.find_all('a', href=True)

# Conditional comments are Comment nodes; parse each one as HTML and collect its links.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    links += BeautifulSoup(comment, 'html.parser').find_all('a', href=True)

print(links)

Output:

[<a href="http://link1.com">Link 1</a>, <a href="http://link2.com">Link 2</a>]
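The same idea (pull out each comment, then parse its text as HTML) can also be sketched with only the standard library, in case neither bs4 nor lxml is available. This is a minimal illustration and not part of the original answer; the names LinkCollector and extract_links are hypothetical, and it returns the href strings rather than tag objects.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, including those inside comments."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag that has a non-empty one.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_comment(self, data):
        # Re-parse the comment body as HTML so that links hidden in
        # conditional comments are picked up as well.
        inner = LinkCollector()
        inner.feed(data)
        inner.close()
        self.links.extend(inner.links)


def extract_links(source):
    """Return all href values found in source, in document order."""
    parser = LinkCollector()
    parser.feed(source)
    parser.close()
    return parser.links
```

Calling extract_links(html) on the sample document from this answer should yield both URLs. Because handle_comment recurses on the comment text, comments nested inside comments are handled too.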
