python/beautifulsoup: find all <a href> with specific anchor text

Question

Answer 1

Would something like this work?

In [39]: from bs4 import BeautifulSoup

In [40]: s = """\
   ....: <a href="http://example.com">TEXT</a>
   ....: <a href="http://example.com/link">TEXT</a>
   ....: <a href="http://example.com/page">TEXT</a>
   ....: <a href="http://dontmatchme.com/page">WRONGTEXT</a>"""

In [41]: soup = BeautifulSoup(s, 'html.parser')

In [42]: for link in soup.findAll('a', href=True, text='TEXT'):
   ....:     print(link['href'])
   ....:
   ....:
http://example.com
http://example.com/link
http://example.com/page

Answer 2

As of BeautifulSoup 4.4.0, the text= argument is deprecated in favor of string=. So, to find all anchor tags with a specific text, you can use the following:

[elm['href'] for elm in soup.find_all("a", string='TEXT')]

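The above selects only tags whose string exactly matches the filter string. If you have other conditions, for example that the anchor text must start with a particular string, you can also pass a regular expression or a function that tests the string: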

# filter anchor tags whose text starts with `TEXT`
import re
[elm['href'] for elm in soup.find_all("a", string=re.compile("^TEXT"))]

# or a plain string check (the guard covers tags whose .string is None)
[elm['href'] for elm in soup.find_all("a", string=lambda x: x is not None and x.startswith('TEXT'))]
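For comparison, here is a minimal self-contained sketch that runs the three kinds of string= filter side by side. The sample HTML and the variable names (html, exact, by_regex, by_func) are made up for illustration and are not from the original answer. The last anchor has mixed child nodes, so its .string is None and none of the filters should pick it up.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only
html = """
<a href="/a">TEXT</a>
<a href="/b">TEXT and more</a>
<a href="/c">other</a>
<a href="/d"><b>TEXT</b> <i>split</i></a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the anchor's string
exact = [a["href"] for a in soup.find_all("a", string="TEXT")]

# Regular expression: anchor text starts with TEXT
by_regex = [a["href"] for a in soup.find_all("a", string=re.compile(r"^TEXT"))]

# Function: same prefix test, guarded in case the callable receives None
by_func = [
    a["href"]
    for a in soup.find_all(
        "a", string=lambda s: s is not None and s.startswith("TEXT")
    )
]

print(exact)
print(by_regex)
print(by_func)
```

The None guard in the function filter costs nothing and avoids an AttributeError on anchors that contain more than one child node.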

Finally, since .find_all or .select returns a ResultSet object, which is essentially a Python list, you can filter its results with an if clause:

[elm['href'] for elm in soup.find_all("a") if elm.string == 'TEXT']
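One case where filtering the result list yourself can help is an anchor whose text is split across nested tags: .string is then None, so the comparison above skips it. The sketch below compares the full visible text via get_text(strip=True) instead. The helper name hrefs_by_visible_text and the sample HTML are hypothetical, and href=True is added so that anchors without an href do not raise a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only
html = """
<a href="/plain">TEXT</a>
<a href="/nested"><span>TE</span>XT</a>
<a href="/other">WRONGTEXT</a>
<a>TEXT</a>
"""
soup = BeautifulSoup(html, "html.parser")


def hrefs_by_visible_text(soup, wanted):
    """Return hrefs of anchors whose combined visible text equals `wanted`."""
    return [
        a["href"]
        for a in soup.find_all("a", href=True)  # skip anchors without href
        if a.get_text(strip=True) == wanted
    ]


# Comparing .string only sees anchors with a single text node
by_string = [a["href"] for a in soup.find_all("a", href=True) if a.string == "TEXT"]
by_text = hrefs_by_visible_text(soup, "TEXT")

print(by_string)
print(by_text)
```

Whether the nested anchor should count as a match depends on the page being scraped, so treat this as one option rather than a replacement for the string= filters.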
