BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

Question

Answer 1

Short answer:

soup.findAll(text=True)

This question has already been answered on StackOverflow and in the BeautifulSoup documentation.

UPDATE:

To clarify, here is a working piece of code:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
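
For readers on Python 3, the same per-paragraph loop might look like this with Beautiful Soup 4. This is only a sketch, assuming the bs4 package is installed; `find_all(string=True)` is the bs4 spelling of `findAll(text=True)`, and the name `lines` is illustrative.

```python
from bs4 import BeautifulSoup

txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(txt, "html.parser")

# For each <p>, collect every text node at any depth and join them.
lines = ["".join(node.find_all(string=True)) for node in soup.find_all("p")]

for line in lines:
    print(line)
```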

Answer 2

The accepted answer is great, but it is now 6 years old, so here is the current Beautiful Soup 4 version of that answer:

>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
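
One thing worth knowing about `soup.strings` is that it also yields the newlines between the tags. A possible variation, sketched here under the assumption that whitespace-only nodes are unwanted, is `soup.stripped_strings`. The trade-off is that "Light " and "green" come back as two separate items, because they are separate text nodes.

```python
from bs4 import BeautifulSoup

txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(txt, "html.parser")

# .strings yields every text node, including the newlines between <p> tags.
all_strings = list(soup.strings)

# .stripped_strings strips each node and skips whitespace-only ones.
stripped = list(soup.stripped_strings)

print(stripped)
```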

Answer 3

I stumbled upon this very same question and wanted to share a 2019 version of the solution. Maybe it helps somebody.

# importing the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# setting up your BeautifulSoup Object
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup( webpage.read(), features="lxml")
p_tags = soup.find_all('p')


for each in p_tags: 
    print (str(each.get_text()))

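Note that we loop over the list of <p> tags one by one and call get_text() on each of them. get_text() strips the tags, so only the text is printed.

Also:

In bs4 it is better to use the newer 'find_all()' than the older findAll()
urllib2 was replaced by urllib.request and urllib.error in Python 3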

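Assuming the page contains the same sample paragraphs as in the answers above, your output should be:

Red
Blue
Yellow
Light green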

Hope this helps someone looking for an updated solution.
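
Since urllib.error is mentioned above, here is a rough sketch of how the fetch and the get_text() loop could be wrapped together with basic error handling. `fetch_paragraph_texts` is a hypothetical helper name, not part of bs4 or urllib, and the sketch uses the built-in "html.parser" by default so that lxml does not need to be installed.

```python
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_paragraph_texts(url, parser="html.parser"):
    """Return the text of every <p> on the page, or [] if the fetch fails.

    Hypothetical helper for illustration only.
    """
    try:
        with urlopen(url) as response:
            html = response.read()
    except URLError as exc:  # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {exc}")
        return []
    soup = BeautifulSoup(html, parser)
    # get_text() drops the markup and keeps the text at any nesting depth.
    return [p.get_text() for p in soup.find_all("p")]
```

Calling fetch_paragraph_texts("https://insertyourwebpage.com") and printing each item would mirror the loop in this answer.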

Answer 4

Data scraped from a website usually contains tags. To drop the tags and show only the text content, you can use the text attribute.

For example,

    from BeautifulSoup import BeautifulSoup

    import urllib2 
    url = urllib2.urlopen("https://www.python.org")

    content = url.read()

    soup = BeautifulSoup(content)

    title = soup.findAll("title")

    paragraphs = soup.findAll("p")

    print paragraphs[1]  # Second paragraph with tags

    print paragraphs[1].text  # Second paragraph without tags

In this example, I collect all the paragraphs from the Python site and display the second one with and without tags.
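
The same contrast can be tried without a network request. The following is a small sketch for Python 3 and bs4 (an assumption, since the code above targets Python 2 and BeautifulSoup 3), using a made-up two-paragraph snippet instead of the Python site.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, standing in for a downloaded page.
html = "<p>Red</p><p>Light <b>green</b></p>"
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

with_tags = str(paragraphs[1])     # markup included
without_tags = paragraphs[1].text  # text only

print(with_tags)     # <p>Light <b>green</b></p>
print(without_tags)  # Light green
```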

Answer 5

First, convert the html to a string using str. Then, use the following code in your program:

import re
x = str(soup.find_all('p'))
content = str(re.sub("<.*?>", "", x))

This is called a regex. It removes everything from a < up to the next >, that is, the tags themselves, and leaves the text between them.
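
To see what the pattern does and where it can go wrong, here is a small standard-library sketch. `strip_tags` is a hypothetical helper name and the sample strings are made up. The second call shows one known weakness of this approach: a `>` inside an attribute value ends the match too early.

```python
import re

TAG_RE = re.compile(r"<.*?>")  # non-greedy: from "<" to the nearest ">"


def strip_tags(html):
    """Remove anything that looks like a tag (hypothetical helper)."""
    return TAG_RE.sub("", html)


simple = strip_tags("<p>Light <b>green</b></p>")
print(simple)  # Light green

# A ">" inside an attribute value cuts the tag match short,
# so part of the tag is left behind in the result.
broken = strip_tags('<a title="a > b">link</a>')
print(repr(broken))
```

For markup that is not under your control, get_text() from the other answers is likely the safer choice.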

Answer 6

I think there is a simpler way to get all the inner text.

See the Beautiful Soup documentation for get_text().

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
""", "html.parser")

print(list(map(lambda x: x.get_text(), soup.find_all("p"))))
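
get_text() also accepts a separator and a strip flag, which can matter when the markup has no whitespace between nested tags. A brief sketch with made-up markup (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Made-up markup: no space before <b>, extra spaces around "Red".
soup = BeautifulSoup("<p>Light<b>green</b></p><p>  Red  </p>", "html.parser")

# Default: text nodes are concatenated exactly as they appear.
plain = [p.get_text() for p in soup.find_all("p")]

# With a separator and strip=True, each node is stripped and joined by " ".
spaced = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

print(plain)
print(spaced)
```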
