How do I get the opening and closing tags from an HTML string with Beautiful Soup?

Question

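I'm writing a Python script with Beautiful Soup in which I need to get the opening tag from a string that contains some HTML code.

Here is my string:

string = "<p>...</p>"

I want to get <p> in a variable named opening_tag and </p> in a variable named closing_tag. I searched the documentation but couldn't find a solution. Can anyone advise?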

Answer 1

There is no direct way to get the opening and closing parts of a tag in BeautifulSoup, but you can at least get its name:

>>> from bs4 import BeautifulSoup
>>> 
>>> html_content = """
... <body>
...     <p>test</p>
... </body>
...  """
>>> soup = BeautifulSoup(html_content, "lxml")
>>> p = soup.p
>>> print(p.name)
p

With Python's html.parser module, you can listen for the "start tag" and "end tag" events.
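As a minimal sketch of that event-based approach, the standard library's HTMLParser can be subclassed to record each tag as it is encountered. The class name TagCollector and the sample markup are assumptions made for illustration; get_starttag_text() returns the start tag as it was written in the source.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records opening and closing tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []
        self.closing_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() gives the raw text of the most recently opened start tag
        self.opening_tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # A closing tag carries no attributes, so the name alone is enough to rebuild it
        self.closing_tags.append(f"</{tag}>")


collector = TagCollector()
collector.feed('<p class="intro">...</p>')
collector.close()

opening_tag = collector.opening_tags[0]
closing_tag = collector.closing_tags[0]
print(opening_tag)
print(closing_tag)
```

This route does not use BeautifulSoup at all, so it is mainly useful when only the tag text is needed rather than a parsed tree.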

Answer 2

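There is a way to do this with BeautifulSoup and a simple regular expression:

1. Put the paragraph into a BeautifulSoup object, e.g. soupParagraph.
2. Move the contents between the opening tag (<p>) and the closing tag (</p>) into another BeautifulSoup object, e.g. soupInnerParagraph. (Because the contents are moved, they are not deleted.)
3. soupParagraph now holds only the opening and closing tags.
4. Convert soupParagraph to HTML text and store it in a string variable.
5. To get the opening tag, use a regular expression to remove the closing tag from that string.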

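In general, parsing HTML with regular expressions is problematic and usually best avoided. Here, though, it may be reasonable.

The closing tag is simple: it has no attributes defined for it, and comments are not allowed inside it. See:

Can I add attributes on closing tags?

HTML comments inside an opening tag of the element

This code gets the opening tag from a <body...> ... </body> section. The code has been tested.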

import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", 'html.parser')
bodyContentsList = body.contents
# range() is evaluated once, so the loop runs once per original child even as the list shrinks
for _ in range(len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so the next element to move is always at index 0
    bodyInnerHtml.append(bodyContentsList[0])

# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter='html')
# Extract the opening tag, by removing the closing tag
regex = r"(\s*<\/body\s*>\s*$)\Z"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR.  The expected HTML </body> tag was not found.")
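For comparison, here is a self-contained sketch of the same "empty the element, then serialize it" idea that leaves the original tree untouched and uses plain string slicing instead of a regular expression. The function name split_outer_tags and the sample markup are assumptions for illustration, and it relies on copy.copy() returning an independent copy of a bs4 Tag.

```python
import copy

from bs4 import BeautifulSoup


def split_outer_tags(tag):
    """Hypothetical helper: return (opening_tag, closing_tag) without modifying tag."""
    shell = copy.copy(tag)  # independent copy; the original tree is untouched
    shell.clear()           # drop every child, keeping the name and attributes
    text = str(shell)       # e.g. '<p class="intro"></p>'
    closing_tag = f"</{shell.name}>"
    if not text.endswith(closing_tag):
        # Void elements such as <br/> are serialized without a closing tag
        return text, ""
    return text[: -len(closing_tag)], closing_tag


markup = '<p class="intro">Hello <b>world</b></p>'
soup = BeautifulSoup(markup, "html.parser")
opening_tag, closing_tag = split_outer_tags(soup.p)
print(opening_tag)
print(closing_tag)
```

Note that the opening tag produced this way is BeautifulSoup's re-serialization of the element, so quoting and whitespace may differ from the original source text.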

Answer 3

As far as I know, the BeautifulSoup API has no built-in method that returns the opening tag as-is, but we can write a small function for it.

from bs4 import BeautifulSoup
from bs4.element import Tag


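# Rebuild the opening tag from the element's name and attributes
def get_opening_tag(element: Tag) -> str:
    """Return the opening tag of the given element."""
    # Multi-valued attributes such as class are stored as lists; join them with spaces
    raw_attrs = {k: ' '.join(v) if isinstance(v, list) else v for k, v in element.attrs.items()}
    # Each attribute carries its own leading space, so a tag without attributes is <p>, not <p >
    attrs = ''.join(f' {k}="{v}"' for k, v in raw_attrs.items())
    return f"<{element.name}{attrs}>"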


def test():

    markup = """
    <html>
        <body>
            <div id="root" class="class--name">
                ...
            </div>
        </body>
    </html>
    """

    # if you're interested in the div tag
    element = BeautifulSoup(markup, 'lxml').select_one("#root")

    print(get_opening_tag(element))


if __name__ == '__main__':
    test()
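The function above covers only the opening tag. Since a closing tag has no attributes, a companion can be sketched from the element's name alone. The name get_closing_tag is an assumption for illustration; the sketch uses the Tag.is_empty_element property to return an empty string for void elements such as <br/>, which have no closing tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def get_closing_tag(element: Tag) -> str:
    """Hypothetical companion: return the closing tag, or '' for a void element."""
    if element.is_empty_element:
        # Void elements such as <br/> are written without a closing tag
        return ""
    return f"</{element.name}>"


soup = BeautifulSoup('<div id="root" class="class--name"><br/>...</div>', "html.parser")
div_closing = get_closing_tag(soup.div)
br_closing = get_closing_tag(soup.br)
print(div_closing)
print(repr(br_closing))
```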

Answer 4

Using BeautifulSoup:

from bs4 import BeautifulSoup, Tag

def get_tags(bs4_element: Tag):
    try:
        opening_tag, closing_tag = str(bs4_element).split(
            ''.join(str(child) for child in bs4_element.children)
        )
        return opening_tag, closing_tag
    except ValueError:
        print('Cannot parse children correctly')
        return None

The function can be used like this, for example:

soup = BeautifulSoup(text, 'html.parser')  # text is the HTML string to inspect

for element in soup.find_all():
    print(get_tags(element))

Old answer:

A simple approach that works only for elements without child elements:

opening_tag, closing_tag = str(element).split(element.text)
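Both variants above split the serialized element on the text of its children. That fails when the element is empty (splitting on an empty separator raises ValueError) and may fail when the children's text is escaped differently in the output, for example & rendered as &amp;. The following is a sketch of one way around this, slicing by length instead of splitting. The name split_tags is an assumption, and it relies on Tag.decode() and Tag.decode_contents() using the same default formatter.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def split_tags(element: Tag):
    """Hypothetical helper: return (opening_tag, closing_tag) by slicing the markup."""
    outer = element.decode()           # the full element, same output as str(element)
    inner = element.decode_contents()  # the children only, escaped the same way
    closing = f"</{element.name}>"
    if not outer.endswith(closing):
        # Void elements such as <br/> have no separate closing tag
        return outer, ""
    # outer is opening + inner + closing, so cut the last two parts off by length
    opening = outer[: len(outer) - len(closing) - len(inner)]
    return opening, closing


soup = BeautifulSoup('<div id="x"><p>a &amp; b</p><p></p></div>', "html.parser")
results = [split_tags(element) for element in soup.find_all()]
for pair in results:
    print(pair)
```

This sketch does not account for namespaced tags, whose serialized name includes a prefix.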
