BeautifulSoup: strip specified attributes, but preserve the tag and its contents

Question

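I'm trying to "defrontpagify" the HTML of a site generated by MS FrontPage, and I'm writing a BeautifulSoup script to do it.

However, I'm stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains it. The code snippet: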

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

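It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone know what the problem is here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-style approach, I'd love to see it.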

Answer 1

The line

for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to do this with findAll; I'm not sure.

However, this works (as of beautifulsoup 4.8.1):

import bs4
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
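# In bs4 the class lives directly on the package; naming the parser avoids a warning.
soup = bs4.BeautifulSoup(doc, 'html.parser')
for tag in soup.descendants:
    if isinstance(tag, bs4.element.Tag):
        # tag.attrs is a dict in bs4, so iterate over its items()
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in REMOVE_ATTRIBUTES}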
print(soup.prettify())

This is the previous code, which may work with older versions of BeautifulSoup:

import BeautifulSoup
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError: 
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Note that this code only works in Python 3. If you need it to work in Python 2, see Nóra's answer below.

Answer 2

This is a Python 2 version of unutbu's answer:

# Assumes bs4 on Python 2, where tag.attrs is a dict (iteritems() needs a dict).
import bs4

REMOVE_ATTRIBUTES = ['lang','language','onmouseover']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

soup = bs4.BeautifulSoup(doc)

for tag in soup.recursiveChildGenerator():
    if hasattr(tag, 'attrs'):
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                    if key not in REMOVE_ATTRIBUTES}

Answer 3

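Just ftr: the problem here is that when you pass an HTML attribute as a keyword argument, the keyword is the name of the attribute. So your code is searching for tags that have an attribute literally named attribute, because the variable is not expanded.

This is why:

- hard-coding the attribute name worked [0]
- the code does not fail; the search just does not match any tags

To fix the problem, pass the attribute you are looking for as a dict: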

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

To someone in the future, dtk

[0]: Although in your example it needs to be find_all(style=True), without the quotes, because otherwise you get SyntaxError: keyword can't be an expression
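
To make the keyword-versus-dict distinction concrete, here is a minimal sketch. It assumes bs4 with the built-in html.parser; the sample HTML and the variable names are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = '<p id="a" align="center">one</p><p id="b" style="color:red">two</p>'
soup = BeautifulSoup(html, "html.parser")

attribute = "align"

# Keyword form: searches for an HTML attribute literally named "attribute".
literal_matches = soup.find_all(attribute=True)

# Dict form: the value of the variable is used as the attribute name.
dict_matches = soup.find_all(attrs={attribute: True})

# Unpacking a dict into keyword arguments behaves like the dict form here.
unpacked_matches = soup.find_all(**{attribute: True})

print(len(literal_matches), len(dict_matches), len(unpacked_matches))
```

The first search comes back empty because no tag carries an attribute called "attribute", which matches the "runs without error but removes nothing" behaviour described in the question.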

Answer 4

I use this:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")

Thanks to https://stackoverflow.com/a/22497855/1907997
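
Since tag.attrs is a plain dict in bs4, the membership check can probably be folded into the removal by giving pop a default. A small sketch under that assumption (the div markup is invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, for illustration only.
soup = BeautifulSoup('<div id="box" align="center">text</div>', "html.parser")
div = soup.find("div")

# pop with a default returns the old value when the attribute exists...
removed = div.attrs.pop("align", None)

# ...and returns the default, without raising KeyError, when it does not.
missing = div.attrs.pop("valign", None)

print(removed, missing, div)
```

The tag and its text are left in place; only the attribute entry disappears.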

Answer 5

I use this method to remove a list of attributes; it's quite compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del:
    for s in soup.find_all():
        s.attrs.pop(attr_del, None)  # no-op when the attribute is absent
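
The loop above walks the whole tree once per attribute name. If that matters, one possible variation is to walk the tree once and filter each tag's attributes against a set. This is only a sketch: strip_attributes is a hypothetical helper name, not part of BeautifulSoup, and the table markup is invented for the example.

```python
from bs4 import BeautifulSoup


def strip_attributes(soup, names):
    """Remove every attribute listed in `names` from all tags, in place."""
    unwanted = frozenset(names)  # set lookup instead of scanning a list
    for tag in soup.find_all(True):  # True matches every tag, not text nodes
        tag.attrs = {k: v for k, v in tag.attrs.items() if k not in unwanted}
    return soup


# Hypothetical sample markup, for illustration only.
doc = '<table border="1" width="100%"><tr><td align="left" id="cell">x</td></tr></table>'
cleaned = strip_attributes(BeautifulSoup(doc, "html.parser"),
                           ["border", "width", "align"])
print(cleaned)
```

Attributes not in the list (here id) are kept, and the tags and their contents are untouched.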
