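I am trying to "defrontpagify" the HTML of a website generated by MS FrontPage, and I am writing a BeautifulSoup script to do it.
However, I am stuck on the part where I strip a particular attribute (or list of attributes) from every tag in the document that contains it. The code snippet: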
REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                     'dir','face','size','color','style','class','width','height','hspace',
                     'border','valign','align','background','bgcolor','text','link','vlink',
                     'alink','cellpadding','cellspacing']
# remove all attributes in REMOVE_ATTRIBUTES from all tags,
# but preserve the tag and its content.
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])
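It runs without error, but it does not actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll('style'=True)), it works.
Does anyone know what the problem is here?
PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I would love to see it.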
The line
    for tag in soup.findAll(attribute=True):
does not find any tags. There may be a way to do this with findAll; I am not sure.
However, this works (as of beautifulsoup 4.8.1):
import bs4
REMOVE_ATTRIBUTES = [
'lang','language','onmouseover','onmouseout','script','style','font',
'dir','face','size','color','style','class','width','height','hspace',
'border','valign','align','background','bgcolor','text','link','vlink',
'alink','cellpadding','cellspacing']
doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
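# An explicit parser avoids the "no parser was explicitly specified" warning.
soup = bs4.BeautifulSoup(doc, 'html.parser')
for tag in soup.descendants:
    # Skip NavigableString nodes; only Tag objects carry attributes.
    if isinstance(tag, bs4.element.Tag):
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in REMOVE_ATTRIBUTES}
print(soup.prettify())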
Here is the previous code, which may work with older versions of beautifulsoup:
import BeautifulSoup
REMOVE_ATTRIBUTES = [
'lang','language','onmouseover','onmouseout','script','style','font',
'dir','face','size','color','style','class','width','height','hspace',
'border','valign','align','background','bgcolor','text','link','vlink',
'alink','cellpadding','cellspacing']
doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError:
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())
Note that this code only runs in Python 3. If you need it to run in Python 2, see Nóra's answer below.
Here is a Python 2 version of unutbu's answer:
# Assumes bs4 (BeautifulSoup 4) under Python 2, where tag.attrs is a dict.
import bs4
REMOVE_ATTRIBUTES = ['lang','language','onmouseover']
doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = bs4.BeautifulSoup(doc, 'html.parser')
for tag in soup.recursiveChildGenerator():
    if hasattr(tag, 'attrs'):
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                     if key not in REMOVE_ATTRIBUTES}
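Just for the record: the problem here is that if you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags that have an attribute literally named attribute, because the variable is not expanded.
This is why hard-coding the attribute name worked [0], and why your loop ran without error yet matched no tags.
To fix this, pass the attribute you are looking for as a dict: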
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]
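Hope this helps someone in the future, dtk
[0]: Although in your example it needs to be find_all(style=True), without the quotes, because 'style'=True raises SyntaxError: keyword can't be an expression.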
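To make the keyword-argument point concrete, here is a minimal sketch that does not use BeautifulSoup at all. `received_keywords` is a hypothetical stand-in function that only reports which keyword names a call receives, which is the same mechanism a keyword-based search relies on when it treats keyword names as attribute names.

```python
def received_keywords(**kwargs):
    """Hypothetical stand-in for a keyword-based search: return the keyword names."""
    return sorted(kwargs)

attribute = "style"

# The keyword is the literal word "attribute"; the variable's value is ignored.
literal = received_keywords(attribute=True)

# Unpacking a dict uses the variable's value as the keyword name.
expanded = received_keywords(**{attribute: True})

print(literal)   # ['attribute']
print(expanded)  # ['style']
```

With BeautifulSoup itself, the attrs={attribute: True} form is probably the safer of the two, since a name such as text would otherwise collide with one of find_all's own parameters.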
I use this:
if "align" in div.attrs:
    del div.attrs["align"]
or
if "align" in div.attrs:
    div.attrs.pop("align")
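Both variants check for the key first. Since tag.attrs behaves like a regular dict in BeautifulSoup 4, pop with a default can likely do the same in one step. A small sketch, using a plain dict named `attrs` as a hypothetical stand-in for div.attrs:

```python
# Plain dict standing in for div.attrs (assumption: bs4-style dict attributes).
attrs = {"id": "firstpara", "align": "center"}

# pop() with a default removes the key if present and never raises KeyError.
removed = attrs.pop("align", None)
missing = attrs.pop("valign", None)

print(attrs)    # {'id': 'firstpara'}
print(removed)  # center
print(missing)  # None
```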
I use this approach to remove a list of attributes; it is quite compact:
attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height",
                     "align", "valign", "color", "bgcolor", "cellspacing",
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del:
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]
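The approaches in this thread can be folded into a single pass over the tree. The following is a sketch rather than a definitive implementation: `strip_attributes` is a hypothetical helper name, and it assumes BeautifulSoup 4 with the built-in html.parser. Using a set keeps the membership test cheap, and rebuilding attrs once per tag avoids the nested loop the question disliked.

```python
from bs4 import BeautifulSoup

def strip_attributes(soup, names):
    """Drop every attribute listed in `names` from every tag, in place."""
    unwanted = set(names)
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in unwanted}
    return soup

doc = '<p id="firstpara" align="center">This is <a onmouseout="">one</a>.</p>'
soup = strip_attributes(BeautifulSoup(doc, "html.parser"), ["align", "onmouseout"])
print(soup)
```

The tags and their text content are preserved; only the listed attributes are dropped, so id survives on the paragraph here.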