I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it safely on a web page. I have a list of allowed tags; how can I process the string to strip any other tags?
Here's a simple solution using BeautifulSoup:
    from bs4 import BeautifulSoup

    VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

    def sanitize_html(value):
        soup = BeautifulSoup(value)
        for tag in soup.findAll(True):
            if tag.name not in VALID_TAGS:
                tag.hidden = True
        return soup.renderContents()
If you also want to remove the contents of the invalid tags, substitute tag.extract() for tag.hidden.
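For illustration, here is the same approach as a single function with a flag choosing between the two behaviors (a sketch assuming bs4 with the stdlib html.parser backend; the strip_contents parameter is my own addition, not from the answer above):

```python
from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value, strip_contents=False):
    soup = BeautifulSoup(value, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            if strip_contents:
                tag.extract()      # drop the tag and everything inside it
            else:
                tag.hidden = True  # drop only the tag markup, keep children
    return soup.encode_contents()  # bytes; modern name for renderContents()

print(sanitize_html('<p>hi <script>evil()</script></p>'))
# -> b'<p>hi evil()</p>'
print(sanitize_html('<p>hi <script>evil()</script></p>', strip_contents=True))
# -> b'<p>hi </p>'
```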
Use lxml.html.clean! It's VERY easy!

    from lxml.html.clean import clean_html
    print clean_html(html)

Suppose the following html:
    html = '''\
    <html>
     <head>
       <script type="text/javascript" src="evil-site"></script>
       <link rel="alternate" type="text/rss" src="evil-rss">
       <style>
         body {background-image: url(javascript:do_evil)};
         div {color: expression(evil)};
       </style>
     </head>
     <body onload="evil_function()">
       <!-- I am interpreted for EVIL! -->
       <a href="javascript:evil_function()">a link</a>
       <a href="#" onclick="evil_function()">another link</a>
       <p onclick="evil_function()">a paragraph</p>
       <div style="display: none">secret EVIL!</div>
       <object> of EVIL! </object>
       <iframe src="evil-site"></iframe>
       <form action="evil-site">
         Password: <input type="password" name="password">
       </form>
       <blink>annoying EVIL!</blink>
       <a href="evil-site">spam spam SPAM!</a>
       <image src="evil!">
     </body>
    </html>'''
The results...

    <html>
      <body>
        <div>
          <style>/* deleted */</style>
          <a href="">a link</a>
          <a href="#">another link</a>
          <p>a paragraph</p>
          <div>secret EVIL!</div>
          of EVIL!
          Password:
          annoying EVIL!
          <a href="evil-site">spam spam SPAM!</a>
          <img src="evil!">
        </div>
      </body>
    </html>

You can customize the elements you want to clean and whatnot.

The solutions offered above via Beautiful Soup will not work. You might be able to hack something together with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. I thought I'd try to solve the problem properly at some point, but it's a week-long project or so, and I don't have a free week any time soon.

Just to be specific, not only will Beautiful Soup throw exceptions for some parsing errors which the above code doesn't catch; there are also lots of very real XSS vulnerabilities that it doesn't catch, such as:

    <<script>script> alert("Haha, I hacked your page."); </</script>script>

Probably the best thing you can do is instead to strip out every < as &lt;, prohibiting all HTML, and then use a restricted subset like Markdown to render formatting properly. In particular, you can then go back and re-introduce common bits of HTML with a regex. I haven't tested this code, so there may be bugs, but you'll see the general idea: you have to blacklist all HTML in general before you whitelist the OK bits. Roughly, it looks like this:
    import re
    from markdown import markdown  # third-party 'markdown' package

    _tc_ = '~(lt)~'  # or whatever, so long as markdown doesn't mangle it.
    _lt_ = re.compile('<')
    # _tc_ contains regex metacharacters, so escape it when matching.
    _ok_ = re.compile(re.escape(_tc_) + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
    _sqrt_ = re.compile(re.escape(_tc_) + 'sqrt>', re.I)      # just to give an example of extending
    _endsqrt_ = re.compile(re.escape(_tc_) + '/sqrt>', re.I)  # html syntax with your own elements.
    _tcre_ = re.compile(re.escape(_tc_))

    def sanitize(text):
        text = _lt_.sub(_tc_, text)
        text = markdown(text)
        text = _ok_.sub(r'<\1>', text)
        text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
        text = _endsqrt_.sub(r'</span>', text)
        return _tcre_.sub('&lt;', text)
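To see the blacklist-then-whitelist trick in isolation, here is a minimal sketch that leaves out the Markdown step (which needs the third-party markdown package); the token handling is otherwise the same:

```python
import re

_tc_ = '~(lt)~'  # placeholder token standing in for '<'
_lt_ = re.compile('<')
# _tc_ contains regex metacharacters, hence re.escape
_ok_ = re.compile(re.escape(_tc_) + r'(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_tcre_ = re.compile(re.escape(_tc_))

def sanitize(text):
    text = _lt_.sub(_tc_, text)      # blacklist: neutralize every '<'
    # ... a real pipeline would run text = markdown(text) here ...
    text = _ok_.sub(r'<\1>', text)   # whitelist: restore harmless tags
    return _tcre_.sub('&lt;', text)  # escape whatever '<' remains

print(sanitize('<b>bold</b> but <script>evil()</script>'))
# -> '<b>bold</b> but &lt;script>evil()&lt;/script>'
```

Only the tags re-introduced by the whitelist regex survive as markup; every other < reaches the browser as &lt;.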
Here is what I use in my own project. The acceptable_elements/attributes lists come from feedparser, and BeautifulSoup does the work, with some small tests to make sure it behaves correctly:
    from BeautifulSoup import BeautifulSoup

    acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
        'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
        'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
        'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img',
        'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol',
        'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
        'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
        'thead', 'tr', 'tt', 'u', 'ul', 'var']

    acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
        'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
        'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
        'colspan', 'color', 'compact', 'coords', 'datetime', 'dir',
        'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
        'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
        'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt',
        'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
        'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
        'usemap', 'valign', 'value', 'vspace', 'width']

    def clean_html(fragment):
        while True:
            soup = BeautifulSoup(fragment)
            removed = False
            for tag in soup.findAll(True):           # find all tags
                if tag.name not in acceptable_elements:
                    tag.extract()                    # remove the bad ones
                    removed = True
                else:                                # it might have bad attributes
                    # a better way to get all attributes?
                    for attr in tag._getAttrMap().keys():
                        if attr not in acceptable_attributes:
                            del tag[attr]
            # turn it back to html
            fragment = unicode(soup)
            if removed:
                # we removed tags and tricky markup could exploit that!
                # we need to reparse the html until it stops changing
                continue  # next round
            return fragment
    tests = [   # text should work
        ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
        # make sure we can't exploit removal of tags
        ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
        # try the same trick with attributes, gives an Exception
        ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>', Exception),
        # no tags should be skipped
        ('<script>bad</script><script>bad</script><script>bad</script>', ''),
        # leave valid tags but remove bad attributes
        ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>',
         '<a href="good" alt="good">1</a>'),
    ]

    for text, out in tests:
        try:
            res = clean_html(text)
            assert res == out, "%s => %s != %s" % (text, res, out)
        except out, e:
            assert isinstance(e, out), "Wrong exception %r" % e
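The `if removed: continue` loop above matters: deleting a tag can splice the surrounding text into a brand-new tag. A toy illustration with a deliberately naive regex stripper (not the parser-based code above) shows why a single pass isn't enough:

```python
import re

SCRIPT = re.compile(r'</?script>')

def strip_once(s):
    # remove literal <script> / </script> tags in one pass
    return SCRIPT.sub('', s)

def strip_to_fixpoint(s):
    # repeat until the output stops changing, like the loop above
    while True:
        t = strip_once(s)
        if t == s:
            return t
        s = t

payload = '<scr<script>ipt>alert(1)</scr</script>ipt>'
print(strip_once(payload))         # -> '<script>alert(1)</script>'  (still dangerous!)
print(strip_to_fixpoint(payload))  # -> 'alert(1)'
```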
Bleach does better, with more useful options. It's built on html5lib and is ready for production. See the documentation for bleach.clean: its default configuration escapes unsafe tags like <script> while allowing useful tags like <a>:

    import bleach
    bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
    # u'&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'

I modified Bryan's solution with BeautifulSoup to address the problem raised by Chris Drost. A little crude, but it does the job:

    from BeautifulSoup import BeautifulSoup, Comment

    VALID_TAGS = {'strong': [],
                  'em': [],
                  'p': [],
                  'ol': [],
                  'ul': [],
                  'li': [],
                  'br': [],
                  'a': ['href', 'title']
                  }

    def sanitize_html(value, valid_tags=VALID_TAGS):
        soup = BeautifulSoup(value)
        comments = soup.findAll(text=lambda text: isinstance(text, Comment))
        [comment.extract() for comment in comments]
        # Some markup can be crafted to slip through BeautifulSoup's parser, so
        # we run this repeatedly until it generates the same output twice.
        newoutput = soup.renderContents()
        while 1:
            oldoutput = newoutput
            soup = BeautifulSoup(newoutput)
            for tag in soup.findAll(True):
                if tag.name not in valid_tags:
                    tag.hidden = True
                else:
                    tag.attrs = [(attr, value) for attr, value in tag.attrs
                                 if attr in valid_tags[tag.name]]
            newoutput = soup.renderContents()
            if oldoutput == newoutput:
                break
        return newoutput

Edit: Updated to support valid attributes.

I use FilterHTML. It's simple and lets you define a well-controlled whitelist, scrubs URLs, and can even match attribute values against a regex or run custom filtering functions per attribute. If used carefully it could be a safe solution. Here's a simplified example from the readme:
    import FilterHTML

    # only allow:
    #   <a> tags with valid href URLs
    #   <img> tags with valid src URLs and measurements
    whitelist = {
        'a': {
            'href': 'url',
            'target': [
                '_blank',
                '_self'
            ],
            'class': [
                'button'
            ]
        },
        'img': {
            'src': 'url',
            'width': 'measurement',
            'height': 'measurement'
        },
    }

    filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)
I prefer a solution based on html5lib, for example:

    import html5lib
    from html5lib import sanitizer, treebuilders, treewalkers, serializer

    def clean_html(buf):
        """Cleans HTML of dangerous tags and content."""
        buf = buf.strip()
        if not buf:
            return buf

        p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
                                tokenizer=sanitizer.HTMLSanitizer)
        dom_tree = p.parseFragment(buf)
        walker = treewalkers.getTreeWalker("dom")
        stream = walker(dom_tree)

        s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False,
                                                     quote_attr_values=True)
        return s.render(stream)
I prefer the lxml.html.clean solution, as nosklo points out above. It can also be extended to remove some empty tags afterwards.