Strip HTML tags not on an allowed list from a Python string

Problem description — Votes: 67, Answers: 9

I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render the string safely on a web page. I have a list of allowed tags; how do I process the string so that any other tags are removed?

python html
9 Answers
41 votes

Here is a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    soup = BeautifulSoup(value, "html.parser")

    # Hide every tag whose name is not in the whitelist; its children still render.
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True

    return soup.renderContents()

If you also want to remove the contents of the invalid tags, substitute tag.extract() for tag.hidden = True.
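For example, a quick usage sketch (the input string is made up, and the exact byte-string output depends on your BeautifulSoup version):

print(sanitize_html('<p>Hi <script>alert(1)</script><em>there</em></p>'))
# roughly b'<p>Hi alert(1)<em>there</em></p>' -- with tag.hidden the <script>
# tag itself disappears but its text survives, which is why tag.extract() is
# usually the safer choice for untrusted input.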

You might also look into lxml and Tidy.


57 votes

Use lxml.html.clean! It's very easy!

Suppose the following HTML:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

Clean it:

from lxml.html.clean import clean_html
print(clean_html(html))

The result:

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize which elements get cleaned, and just about everything else.
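If you want an explicit allow-list rather than the default scrubbing, lxml's Cleaner class supports that too. A minimal sketch (the tag choices are only illustrative; on recent lxml releases the cleaner lives in the separate lxml_html_clean package, installable as lxml[html_clean]):

from lxml.html.clean import Cleaner

cleaner = Cleaner(
    allow_tags=['p', 'a', 'em', 'strong', 'ul', 'li', 'br'],
    remove_unknown_tags=False,   # must be False when allow_tags is given
    safe_attrs_only=True,        # drops onclick= and similar attributes
)

print(cleaner.clean_html('<p onclick="evil()">hi <script>bad()</script></p>'))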


37 votes

The solutions offered above via Beautiful Soup will not work. You might be able to hack something together with Beautiful Soup, above and beyond them, because Beautiful Soup gives you access to the parse tree. I may try to solve the problem properly at some point, but it's a week-long project or so, and I don't have a free week any time soon.

Just to be concrete: not only will Beautiful Soup throw exceptions for some parsing errors that the code above doesn't catch; there are also plenty of very real, not-yet-discovered XSS vulnerabilities it misses, such as:

<<script>script> alert("Haha, I hacked your page."); </</script>script>

Probably the best thing you can do is instead strip out the < character as &lt;, prohibiting all HTML, and then use a restricted subset such as Markdown to render formatting properly. In particular, you can also go back and re-introduce common bits of HTML with a regex. Roughly, the process looks like this:

import re
from markdown import markdown   # pip install markdown

_lt_ = re.compile('<')
_tc_ = '~lt~'   # or whatever, so long as markdown doesn't mangle it and it has
                # no regex metacharacters (it is reused in the patterns below)
_ok_ = re.compile(_tc_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tc_ + 'sqrt>', re.I)      # just to give an example of extending
_endsqrt_ = re.compile(_tc_ + '/sqrt>', re.I)  # html syntax with your own elements
_tcre_ = re.compile(_tc_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)
    text = markdown(text)
    text = _ok_.sub(r'<\1>', text)
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)

I haven't tested that code, so there may be bugs. But you can see the general idea: you have to blacklist all HTML in general before you whitelist the okay bits.
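To illustrate the intended effect, here is a rough usage sketch (not from the original answer; the exact output depends on the markdown package's rendering):

print(sanitize('**hello** <script>alert(1)</script> 3<4'))
# roughly: <p><strong>hello</strong> &lt;script>alert(1)&lt;/script> 3&lt;4</p>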


25 votes

Here is what I use in my own project. The acceptable_elements / attributes come from feedparser, and BeautifulSoup does the work:

from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3; with bs4 you'd use find_all() and the attrs dict

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html( fragment ):
    while True:
        soup = BeautifulSoup( fragment )
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags, and tricky markup could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

A few small tests to make sure it behaves correctly:

tests = [
    # text should work
    ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
    # make sure we can't exploit removal of tags
    ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
    # try the same trick with attributes; gives an Exception
    ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>', Exception),
    # no tags should be skipped
    ('<script>bad</script><script>bad</script><script>bad</script>', ''),
    # leave valid tags but remove bad attributes
    ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>',
     '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e

22 votes

Bleach does better, with more useful options. It is built on html5lib and ready for production. Check the documentation for the bleach.clean function: its default configuration escapes unsafe tags such as <script> while allowing useful tags such as <a>.

import bleach

bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
# '&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'
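If you need your own allow-list instead of Bleach's defaults, clean() accepts explicit tags and attributes arguments; a minimal sketch (the particular tag set is just an illustration):

import bleach

ALLOWED_TAGS = {'p', 'em', 'strong', 'ul', 'li', 'br', 'a'}
ALLOWED_ATTRS = {'a': ['href', 'title']}

bleach.clean('<p onclick="evil()">hi <script>bad()</script></p>',
             tags=ALLOWED_TAGS,
             attributes=ALLOWED_ATTRS,
             strip=True)   # strip disallowed tags instead of escaping them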

10 votes

I modified Bryan's solution with BeautifulSoup above to address the problem raised by Chris Drost. A little crude, but it does the job:



from BeautifulSoup import BeautifulSoup, Comment   # BeautifulSoup 3 API

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value)
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.renderContents()
    while 1:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput)
        for tag in soup.findAll(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = [(attr, value) for attr, value in tag.attrs if attr in valid_tags[tag.name]]
        newoutput = soup.renderContents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: updated to support valid attributes.

3 votes

I use a whitelist-based sanitizer here: it is simple, lets you define a well-controlled whitelist, sanitizes URLs, and can even match attribute values against a regex or run a custom filter function per attribute. Used carefully, it can be a safe solution; its readme includes a simplified example.

2 votes

You can use FilterHTML, which sanitizes using a whitelist.

Example:

import FilterHTML

# only allow:
#   <a> tags with valid href URLs
#   <img> tags with valid src URLs and measurements
whitelist = {
  'a': {
    'href': 'url',
    'target': [
      '_blank',
      '_self'
    ],
    'class': [
      'button'
    ]
  },
  'img': {
    'src': 'url',
    'width': 'measurement',
    'height': 'measurement'
  },
}

filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)

1 vote

I prefer an html5lib-based solution like the one below to lxml.html.clean; it also removes some empty tags here:

import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
                            tokenizer=sanitizer.HTMLSanitizer)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False,
                                                 quote_attr_values=True)
    return s.render(stream)
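Note that newer html5lib releases removed the sanitizer tokenizer used above; roughly equivalent behaviour now lives in a serializer filter. A sketch of a modern variant (treat the option names as assumptions to check against your html5lib version):

import html5lib

def clean_html_modern(buf):
    dom = html5lib.parseFragment(buf, treebuilder="dom")
    return html5lib.serialize(dom, tree="dom", sanitize=True,
                              omit_optional_tags=False,
                              quote_attr_values="always")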
