Special characters and tags missing when parsing HTML with BeautifulSoup

Question

I am trying to parse an HTML document with BeautifulSoup in Python.

But parsing stops at certain special characters, as in this example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''
soup = BeautifulSoup(doc,  'html.parser')
print(soup)

This code should output the whole document. Instead, it prints only

<html>
<body>
<div>And I said «What the %</div></body></html>

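The rest of the document is evidently lost. Parsing stopped at the character combination '&#'.

The question is: how can I configure BeautifulSoup, or preprocess the document, to avoid this kind of problem while losing as little (possibly informative) text as possible?

I am using bs4 version 4.6.0 with Python 3.6.1 on Windows 10.

Update: calling soup.prettify() does not help, because the soup is already broken at that point.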

Answer 1

You need to pass "html5lib" as the parser to the BeautifulSoup constructor instead of "html.parser". For example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''

soup = BeautifulSoup(doc,  'html5lib')
#          different parser  ^

Now, if you print soup, it shows the string you want:

>>> print(soup)
<html><head></head><body>
        <div>And I said «What the %&amp;#@???»</div>
        <div>some other text</div>

</body></html>

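From the "Differences between parsers" section of the documentation:

Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.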
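
The question also asks about preprocessing the document. As an alternative to switching parsers, here is a minimal sketch that escapes any "&" that does not start a well-formed character reference before the markup is parsed. This is not part of the answer above: the helper name escape_stray_ampersands and the regular expression are assumptions made for illustration, and the pattern only treats references that end in ";" as valid.

```python
import re

# Matches an "&" that does NOT begin a well-formed reference such as
# &amp; (named), &#169; (decimal) or &#xA9; (hexadecimal).
# Assumption: a reference counts as valid only if it ends with ";".
_STRAY_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(markup: str) -> str:
    """Replace stray "&" characters with "&amp;" (hypothetical helper)."""
    return _STRAY_AMP.sub("&amp;", markup)


doc = "<div>And I said «What the %&#@???»</div><div>some other text</div>"
cleaned = escape_stray_ampersands(doc)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup(cleaned, 'html.parser'). Note that legacy references written without a trailing ";" (for example "&copy") would also be escaped by this pattern, so it may alter text in documents that rely on them.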
