Python 'latin-1' codec can't encode character - how to ignore characters?

Question

Here is the gist of my code. It tries to get some text from an old website. The site isn't mine, so I can't change the source.

from bs4 import BeautifulSoup
import requests

response = requests.get("https://mattgemmell.com/network-link-conditioner-in-lion/")
data = response.text
soup = BeautifulSoup(data, 'lxml')
article = soup.find_all('article')[0]
text = article.find_all('p')[1].text 
print(text)

This gives:

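'If youâ\x80\x99re a developer of Mac or iOS apps that use networking, thereâ\x80\x99s a new feature in the Developer Tools for Mac OS X 10.7 â\x80\x9cLionâ\x80\x9d (read my review of it at The Guardian) that will be useful to you. This brief article describes how it works.'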

I can convert parts like â\x80\x99 with this:

converted_text = bytes(text, 'latin-1').decode('utf-8')

That actually works.

But if you take a different part of the text:

text = article.find_all('p')[8].text

it gives me:

'\n← Find Patterns in text on Lion\nUsing Spaces on OS X Lion →\n'

and using bytes(text, 'latin-1') gives me:

'latin-1' codec can't encode character '\u2190' in position 1: ordinal not in range(256)

I assume that is the arrow? How can I make all non-Latin characters get ignored and discarded automatically?

Any ideas would be most helpful!

Answer 1

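You don't want to ignore those characters. They are a symptom of the data you received having been decoded with the wrong character encoding. In your case, requests has wrongly guessed that the encoding is latin-1. The real encoding is utf-8, and it is specified in a <meta> tag in the HTML response.

requests is a library for working with HTTP; it knows nothing about HTML. Since the Content-Type header doesn't specify an encoding, requests falls back to a guess (for text/* responses without a declared charset it defaults to ISO-8859-1, i.e. latin-1). BeautifulSoup, however, is a library for working with HTML, and it is very good at detecting encodings. So you want to take the raw bytes from the response and pass those to BeautifulSoup, i.e.: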

from bs4 import BeautifulSoup
import requests

response = requests.get("https://mattgemmell.com/network-link-conditioner-in-lion/")
data = response.content # we now get `content` rather than `text`
assert type(data) is bytes
soup = BeautifulSoup(data, 'lxml')
article = soup.find_all('article')[0]
text = article.find_all('p')[1].text 
print(text)

assert type(text) is str
assert 'Mac OS X 10.7 “Lion”' in text
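The mechanism can also be reproduced without touching the network. The following is only an offline sketch using the standard library; the sample string is taken from the assertion above, and the variable names are made up for illustration. It shows how UTF-8 bytes read as latin-1 produce the garbage seen in the question, why the question's round-trip repairs it, and why that round-trip breaks on a character such as the arrow.

```python
# Offline sketch: standard library only, no request is made.
original = 'Mac OS X 10.7 “Lion”'

# The bytes the server would send for this text in UTF-8.
raw = original.encode('utf-8')

# Decoding UTF-8 bytes as latin-1 never raises, but every multi-byte
# character turns into several wrong characters.
mis_decoded = raw.decode('latin-1')
assert mis_decoded == 'Mac OS X 10.7 â\x80\x9cLionâ\x80\x9d'

# The round-trip from the question undoes exactly that mistake.
repaired = bytes(mis_decoded, 'latin-1').decode('utf-8')
assert repaired == original

# A character outside latin-1 (such as the arrow) cannot be encoded,
# which is the error reported in the question.
try:
    bytes('←', 'latin-1')
    arrow_error = None
except UnicodeEncodeError as exc:
    arrow_error = str(exc)

print(arrow_error)
```

Decoding the raw bytes with the right encoding in the first place, as BeautifulSoup does when given response.content, avoids the need for any round-trip.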

Answer 2

Use the third argument to bytes to tell it how to handle errors:

converted_text = bytes(text, 'latin-1', 'ignore')
                                         ^^^^^^

You will lose the arrows, but everything else stays intact:

>>> text = '\n← Find Patterns in text on Lion\nUsing Spaces on OS X Lion →\n'
>>> converted_text = bytes(text, 'latin-1', 'ignore')
>>> converted_text
b'\n Find Patterns in text on Lion\nUsing Spaces on OS X Lion \n'
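If losing the arrows is not acceptable, one possible alternative is to repair only the runs of characters that look like mis-decoded UTF-8 and leave everything else alone. This is a rough sketch, not part of the answer above: the helper name repair_mojibake is hypothetical, and it assumes the damage came from UTF-8 bytes being read as latin-1.

```python
import re


def repair_mojibake(text):
    """Re-decode runs of latin-1 high characters as UTF-8 where possible.

    Hypothetical helper: characters outside latin-1 (such as arrows) are
    never touched, and runs that are not valid UTF-8 are kept unchanged.
    """
    def fix(match):
        chunk = match.group(0)
        try:
            return chunk.encode('latin-1').decode('utf-8')
        except UnicodeDecodeError:
            # Genuine latin-1 text such as 'é' is not valid UTF-8 on its own.
            return chunk

    # Only code points 0x80-0xFF can be leftovers of a wrong latin-1 decode.
    return re.sub(r'[\x80-\xff]+', fix, text)


mixed = '← â\x80\x9cLionâ\x80\x9d →'
print(repair_mojibake(mixed))
```

This keeps the arrows and restores the curly quotes, but it is a heuristic; fixing the decoding at the source, as in Answer 1, remains the more reliable route.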

Here is more information about the argument from the documentation - https://docs.python.org/3.3/howto/unicode.html:

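The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), or 'ignore' (just leave the character out of the Unicode result).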
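Note that the quoted passage describes decoding. When encoding, as bytes(text, 'latin-1', ...) does, 'strict' raises UnicodeEncodeError and 'replace' substitutes '?'. A small sketch of the common handlers, using a made-up sample string:

```python
arrow_text = '← Lion →'

# errors='strict' is the default and raises on the first bad character.
try:
    arrow_text.encode('latin-1')
except UnicodeEncodeError as exc:
    print('strict:', exc)

# Drop characters that latin-1 cannot represent.
ignored = arrow_text.encode('latin-1', errors='ignore')
assert ignored == b' Lion '

# Substitute a question mark for each of them.
replaced = arrow_text.encode('latin-1', errors='replace')
assert replaced == b'? Lion ?'

# Keep the information as a Python-style escape or an XML character reference.
escaped = arrow_text.encode('latin-1', errors='backslashreplace')
assert escaped == b'\\u2190 Lion \\u2192'

as_entities = arrow_text.encode('latin-1', errors='xmlcharrefreplace')
assert as_entities == b'&#8592; Lion &#8594;'
```

The last two handlers are lossless in the sense that the original character can still be recovered from the output, which may be preferable to 'ignore' when the text is kept for later use.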
