http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70
If I use the following Python code to parse the HTML page above, I get a UnicodeDecodeError.
import sys
from lxml import html
doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
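If I first filter the input with iconv -f utf-8 -t utf-8 -c and then run the same Python code, I still get a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
Is there a robust filter (one that does not need to know the encoding of the input HTML) whose output always works with the Python code? Thanks.
EDIT: Here are the commands I used.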
$ wget 'http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70'
$ ./main.py < 'view.html?doi=10.15430%2FJCP.2018.23.2.70'
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 5365: invalid start byte
$ iconv -f utf-8 -t utf-8 -c < 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | ./main.py
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin, parser = html.HTMLParser(encoding='utf-8'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/lxml/html/__init__.py", line 939, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/parser.pxi", line 370, in lxml.etree._FileReaderContext.copyToBuffer
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5418: invalid continuation byte
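After some research I found that the file is not in utf-8 but in latin1, and the problem is sys.stdin, which uses utf-8. You cannot simply assign a different encoding to sys.stdin; instead, you have to create a new text stream with the right encoding on top of sys.stdin.buffer.
main-latin1.py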
import sys
import io
from lxml import html
#input_stream = sys.stdin # gives error
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin1')
doc = html.parse(input_stream)
print(html.tostring(doc))
Now you can run
cat 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | python main-latin1.py
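To see the wrapping step in isolation, here is a minimal stdlib-only sketch. It uses io.BytesIO as a stand-in for sys.stdin.buffer, and open_as_text is a hypothetical helper name introduced only for illustration. The sample bytes are made up; they contain 0xb0, the byte from the first error message.

```python
import io

def open_as_text(binary_stream, encoding):
    """Wrap a binary stream (e.g. sys.stdin.buffer) in a text stream."""
    return io.TextIOWrapper(binary_stream, encoding=encoding)

# Made-up sample: 0xb0 is the degree sign in latin1,
# but on its own it is not a valid start byte in UTF-8.
raw = b'25 \xb0C'

# Decoding as latin1 succeeds because every byte maps to a character.
latin1_text = open_as_text(io.BytesIO(raw), 'latin1').read()

# Decoding the same bytes as UTF-8 raises UnicodeDecodeError.
try:
    open_as_text(io.BytesIO(raw), 'utf-8').read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(repr(latin1_text), utf8_failed)
```

In the real script the only change is to pass sys.stdin.buffer instead of the io.BytesIO object.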
EDIT: You can also convert the file in the console with iconv -f latin1 -t utf-8
cat 'view.html?doi=10.15430%2FJCP.2018.23.2.70' | iconv -f latin1 -t utf-8 | python main-utf8.py
main-utf8.py
import sys
from lxml import html
doc = html.parse(sys.stdin)
print(html.tostring(doc))
Also, reading the page directly with requests works without problems:
import requests
from lxml import html
r = requests.get('http://www.jcpjournal.org/journal/view.html?doi=10.15430/JCP.2018.23.2.70')
doc = html.fromstring(r.text)
print(html.tostring(doc))
EDIT: You can read the data as bytes and then use a for loop with try/except to decode it with different encodings.
You don't need < when you run it:
myscript filename.html
import sys
from lxml import html

# --- function ---

def decode(data, encoding):
    """Return data decoded with the given encoding, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# --- main ---

# only for test
#sys.argv.append('view.html?doi=10.15430%2FJCP.2018.23.2.70')

if len(sys.argv) == 1:
    print('need file name')
    sys.exit(1)

with open(sys.argv[1], 'rb') as f:
    data = f.read()

# note: latin1 maps every byte to a character, so it never fails
# and cp1250 is never actually reached
for encoding in ('utf-8', 'latin1', 'cp1250'):
    result = decode(data, encoding)
    if result:
        print('encoding:', encoding)
        doc = html.fromstring(result)
        #print(html.tostring(doc))
        break
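The order of the candidate encodings matters in a loop like this: latin1 accepts any byte sequence, so it only makes sense as the last resort. A minimal sketch of that idea, using only the standard library; decode_with_fallback and its default encoding list are assumptions chosen for illustration, not something the page requires.

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252', 'latin1')):
    """Try each encoding in order; return (text, encoding) for the first that works."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding rejected the bytes, try the next one
    # Only reachable if the list does not end with a catch-all such as latin1.
    raise ValueError('none of the encodings could decode the data')

# Valid UTF-8 is accepted by the first candidate.
print(decode_with_fallback('é'.encode('utf-8')))
# 0xb0 is invalid as a UTF-8 start byte, but cp1252 maps it to the degree sign.
print(decode_with_fallback(b'\xb0'))
# 0x81 is undefined in cp1252, so only the latin1 catch-all accepts it.
print(decode_with_fallback(b'\x81'))
```

A successful decode only means the bytes are valid in that encoding, not that the text is what the author intended, so the result is still a guess.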
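EDIT: I tried the chardet (character detection) module, which requests uses, but it gives me windows-1252 (cp1252) instead of latin1. For some reason, though, requests has no problem and decodes the page correctly.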
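import sys
from lxml import html
import chardet

# only for test
#sys.argv.append('view.html?doi=10.15430%2FJCP.2018.23.2.70')

if len(sys.argv) == 1:
    print('need file name')
    sys.exit(1)

with open(sys.argv[1], 'rb') as f:
    data = f.read()

# chardet guesses the encoding from the raw bytes; the guess may be None,
# in which case this falls back to latin1 (an assumption, adjust as needed)
encoding = chardet.detect(data)['encoding'] or 'latin1'
print('encoding:', encoding)
doc = html.fromstring(data.decode(encoding))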
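One plausible explanation for why requests ends up with latin1: as far as I understand its behavior, r.text takes the charset from the HTTP Content-Type header, and for text/* responses without a charset it falls back to ISO-8859-1 (latin1). The sketch below mimics that rule with the standard library only. encoding_from_content_type is a hypothetical helper, not part of requests, and whether this server really omits the charset is an assumption I have not verified.

```python
from email.message import Message

def encoding_from_content_type(content_type):
    """Pick an encoding from a Content-Type header value (a sketch of the rule above)."""
    msg = Message()
    msg['Content-Type'] = content_type
    # An explicit charset parameter wins.
    charset = msg.get_param('charset')
    if charset:
        return charset
    # text/* without a charset falls back to ISO-8859-1 (latin1).
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'
    # Anything else gives no hint at all.
    return None

print(encoding_from_content_type('text/html'))
print(encoding_from_content_type('text/html; charset=UTF-8'))
print(encoding_from_content_type('application/json'))
```

A file saved with wget no longer carries that header, which would explain why the script reading from disk has to guess.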
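You can use str = unicode(str, errors='ignore'), as suggested in the question "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c" (that is Python 2 syntax; in Python 3 the equivalent is data.decode('utf-8', errors='ignore') on a bytes object). This is not always desirable, because the unreadable characters are dropped, but it may be fine for your use case.
Alternatively, it seems lxml can use encoding='unicode' in some cases. Have you tried that?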
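To make the trade-off of errors='ignore' concrete, here is a small Python 3 sketch with made-up sample bytes containing the 0x9c byte from that question. It compares 'ignore', which silently drops the undecodable byte, with 'replace', which leaves a visible U+FFFD marker in its place.

```python
# Made-up sample: 0x9c is not valid UTF-8 at this position.
data = b'abc\x9cdef'

# 'ignore' drops the bad byte, so nothing shows that data was lost.
ignored = data.decode('utf-8', errors='ignore')

# 'replace' substitutes U+FFFD, so the damage stays visible.
replaced = data.decode('utf-8', errors='replace')

print(repr(ignored))
print(repr(replaced))
```

Either way the original character is gone, so this only suits cases where a few lost characters do not matter.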