NLTK - Decoding Unicode in a custom corpus

Question

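I created a custom corpus with NLTK's CategorizedPlaintextCorpusReader.

The .txt files in my corpus contain Unicode characters that I am not able to decode. I assume this is because the reader is a "plaintext" reader, but I need to decode the text nonetheless.

Code: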

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
import os

# Raw string so the backslash in the Windows path is not read as an escape.
mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=os.path.join(r'(neg|pos)', '.*',))

for w in mr.words():
    print(w)

This prints the tokenized words of the files that contain no Unicode characters, and then throws the following error:

for w in mr.words():
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 402, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1168, in readline
    new_chars = self._read(readsize)
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1400, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1431, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python\Python36-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 30: invalid start byte

I tried to decode with:

mr.decode('unicode-escape')

which raised this error:

AttributeError: 'CategorizedPlaintextCorpusReader' object has no attribute 'decode'

I am using Python 3.6.4.

Answer 1

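The problem is that NLTK's corpus readers assume your plaintext files are saved in UTF-8. That assumption is evidently wrong here, because the files were encoded with a different codec. My guess is CP1252 (also known as "Windows Latin-1"), because it is very common and fits your description well: in that encoding the en dash "–" is encoded as the byte 0x96, which is the byte named in the error message.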
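To make the diagnosis concrete, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration and are not taken from the asker's files. It shows that the byte 0x96 is rejected by the UTF-8 codec but decodes to an en dash under CP1252:

```python
# Hypothetical sample: ASCII text with a single 0x96 byte, as CP1252 would store an en dash.
raw = b"good \x96 bad"

try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_error = ""
except UnicodeDecodeError as exc:
    # 0x96 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False
    utf8_error = str(exc)

# The same bytes decode cleanly once the right codec is named.
decoded = raw.decode("cp1252")

print(utf8_ok)
print(utf8_error)
print(repr(decoded))
```

If CP1252 is the wrong guess for your files, the decode may still succeed but produce odd-looking characters, so it is worth checking the result by eye.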

You can specify the encoding of the input files in the constructor of the corpus reader:

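mr = CategorizedPlaintextCorpusReader(
    r'C:\mycorpus',  # raw string keeps the backslash literal
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*',),
    encoding='cp1252')  # decode the files as CP1252 instead of the default UTF-8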

Try this, and check that the non-ASCII characters in the output (dashes, bullets) still come out correctly rather than being replaced with mojibake.
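Before settling on an encoding, it can help to find out which files are actually not valid UTF-8. The following is a rough sketch that uses only the standard library and does not touch NLTK. The helper name `find_non_utf8_files` is hypothetical, and it assumes the corpus consists of .txt files under a single root directory:

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return (path, byte offset) for each .txt file under root that is not valid UTF-8."""
    problems = []
    for path in sorted(Path(root).rglob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that could not be decoded.
            problems.append((path, exc.start))
    return problems
```

Calling `find_non_utf8_files(r'C:\mycorpus')` would list the offending files with the offset of the first bad byte. You can then open a few of them with `encoding='cp1252'` and confirm the text looks right before passing that encoding to the corpus reader.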
