I am creating a corpus from a set of text files as follows:
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
Now I want to access the words of a file like this:
text_bow = newcorpus.words("file_name.txt")
But I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
Several of the files throw this error. How can I get rid of this UnicodeDecodeError?
To resolve the decoding error, do the following.
First, find out which encoding your files actually use. Maybe try https://stackoverflow.com/a/16203777/610569 or ask the source of your data.
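If asking the data source is not an option, one rough way to narrow it down is to try a few candidate encodings and keep the first one that decodes cleanly. The sketch below is only a heuristic, not part of NLTK: the helper names `guess_encoding_bytes` and `guess_encoding` and the candidate list are assumptions made for illustration. Note that latin-1 accepts every byte sequence, so it belongs last and a match on it proves nothing by itself.

```python
# Hypothetical helpers (not an NLTK API): try candidate encodings in order
# and return the first one that decodes the data without an error.
CANDIDATES = ("utf-8", "cp1252", "latin-1")  # assumed candidates, strictest first


def guess_encoding_bytes(data, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes `data`, else None."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return encoding
    return None


def guess_encoding(path, candidates=CANDIDATES):
    """Read a file as raw bytes and guess its encoding."""
    with open(path, "rb") as handle:
        return guess_encoding_bytes(handle.read(), candidates)
```

A clean decode only means the bytes are valid in that encoding, not that the text is right, so it is worth eyeballing a few decoded lines before settling on the result.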
Then use the encoding= parameter of PlaintextCorpusReader, e.g. for latin-1:
newcorpus = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')
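To see why the byte in the error message matters when picking the encoding, here is a small standard-library sketch. The helper `try_decode` is made up for this illustration. Byte 0x96 is a continuation byte in UTF-8 and so cannot start a character; in Windows-1252 (cp1252) it is an en dash, while latin-1 maps it to an invisible control character. If the files came from a Windows tool, cp1252 may therefore be the better value to pass as `encoding`, though that is a guess about the data rather than something the error proves.

```python
# The byte reported in the traceback.
raw = b"\x96"


def try_decode(data, encoding):
    """Decode `data`, returning None instead of raising on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_result = try_decode(raw, "utf-8")       # None: invalid start byte
cp1252_result = try_decode(raw, "cp1252")    # "\u2013", an en dash
latin1_result = try_decode(raw, "latin-1")   # "\x96", a C1 control character

print(repr(utf8_result), repr(cp1252_result), repr(latin1_result))
```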
From the code at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py, the default encoding is 'utf8':
class PlaintextCorpusReader(CorpusReader):
    """
    Reader for corpora that consist of plaintext documents. Paragraphs
    are assumed to be split using blank lines. Sentences and words can
    be tokenized using the default tokenizers, or by custom tokenizers
    specificed as parameters to the constructor.

    This corpus reader can be customized (e.g., to skip preface
    sections of specific document formats) by creating a subclass and
    overriding the ``CorpusView`` class variable.
    """

    CorpusView = StreamBackedCorpusView
    """The corpus view class used by this reader. Subclasses of
    ``PlaintextCorpusReader`` may specify alternative corpus view
    classes (e.g., to skip the preface sections of documents.)"""

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):