DictReader and UnicodeError

Question (votes: 1, answers: 2)
import csv
import io

def openFile(fileName):
    try:
        # io.open with an encoding returns unicode lines, not byte strings
        trainFile = io.open(fileName, "r", encoding="utf-8")
    except IOError as e:
        print("File could not be opened: {}".format(e))
    else:
        trainData = csv.DictReader(trainFile)
        print trainData
        return trainData

def computeTFIDF(trainData):
    bodyList = []
    print "Inside computeTFIDF"
    for row in trainData:  # the traceback below points at this line
        for key, value in row.iteritems():
            print key, unicode(value, "utf-8", "ignore")
    print "Done"
    return

if __name__ == "__main__":
    print "Main"
    trainData = openFile("../Data/TrainSample.csv")
    print "File Opened"
    computeTFIDF(trainData)

Error:

Traceback (most recent call last):
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 62, in <module>
    computeTFIDF(trainData)
  File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 42, in computeTFIDF
    for row in trainData:
  File "C:\Python27\lib\csv.py", line 104, in next
    row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 215: ordinal not in range(128)

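TrainSample.csv is a CSV file with 4 columns (with a header row). OS: Windows 7 64-bit. Using Python 2.x.

I don't know what is going wrong here. I told it to ignore encoding errors, but it still throws the same error.

I think the error is raised before control ever reaches the decoding call.

Can anyone tell me where I'm going wrong?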

python python-2.7 csv unicode python-unicode
2 Answers

4 votes

The Python 2 csv module does not handle Unicode input.
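As a rough illustration of where the error in the question comes from (this is a reading of the traceback, not something the answer states): io.open(..., encoding="utf-8") hands unicode lines to the reader, and the Python 2 csv reader turns them back into byte strings with the default ASCII codec. The same failure can be triggered directly. The sample line is made up for the example.

```python
# Hypothetical sample line containing the character from the traceback (U+201C).
line = u"id,body\n1,\u201cquoted\u201d\n"

try:
    # Roughly what the implicit unicode -> str conversion amounts to.
    line.encode("ascii")
    failed = False
    message = ""
except UnicodeEncodeError as e:
    failed = True
    message = str(e)

print(message)
```

This is also why passing "ignore" to unicode() in the question has no effect: the exception is raised while the reader fetches the row, before that call runs.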

Open the file in binary mode and decode after parsing it as CSV. This is safe for the UTF-8 codec, because newlines, delimiters and quotes all encode to 1 byte.
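A small check of that claim, as a sketch rather than a proof: every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never collide with the single-byte delimiter, quote or newline characters. The sample characters are arbitrary picks for the example.

```python
# Characters outside ASCII, including the one from the traceback (U+201C).
samples = [u"\u201c", u"\u201d", u"\u00e9", u"\u4e2d"]

# Bytes the csv parser treats as structure: delimiter, quote, CR, LF.
structural = set(bytearray(b',"\r\n'))

for ch in samples:
    encoded = bytearray(ch.encode("utf-8"))
    # Every byte of a multi-byte UTF-8 sequence is 0x80 or above ...
    assert all(b >= 0x80 for b in encoded)
    # ... so none of them can be mistaken for a structural byte.
    assert not structural.intersection(encoded)

# The structural characters themselves are a single byte each.
single_byte = all(len(c.encode("utf-8")) == 1 for c in u',"\r\n')
```

An encoding such as UTF-16 does not have this property, so parsing the raw bytes first would not be safe there.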

The csv module documentation includes a UnicodeReader wrapper class in its examples section that does the decoding for you; it is easily adapted to the DictReader class:

import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        # Decode each value with the configured encoding (keys stay byte strings)
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self

Use this when opening the file in binary mode:

def openFile(fileName):
    try: 
        trainFile  = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)

0 votes

I can't comment on Martijn's answer, but his solution worked perfectly for me after a small upgrade, which I'm leaving here for others:

    def next(self):
        row = self.reader.next()
        try:
            d = dict((unicode(k, self.encoding), unicode(v, self.encoding)) for k, v in row.iteritems())
        except TypeError:
            d = row
        return d

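One point is that Python 2.6 and lower do not support dict comprehensions. Also, dict values can be of different types, which the unicode function cannot handle, so it is worth catching TypeError for null or numeric values. Another thing that caught me out is that it does not work when you open the file with an encoding! Just use a plain open().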
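One possible refinement, sketched under the assumption that only byte strings need decoding: the try/except above returns the whole row undecoded as soon as a single value is not a string (for example the None that DictReader fills in for a missing column). Decoding value by value keeps the rest of the row. The helper names decode_value and decode_row are hypothetical, not part of either answer.

```python
def decode_value(value, encoding="utf-8"):
    """Decode byte strings; pass anything else (None, lists, numbers) through."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2.6+
        return value.decode(encoding)
    return value


def decode_row(row, encoding="utf-8"):
    # dict() over a generator also works on Python 2.6, unlike a dict comprehension.
    return dict((decode_value(k, encoding), decode_value(v, encoding))
                for k, v in row.items())


# Hypothetical row as a byte-oriented DictReader might yield it for a short line.
row = {b"id": b"1", b"body": b"\xe2\x80\x9chello\xe2\x80\x9d", b"tags": None}
decoded = decode_row(row)
```

Inside the wrapper class, next() could then return decode_row(self.reader.next(), self.encoding).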
