我可以导入CSV文件并自动推断分隔符吗?

问题描述 投票:43回答:5

我想导入两种CSV文件,有些使用“;”用于分隔符和其他人使用“,”。到目前为止,我一直在接下来的两行之间切换:

reader=csv.reader(f,delimiter=';')

要么

reader=csv.reader(f,delimiter=',')

是否有可能不指定分隔符并让程序检查正确的分隔符?

下面的解决方案(Blender和sharth)似乎适用于以逗号分隔的文件(使用Libroffice生成),但不适用于以分号分隔的文件(使用MS Office生成)。以下是一个以分号分隔的文件的第一行:

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
python file csv import delimiter
5个回答
45
投票

csv模块似乎建议使用csv sniffer来解决这个问题。

他们给出了以下示例,我已根据您的情况进行了调整。

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

我们来试试吧。

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

我们的样本输入

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

如果我们执行示例程序:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

它也可能值得注意我使用的是什么版本的python。

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2

7
投票

为了解决这个问题,我创建了一个函数,它读取文件的第一行(标题)并检测分隔符。

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header=myCsvfile.readline()
        if header.find(";")!=-1:
            return ";"
        if header.find(",")!=-1:
            return ","
    #default delimiter (MS Office export)
    return ";"

7
投票

给定一个处理两者的项目,(逗号)和| (垂直条)分隔的CSV文件,形式良好,我尝试了以下(在https://docs.python.org/2/library/csv.html#csv.Sniffer给出):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

但是,在| -delimited文件上,返回了“无法确定分隔符”异常。如果每条线具有相同数量的分隔符(不包括引号中可能包含的内容),推测嗅探启发式可能最有效是合理的。因此,我没有读取文件的前1024个字节,而是尝试完整地阅读前两行:

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

到目前为止,这对我来说效果很好。


6
投票

如果您使用DictReader,您可以这样做:

#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

我和Python 3.5一起使用它,它就是这样的。


2
投票

我不认为可以有一个完美的通用解决方案(我可能使用,作为分隔符的原因之一是我的一些数据字段需要能够包括; ...)。一个简单的启发式决定可能只是简单地读取第一行(或更多),计算它包含多少,;字符(可能忽略那些内部引号,如果有什么创建你的.csv文件引用条目正确和一致),并猜测两者中更常见的是正确的分隔符。

© www.soinside.com 2019 - 2024. All rights reserved.