Importing a CSV file whose values are enclosed in " when some of them also contain " and commas

Question. Votes: 4, Answers: 2

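I think I have searched thoroughly, but if I missed something, please let me know.

I am trying to import a CSV file in which all non-numeric values are enclosed in ". I ran into a problem with: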

 df = pd.read_csv("file.csv")

CSV example:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""

Because the values contain extra quotes and commas, pandas sees more than 4 columns in these rows (5 or 6, for example).

I have already tried

df = pd.read_csv("file.csv", quotechar='"', quoting=2)

but got

ParserError: Error tokenizing data (...)

What does work is skipping the bad lines with

error_bad_lines=False

But I would rather take all of the data into account than simply drop some of it.

Many thanks for your help!

python pandas csv quotes
2 Answers
2
votes

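This looks like malformed CSV data, because the " characters inside values should be escaped. I usually see such values escaped either by doubling them or by prefixing them with \. See https://en.wikipedia.org/wiki/Comma-separated_values#cite_ref-13

The first thing I would do is fix whatever exports these files. If you can't do that, you can work around the problem by escaping every " that is part of a value.

Your best bet is probably to assume that a " is followed (or preceded) by a comma or newline only when it delimits a value. Then you can use a regex (working from memory, so it may not be 100% right, but it should give you the idea; you will have to adapt it to whichever regex library you have handy)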

s/([^,\n])"([^,\n])/$1""$2/g
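As a rough illustration, the substitution above could be written in Python along these lines. This is only a sketch under some assumptions: the sample text is inlined as `raw`, the names `INNER_QUOTE` and `double_inner_quotes` are made up for this example, and lookarounds are swapped in for the two capture groups so that neighbouring characters are not consumed by a match. It uses the standard-library `csv` module to check the result. A " that is part of a value but sits directly next to a comma would still be misread, so treat it as a heuristic.

```python
import csv
import io
import re

# Sample data from the question, inlined for illustration.
raw = '''"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
'''

# A quote is treated as part of a value when it is neither preceded
# nor followed by a comma or a newline.
INNER_QUOTE = re.compile(r'(?<=[^,\n])"(?=[^,\n])')


def double_inner_quotes(text):
    """Escape quotes inside values by doubling them."""
    return INNER_QUOTE.sub('""', text)


fixed = double_inner_quotes(raw)

# Doubled quotes are the default escaping understood by csv.reader.
rows = list(csv.reader(io.StringIO(fixed)))
for row in rows:
    print(len(row), row[3])
```

pandas also treats doubled quotes as escaped quotes by default, so passing `io.StringIO(fixed)` to `pd.read_csv` should likewise yield four columns.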

So if you ran that substitution over your sample file, it would be escaped like this:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company ""MoscowMining"" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company"" Jankowski,A,B"""

Alternatively, use the following substitution, which escapes with a backslash instead:

s/([^,\n])"([^,\n])/$1\"$2/g
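The backslash variant can be sketched in Python in much the same way. As before, this is an illustration rather than a definitive implementation: `raw`, `INNER_QUOTE` and `backslash_inner_quotes` are hypothetical names, lookarounds stand in for the capture groups, and the standard-library `csv` module is used to check the result. The reader has to be told about the escape character explicitly.

```python
import csv
import io
import re

# The two problematic rows from the question, plus the header.
raw = '''"Business focus","Country","City","Company Name"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
'''

# Same heuristic: a quote not adjacent to a comma or newline is inner.
INNER_QUOTE = re.compile(r'(?<=[^,\n])"(?=[^,\n])')


def backslash_inner_quotes(text):
    """Escape quotes inside values by prefixing them with a backslash."""
    # A function replacement sidesteps backslash handling in re.sub templates.
    return INNER_QUOTE.sub(lambda match: '\\"', text)


escaped = backslash_inner_quotes(raw)

# Tell the reader that a backslash escapes the next character.
reader = csv.reader(io.StringIO(escaped), escapechar="\\", doublequote=False)
rows = list(reader)
for row in rows:
    print(len(row), row[3])
```

`pd.read_csv` accepts `escapechar` and `doublequote` parameters as well, so the same settings should carry over to pandas.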

The file would then be escaped as follows:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""

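Depending on your CSV parser, one of these should be accepted and work as expected.

If, as @exe suggests, your CSV parser also requires commas inside values to be escaped, you can apply a similar regex to replace the commas.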


0
votes

If I understand correctly, what you need is to escape the quotes and commas before pandas reads the CSV.

Like this:

"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski\,A\,B\""