Importing a dirty CSV file with unwanted characters and strings

Question

I want to import a CSV file with pandas. Usually my data comes in the following form:

a,b,c,d
a1,b1,c1,d1
a2,b2,c2,d2

where a, b, c, d are the headers. Here I can simply use pandas.read_csv. However, now I have data stored like this:

"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""

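How can I clean this up in the most efficient way? How do I remove the quotes wrapped around each whole line so that the columns can be detected? And then how do I remove all the remaining " characters?

Thanks a lot for your help!

I don't know how to approach this.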

Answer 1

You can use sed to convert the file into whatever format you need.

Here is a simple example of using sed that matches your problem:

$ cat file 
"a1a1;"a1a1";"a1a1";"a1a1""
$ cat file | sed 's/"//g'
a1a1;a1a1;a1a1;a1a1

sed 's/"//g'

This replaces every " character with nothing. The trailing g tells sed to do this for every " on the line, not just the first one it finds.
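
For readers more at home in Python, the effect of the g flag can be sketched with re.sub, whose count argument plays a comparable role. This is only an analogy for a single line of input, not a description of how sed works internally; the variable names are made up for illustration.

```python
import re

# Sample line, the same as in the sed example above.
line = '"a1a1;"a1a1";"a1a1";"a1a1""'

# Comparable to sed 's/"//' (no g flag): only the first match is removed.
first_only = re.sub('"', '', line, count=1)

# Comparable to sed 's/"//g': every match is removed
# (count defaults to 0, which means "replace all").
all_removed = re.sub('"', '', line)

print(first_only)
print(all_removed)
```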

I see you edited the question, so here is an update for the new input:

$ cat file
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""
$ cat file | sed 's/"//g' | sed 's|\\||g' 
a1;b1;c1;d1
a2;b2;c2;d2
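
If sed is not available, the same two substitutions can be done in Python. The sketch below is an assumed equivalent of the sed pipeline above, not part of the original answer; clean_text is a hypothetical helper name, and the sample text is the data from the question.

```python
import io

# Translation table that deletes every double quote and every backslash,
# mirroring sed 's/"//g' followed by sed 's|\\||g'.
DELETE_QUOTES_AND_BACKSLASHES = str.maketrans("", "", '"\\')


def clean_text(text: str) -> str:
    """Return text with all double quotes and backslashes removed."""
    return text.translate(DELETE_QUOTES_AND_BACKSLASHES)


# Sample data from the question, written as a raw string so that
# the backslashes are kept literally.
raw = r'''"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""
'''

cleaned = clean_text(raw)
print(cleaned)

# A file-like object that a CSV reader can consume directly.
buffer = io.StringIO(cleaned)
```

The buffer can then be handed to pandas.read_csv together with sep=";", which avoids writing an intermediate file. Like the sed version, this also deletes any quotes or backslashes that are a legitimate part of the data.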

Answer 2

If you need or want to do it in Python:

To remove only the leading and trailing quotes:



# Strip the quotes that wrap each whole line
with open('abcd.csv', "r") as file1, open('abcd-new.csv', "w") as file2:
    for line in file1:
        line = line.rstrip("\n")  # drop the newline so endswith() can see the closing quote
        if line.startswith("\"") and line.endswith("\""):
            line = line[1:-1]
        print(line)
        file2.write(line + "\n")

If you also need to remove the \" sequences:



# Strip the wrapping quotes, then remove the remaining quotes and backslashes
with open('abcd.csv', "r") as file1, open('abcd-new.csv', "w") as file2:
    for line in file1:
        line = line.rstrip("\n")  # drop the newline so endswith() can see the closing quote
        if line.startswith("\"") and line.endswith("\""):
            line = line[1:-1]
        line = line.replace("\"", "")
        line = line.replace("\\", "")
        print(line)
        file2.write(line + "\n")

Answer 3

Here is one option using read_csv (I'm sure this can be done better):

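import pandas as pd

# A regex separator makes the python engine split on ";" without honoring
# the quotes, so the quotes around each line do not hide the columns.
df = (pd.read_csv("input.csv", sep=r";|;\\?", engine="python")
      .replace(r'[\\"]', "", regex=True)  # remove leftover quotes and backslashes from the cells
      .pipe(lambda df_: df_
            .set_axis(df_.columns.str.strip('"'), axis=1))  # remove quotes from the header names
     )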

Output:


print(df)

    a   b   c   d
0  a1  b1  c1  d1
1  a2  b2  c2  d2
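
Another way to look at this file is as one CSV nested inside another: each line is a single quoted field whose inner quotes are escaped with a backslash, and the content of that field is itself semicolon-separated. The sketch below decodes the two layers with the standard csv module instead of deleting characters. It is an alternative that assumes every line follows the pattern shown in the question; the variable names are chosen for illustration.

```python
import csv

# Sample data from the question (raw strings keep the backslashes).
raw_lines = [
    r'"a;b;c;d"',
    r'"a1;\"b1\";\"c1\";\"d1\""',
    r'"a2;\"b2\";\"c2\";\"d2\""',
]

# Pass 1: each line is one quoted field in which inner quotes are
# escaped with a backslash, so unwrap it with escapechar.
outer = csv.reader(raw_lines, quotechar='"', escapechar="\\", doublequote=False)
inner_lines = [row[0] for row in outer]

# Pass 2: the unwrapped lines are ordinary semicolon-separated CSV.
rows = list(csv.reader(inner_lines, delimiter=";"))

header, records = rows[0], rows[1:]
print(header)
print(records)
```

The header and records can then be passed to pandas.DataFrame(records, columns=header). Unlike the delete-everything approaches, this keeps any semicolons or quotes that are a real part of a field value.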
