Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quote at the start of the JSON column may be causing the errors. Using df = pandas.read_csv('file.csv'), this is the dataframe I get:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
The output I want:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
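I have already tried adding quotechar='"' as an argument, but it still does not give me the result I want. Is there a way to tell pandas to ignore the first and last quotes surrounding the JSON value?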
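As an alternative, you can read the file manually, parse each line yourself, and use the resulting data to construct the dataframe. Split each line from the front and from the back to peel off the unproblematic columns, then take whatever remains as the JSON column: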
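import pandas as pd

data = []

with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        # Split off id and employee from the front.
        split = row.split(',', 2)
        # Split off createdAt from the back and drop the surrounding quotes.
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)

# The first parsed line is the header.
df = pd.DataFrame(data[1:], columns=data[0])
print(df)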
This displays your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
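The split-from-both-ends idea can be checked without a file on disk. The following is a minimal sketch, assuming that id, employee and createdAt never contain a comma themselves (only the JSON column may). The helper name split_line and the in-memory SAMPLE string are illustrative and not part of the original answer.

```python
import io
import json

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file.
SAMPLE = '''id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
'''


def split_line(line):
    """Split one line into [id, employee, details, createdAt] (hypothetical helper)."""
    # The first two commas separate id and employee from the rest.
    head = line.strip().split(',', 2)
    # The last comma separates the JSON text from createdAt.
    tail = [cell.strip('"') for cell in head[-1].rsplit(',', 1)]
    return head[:2] + tail


rows = [split_line(line) for line in io.StringIO(SAMPLE) if line.strip()]
df_split = pd.DataFrame(rows[1:], columns=rows[0])

# Every details cell should now be valid JSON text.
parsed = [json.loads(cell) for cell in df_split['details']]
```

If one of the outer columns could contain a comma, this positional splitting would misplace the boundaries, so it is only suitable for files shaped like the sample.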
I reproduced your file:
df = pd.read_csv('e1.csv', index_col=None)
print(df)
which yields
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
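I think there is a better way by passing a regular expression such as sep=r',"|",|(?<=\d),' together with some combination of other parameters, but I have not managed to work it out.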
Here is a less-than-optimal option:
df = pd.read_csv('s083838383.csv', sep='@#$%^', engine='python')
header = df.columns[0]
print(df)
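Why sep='@#$%^'? It is just garbage that lets you read the file as if it had no separator character. It could be any string that never occurs in the data; it is only a device for getting each line into the df object as a single column.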
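To make the effect of the junk separator concrete, here is a small sketch that reads the question's sample from an in-memory string and compares the result with simply collecting the raw lines. The SAMPLE string and the variable names are assumptions made for illustration; the only claim is that a separator which never matches leaves each line in one column.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the file.
SAMPLE = '''id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
'''

# A multi-character separator is treated as a regular expression by the
# python engine; this one never matches, so nothing is split.
one_col = pd.read_csv(io.StringIO(SAMPLE), sep='@#$%^', engine='python')

# Equivalent idea without a separator trick: keep the raw lines as they are.
lines = SAMPLE.splitlines()
by_hand = pd.DataFrame({lines[0]: lines[1:]})
```

Both frames hold one column whose name is the whole header line, which is why the header is recovered later from df.columns[0].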
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
Then you can use str.extract to apply a regular expression and expand the matches into columns:
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
                                expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need to remove the leading and trailing quotes from the details string values, you can do this:
result['details'] = result['details'].str.strip('"')
If the details items need to be dicts rather than strings, you can do this:
from json import loads
result['details'] = result['details'].apply(loads)
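Putting the extract, strip and loads steps together, a self-contained sketch might look like the following. It works on an in-memory copy of the question's rows instead of a file, and it strips the quotes column by column with .str.strip rather than applymap; both choices are assumptions made for this example, not part of the original answer.

```python
import json

import pandas as pd

# Raw lines copied from the question (header first).
raw_lines = [
    'id,employee,details,createdAt',
    '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"',
    '2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"',
]
header, *records = raw_lines

# One capture group per column; the third group is anchored on "{ ... }".
parts = pd.Series(records).str.extract(r'(.+),(.+),("\{.+\}"),(.+)', expand=True)
parts.columns = header.split(',')

# Drop the wrapping double quotes from the two quoted columns.
for col in ('details', 'createdAt'):
    parts[col] = parts[col].str.strip('"')

# Turn the JSON text into Python dicts.
parts['details'] = parts['details'].apply(json.loads)
```

After this, an expression such as parts.loc[1, 'details']['Salary'] reads a value directly out of the parsed dict.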