I have a TXT file with nearly 2 million records. The values are separated by "|" and there is no header row, so the records look like this:
340658221|1540.0|1500.0|40.0|2023-10-23|PAGADO|
342103146|650.5|645.0|0.5|2023-10-23|PENDIENTE|
348263107|0.0|0.0|0.5|2023-09-08|LIQUIDADO|
I tried to load it into a DataFrame with the following code:
datos_tmp = pd.read_csv('fuente_tmp.txt', encoding='latin-1', sep="|")
But the result looks like this:
ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS | | |
---|---|---|
340658221 | "|1540.0|1500.0|40.0|" | "2023-10-23|PAGADO|" |
342103146 | "|650.5|645.0|0.5|" | "2023-10-23|PENDIENTE|" |
348263107 | "|0.0|0.0|0.5|" | "2023-09-08|LIQUIDADO|" |
The idea is for it to look like this:
ID_CLIENTE | SALDO | CAPITAL | INTERES | FECHA_CORTE | ESTATUS |
---|---|---|---|---|---|
340658221 | 1540.0 | 1500.0 | 40.0 | 2023-10-23 | PAGADO |
342103146 | 650.5 | 645.0 | 0.5 | 2023-10-23 | PENDIENTE |
348263107 | 0.0 | 0.0 | 0.5 | 2023-09-08 | LIQUIDADO |
I then tried a different approach, as follows:
# Columns declaration
columnas = ['ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS']
# CSV load and dataframe creation
datos_tmp = pd.read_csv(archivo_txt, names=columnas, encoding='latin-1', sep="\t", header=None)
# Convert object column to str
datos_tmp['ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS'].astype(str)
# Replace commas by spaces
datos_tmp= datos_tmp.apply(lambda x: x.str.replace(",", " "))
# Remove "|" from the end of the line
for fila in datos_tmp:
    valores = fila.rstrip("|")
# Replace "|" by commas
datos_tmp = datos_tmp.apply(lambda x: x.str.replace("|", ","))
# CSV file path mapping for result
datos_tmp_nuevo = r'fuente_tmp_nuevo.csv'
# Convert dataframe to CSV
datos_tmp.to_csv(datos_tmp_nuevo, sep=",", index=False)
# Loading the generated file, now with commas, to create the final dataframe
datos_saldo = pd.read_csv(datos_tmp_nuevo , encoding='latin-1', sep=",")
# Verifying dataframe dimensions
datos_saldo.shape
#The result is 4, 1, instead of 4, 6
# Column separation using commas
datos_saldo = pd.DataFrame([x.split(",") for x in datos_tmp])
# New verifying dataframe dimensions
datos_saldo.shape
#Now the result is 2, 1
How can I fix this?
Based on your example, the following works correctly:
import pandas as pd
import io
data = io.StringIO('''\
340658221|1540.0|1500.0|40.0|2023-10-23|PAGADO|
342103146|650.5|645.0|0.5|2023-10-23|PENDIENTE|
348263107|0.0|0.0|0.5|2023-09-08|LIQUIDADO|
''')
names='ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS'.split(',')
datos_tmp = pd.read_csv(data, sep='|', names=names, index_col=False)
print(datos_tmp)
Output:
   ID_CLIENTE   SALDO  CAPITAL  INTERES FECHA_CORTE    ESTATUS
0   340658221  1540.0   1500.0     40.0  2023-10-23     PAGADO
1   342103146   650.5    645.0      0.5  2023-10-23  PENDIENTE
2   348263107     0.0      0.0      0.5  2023-09-08  LIQUIDADO
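The key detail is the trailing "|" on every line: it gives each row a seventh, empty field, and `index_col=False` tells pandas not to treat the first column as the index when a file has delimiters at the end of each line. A minimal sketch of the same call against a file on disk follows, assuming the file is Latin-1 encoded as in the question. The temporary file only stands in for the real `fuente_tmp.txt`, and parsing `FECHA_CORTE` as a date is an optional extra that the question did not ask for.

```python
import os
import tempfile

import pandas as pd

# Column names taken from the question; the file itself has no header row.
names = ['ID_CLIENTE', 'SALDO', 'CAPITAL', 'INTERES', 'FECHA_CORTE', 'ESTATUS']

# Stand-in for the real file (assumption: same layout, trailing "|" on each line).
sample = (
    "340658221|1540.0|1500.0|40.0|2023-10-23|PAGADO|\n"
    "342103146|650.5|645.0|0.5|2023-10-23|PENDIENTE|\n"
    "348263107|0.0|0.0|0.5|2023-09-08|LIQUIDADO|\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fuente_tmp.txt')
    with open(path, 'w', encoding='latin-1') as fh:
        fh.write(sample)

    datos_tmp = pd.read_csv(
        path,
        sep='|',
        names=names,                   # supply the header ourselves
        index_col=False,               # do not use the first column as the index
        encoding='latin-1',            # encoding used in the question
        parse_dates=['FECHA_CORTE'],   # optional: convert the date column
    )

print(datos_tmp.shape)
print(datos_tmp.dtypes)
```

With the real file, only `path` needs to change; the intermediate CSV and the string replacements from the question should no longer be needed.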