Convert a "|"-delimited file into a dataframe in Python


I have a TXT file with almost 2 million records. The values are separated by "|" and there is no header row, so the records look like this:

340658221|1540.0|1500.0|40.0|2023-10-23|PAGADO|
342103146|650.5|645.0|0.5|2023-10-23|PENDIENTE|
348263107|0.0|0.0|0.5|2023-09-08|LIQUIDADO|

I tried to convert it into a dataframe with the following code:

datos_tmp = pd.read_csv('fuente_tmp.txt', encoding='latin-1', sep="|")

But the content comes out like this:

ID_CLIENTE, SALDO, CAPITAL, INTERES, FECHA_CORTE, ESTATUS
340658221 "|1540.0|1500.0|40.0|" "2023-10-23|PAGADO|"
342103146 "|650.5|645.0|0.5|" "2023-10-23|PENDIENTE|"
348263107 "|0.0|0.0|0.5|" "2023-09-08|LIQUIDADO|"

The idea is for it to look like this:

ID_CLIENTE SALDO CAPITAL INTERES FECHA_CORTE ESTATUS
340658221 1540.0 1500.0 40.0 2023-10-23 PAGADO
342103146 650.5 645.0 0.5 2023-10-23 PENDIENTE
348263107 0.0 0.0 0.5 2023-09-08 LIQUIDADO

I then tried a different approach, as follows:

# Columns declaration
columnas = ['ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS']

# CSV load and dataframe creation
datos_tmp = pd.read_csv(archivo_txt, names=columnas, encoding='latin-1', sep="\t", header=None)

# Convert object column to str
datos_tmp['ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS'].astype(str)

# Replace commas by spaces
datos_tmp= datos_tmp.apply(lambda x: x.str.replace(",", " "))

# Remove "|" from the end of the line
for fila in datos_tmp:
    valores = fila.rstrip("|")

# Replace "|" by commas
datos_tmp = datos_tmp.apply(lambda x: x.str.replace("|", ","))

# CSV file path mapping for result
datos_tmp_nuevo = r'fuente_tmp_nuevo.csv'

# Convert dataframe to CSV
datos_tmp.to_csv(datos_tmp_nuevo, sep=",", index=False)

# Loading the generated file, now with commas, to create the final dataframe
datos_saldo = pd.read_csv(datos_tmp_nuevo , encoding='latin-1', sep=",")

# Verifying dataframe dimensions
datos_saldo.shape
#The result is 4, 1, instead of 4, 6

# Column separation using commas
datos_saldo = pd.DataFrame([x.split(",") for x in datos_tmp])

# New verifying dataframe dimensions
datos_saldo.shape
#Now the result is 2, 1

How can I solve this?

python pandas dataframe split
1 Answer

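Based on your sample, the following works correctly. Each line ends with a trailing "|", so pandas sees an empty seventh field; passing the six column names together with index_col=False stops it from using the first column as the index: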

import pandas as pd
import io

data = io.StringIO('''\
340658221|1540.0|1500.0|40.0|2023-10-23|PAGADO|
342103146|650.5|645.0|0.5|2023-10-23|PENDIENTE|
348263107|0.0|0.0|0.5|2023-09-08|LIQUIDADO|
''')

names='ID_CLIENTE,SALDO,CAPITAL,INTERES,FECHA_CORTE,ESTATUS'.split(',')
datos_tmp = pd.read_csv(data, sep='|', names=names, index_col=False)
print(datos_tmp)

Output:

   ID_CLIENTE   SALDO  CAPITAL  INTERES FECHA_CORTE    ESTATUS
0   340658221  1540.0   1500.0     40.0  2023-10-23     PAGADO
1   342103146   650.5    645.0      0.5  2023-10-23  PENDIENTE
2   348263107     0.0      0.0      0.5  2023-09-08  LIQUIDADO
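
The same call should carry over to a file on disk. The following is a minimal sketch rather than a tested solution for the full 2-million-row file: it writes the three sample rows to a temporary file (a hypothetical stand-in for fuente_tmp.txt), keeps the latin-1 encoding from the question, and additionally assumes you want FECHA_CORTE parsed as a date and ESTATUS stored as a category to save memory. Both of those extras are optional assumptions, not part of the original answer.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for 'fuente_tmp.txt': the three sample rows from the question.
sample = (
    "340658221|1540.0|1500.0|40.0|2023-10-23|PAGADO|\n"
    "342103146|650.5|645.0|0.5|2023-10-23|PENDIENTE|\n"
    "348263107|0.0|0.0|0.5|2023-09-08|LIQUIDADO|\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="latin-1", delete=False
) as tmp:
    tmp.write(sample)
    archivo_txt = tmp.name

names = ['ID_CLIENTE', 'SALDO', 'CAPITAL', 'INTERES', 'FECHA_CORTE', 'ESTATUS']

datos_saldo = pd.read_csv(
    archivo_txt,
    sep='|',
    names=names,                   # the file has no header row
    index_col=False,               # do not turn the first column into the index
    encoding='latin-1',
    parse_dates=['FECHA_CORTE'],   # assumption: dates are wanted as datetimes
    dtype={'ESTATUS': 'category'}, # assumption: few distinct status values
)

os.remove(archivo_txt)  # clean up the temporary stand-in file

print(datos_saldo.shape)  # (3, 6)
print(datos_saldo.dtypes)
```

With your real file, replace archivo_txt with the path to fuente_tmp.txt and drop the temporary-file setup.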