我在 pandas 数据框列值中有逗号
,
,如下所示。我已经尝试了 pandas read_fwf
和 read_csv
方法提供的所有参数,但似乎没有任何效果。
注意:我已准备好txt.gz文件,但无法使用StringIO
示例输入:
"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7
我的代码
pd.read_fwf(input_filepath,compression='gzip',sep=',',quoting=csv.QUOTE_ALL,skipinitialspace=True,header=None,nrows=1000, usecols=[0,1,7])
预期输出
0 1 2 3 4 5 6
0 NTE| NaN| NaN| NANTONG JIACHENG GARMENTS CO. LTD | P| 0| 1
1 LEH| NaN| NaN| WUHAN YIZHAO TRADING CO. LTD. | P| 0| 2
2 ARN| NaN| NaN| Clinical Diagnostic Solutions Inc.| P| 0| 7
添加了示例数据文件 数据文件
我只考虑
usecols=[0,1,7]
进行处理。
我正在寻找 pandas 的方法来解决它,而不是使用正则表达式,因为我的数据中有很多这样的逗号。请帮助我解决这个问题。
经过广泛的交谈:
import io
import re
import pandas as pd
import tarfile
#Working on the gunzipped and untarred txt file
f = 'RequestoEAP_20220220_test.txt'
s1 = list()
s2 = list()
re_brackets = re.compile('\((.*)\)')
with open(f, 'r') as f:
for l in f:
s1.append(l[:23])
if (m := re_brackets.search(l)):
s2.append(m[1])
else:
s2.append('')
df1 = pd.read_csv(io.StringIO('\n'.join(s1)), sep=' ', header=None)
df2 = pd.read_csv(io.StringIO('\n'.join(s2)), sep=',', header=None)
df = pd.concat((df1, df2), axis=1)
print(df)
输出:
0 1 0 1 2 3 4 5 6 ... 24 25 26 27 28 29 30 31 32
0 2022-02-20 00:00:10.061 6016293021 JKT ID AP SUB ID AP ... N 0 1 FOCIDIGTW NaN NaN jms:WebSphere_MQ-default-sender)................. main True
1 2022-02-20 00:00:10.061 8910112455 BJX MX AM VSA MX AM ... 5 0 1 980446133 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
2 2022-02-20 00:00:10.061 9640651705 NLU MX AM MTY MX AM ... G 0 1 988122138 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
3 2022-02-20 00:00:10.061 9410678701 JKT ID AP SRG ID AP ... N 0 1 FOCIDIGTW NaN NaN jms:WebSphere_MQ-default-sender)................. main True
4 2022-02-20 00:00:10.061 7120027014 BOM IN AP CPH DK EU ... D 0 1 530852429 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
5 2022-02-20 00:00:10.062 9473172225 LCY GB EU ZLS GB EU ... N 0 1 135104716 NaN NaN jms:WebSphere_MQ-default-sender)................. main True
6 2021-11-09 00:00:00.200 5988409265 PVG CN CN NTE FR EU ... P 0 1 969762616 NaN NTE jms:WebSphere_MQ-default-sender NaN NaN
7 2021-11-09 00:00:00.202 9876963305 SZX CN CN LEH FR EU ... P 0 2 606734345 NaN NaN jms:WebSphere_MQ-default-sender NaN NaN
8 2021-11-09 00:00:00.292 5697446005 TMB US AM ARN SE EU ... P 0 7 962003035 NaN STO jms:WebSphere_MQ-default-sender NaN NaN
在广泛聊天之前:
import io
s = """"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7
"""
pd.read_csv(io.StringIO(s), sep=',', header=None)
您使用了错误的分隔符并且指定了压缩,这一切似乎都没有必要。 输出:
0 1 2 3 4 5 6
0 NTE NaN NaN NANTONG JIACHENG GARMENTS CO., LTD P 0 1
1 LEH NaN NaN WUHAN YIZHAO TRADING CO. , LTD. P 0 2
2 ARN NaN NaN Clinical Diagnostic Solutions, Inc. P 0 7