quotechar does not work in pandas read_fwf

Question

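I have commas (,) inside the column values of my pandas DataFrame, as shown below. I have tried every parameter offered by the pandas read_fwf and read_csv methods, but nothing seems to work.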

Note: my data is in a ready-made .txt.gz file, so I cannot use StringIO.

Sample input:

"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7

My code:

pd.read_fwf(input_filepath,compression='gzip',sep=',',quoting=csv.QUOTE_ALL,skipinitialspace=True,header=None,nrows=1000, usecols=[0,1,7])

Expected output:

      0   1   2                                    3    4   5   6
0   NTE| NaN| NaN| NANTONG JIACHENG GARMENTS CO. LTD |  P|  0|  1
1   LEH| NaN| NaN| WUHAN YIZHAO TRADING CO. LTD.     |  P|  0|  2
2   ARN| NaN| NaN| Clinical Diagnostic Solutions Inc.|  P|  0|  7

I have added a sample data file.

I am only considering usecols=[0,1,7] for processing.

I am looking for a pandas way to solve this rather than a regular expression, because my data contains many such commas. Please help me solve this.

python pandas csv
1 Answer

After an extended chat:

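import io
import re
import pandas as pd

# Working on the gunzipped and untarred txt file
path = 'RequestoEAP_20220220_test.txt'
prefixes = []  # first 23 characters of each line (date and time)
payloads = []  # comma-separated text found between parentheses

# Greedy match: everything from the first '(' to the last ')' on the line
re_brackets = re.compile(r'\((.*)\)')

with open(path, 'r') as fh:
    for line in fh:
        prefixes.append(line[:23])
        if (m := re_brackets.search(line)):
            payloads.append(m[1])
        else:
            # No parentheses on this line: append an empty payload
            payloads.append('')

# Date and time are separated by a single space
df1 = pd.read_csv(io.StringIO('\n'.join(prefixes)), sep=' ', header=None)
# The payload is ordinary comma-separated data
df2 = pd.read_csv(io.StringIO('\n'.join(payloads)), sep=',', header=None)

# Put the two frames side by side
df = pd.concat((df1, df2), axis=1)
print(df)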

Output:

           0             1           0    1   2   3    4   5   6   ... 24  25  26         27  28   29                                                 30    31    32
0  2022-02-20  00:00:10.061  6016293021  JKT  ID  AP  SUB  ID  AP  ...  N   0   1  FOCIDIGTW NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
1  2022-02-20  00:00:10.061  8910112455  BJX  MX  AM  VSA  MX  AM  ...  5   0   1  980446133 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
2  2022-02-20  00:00:10.061  9640651705  NLU  MX  AM  MTY  MX  AM  ...  G   0   1  988122138 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
3  2022-02-20  00:00:10.061  9410678701  JKT  ID  AP  SRG  ID  AP  ...  N   0   1  FOCIDIGTW NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
4  2022-02-20  00:00:10.061  7120027014  BOM  IN  AP  CPH  DK  EU  ...  D   0   1  530852429 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
5  2022-02-20  00:00:10.062  9473172225  LCY  GB  EU  ZLS  GB  EU  ...  N   0   1  135104716 NaN  NaN  jms:WebSphere_MQ-default-sender).................  main  True
6  2021-11-09  00:00:00.200  5988409265  PVG  CN  CN  NTE  FR  EU  ...  P   0   1  969762616 NaN  NTE                    jms:WebSphere_MQ-default-sender   NaN   NaN
7  2021-11-09  00:00:00.202  9876963305  SZX  CN  CN  LEH  FR  EU  ...  P   0   2  606734345 NaN  NaN                    jms:WebSphere_MQ-default-sender   NaN   NaN
8  2021-11-09  00:00:00.292  5697446005  TMB  US  AM  ARN  SE  EU  ...  P   0   7  962003035 NaN  STO                    jms:WebSphere_MQ-default-sender   NaN   NaN
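One thing to watch in the code above: a line without parentheses contributes an empty string, and read_csv skips blank lines by default, so df2 could end up with fewer rows than df1. The following is only a sketch with made-up payloads (the list contents are hypothetical, not taken from the real file) showing how skip_blank_lines=False keeps one row per input line:

```python
import io

import pandas as pd

# Hypothetical payloads: the middle line had no parentheses, so it is ''
payloads = ["a,b,c", "", "d,e,f"]
text = "\n".join(payloads)

# Default behaviour: the blank line is dropped
default_df = pd.read_csv(io.StringIO(text), sep=",", header=None)

# Keep the blank line as a row of NaN so row positions still match
aligned_df = pd.read_csv(
    io.StringIO(text), sep=",", header=None, skip_blank_lines=False
)

print(len(default_df), len(aligned_df))
```

If every line in the real file contains parentheses, this does not matter and the original call is fine.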

Before the extended chat:

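import io
import pandas as pd

# Sample rows from the question; quoted fields may contain commas
s = """"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1
"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2
"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7
"""
# The default quotechar '"' keeps commas inside quoted fields intact
pd.read_csv(io.StringIO(s), sep=',', header=None)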

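You are using the wrong separator and you are specifying compression, none of which seems necessary. read_fwf is meant for fixed-width columns, whereas this data is comma-separated with quoted fields, so read_csv with its default quotechar of '"' handles it. For a file whose name ends in .gz, read_csv also infers the compression by default.

Output: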

     0   1   2                                    3  4  5  6
0  NTE NaN NaN   NANTONG JIACHENG GARMENTS CO., LTD  P  0  1
1  LEH NaN NaN      WUHAN YIZHAO TRADING CO. , LTD.  P  0  2
2  ARN NaN NaN  Clinical Diagnostic Solutions, Inc.  P  0  7
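Since the question says the data sits in a .txt.gz file and StringIO is not an option, here is a rough sketch of the same read_csv call pointed at a gzipped file on disk. The temporary file and its name sample.txt.gz are assumptions made only so the example runs on its own; in practice you would pass your own input_filepath.

```python
import gzip
import os
import tempfile

import pandas as pd

sample = (
    '"NTE","","","NANTONG JIACHENG GARMENTS CO., LTD","P",0,1\n'
    '"LEH","","","WUHAN YIZHAO TRADING CO. , LTD.","P",0,2\n'
    '"ARN","","","Clinical Diagnostic Solutions, Inc.","P",0,7\n'
)

# Hypothetical stand-in for the real .txt.gz file
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
    fh.write(sample)

# read_csv (not read_fwf) honours sep and quotechar
df_gz = pd.read_csv(
    gz_path,
    compression="gzip",  # could be left out: inferred from the .gz suffix
    sep=",",
    quotechar='"',
    header=None,
    skipinitialspace=True,
    nrows=1000,
)
print(df_gz)
```

usecols can be added to this call as well, as long as every index exists in the file; the sample rows shown here only have columns 0 to 6.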