Python code to convert a malformed txt file to a csv file


I am trying to convert a malformed text file with space-separated values into a clean csv file. Please guide me.

Below is my data. The values do not line up correctly with the output columns in the csv.

HP TRA ID        CL ID              IN/EId      No    Loop  Element Name                    Freq  STATUS         Error Severity  Error ID         Message                                                                                                                                                                                                                                                                                                       Report Source

13ZI       20712800032                                                             1     Denied         Error           HP_DOSOlderTh  Date of service is older than 12 months                                                                                                                                                                                                                                                                       HP           
13ZI       20712800032                 1                                           1     Rejected      Error           CA16            Rejected at  level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.                                                                                                                        HP           
13ZI       20712800032                 2                                           1     Rejected      Error           CA16            Rejected at  level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.                                                                                                                        HP           
13ZI       20712800032                 3                                           1     Rejected      Error           CA16            Rejected at  level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.                                                                                                                        HP           
1P8TY0J25       20712805263                                                             1     Denied         Error           HP_DOSOlderTh  Date of service is older than 12 months 

I tried the code below, but it did not work.

import pandas as pd

df = pd.read_csv('file1.txt', sep='\t', index_col=False, dtype='object')
df.to_csv(r'Report.csv', index=None)

I also tried the lines below. The data still does not line up correctly with the columns in the csv.

df = pd.read_csv("file1.txt", sep=r"\s{2,}", engine="python")
df.to_csv(r'Report.csv', index = None) 

I am expecting output like this:

[image: expected output]

python pandas parsing export-to-csv
1 Answer

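Yes, that is a malformed txt file. I would not expect any built-in function to handle it. You may be able to write some custom code to convert it into a proper CSV file. For example, here is some code that works for the sample input (and is very specific to it), which you could modify to work on the real file. But it is also possible that the file is fundamentally ambiguous, in which case you would have to use your knowledge of what the data means to manually adjust the result of any conversion.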

import re

table = []
with open('input.txt') as f:
    lines = f.readlines()

for (i, line) in enumerate(lines):
    # Drop the trailing newline but keep any trailing spaces.
    line = line.rstrip('\n')

    if i == 0:
        # header
        # In the header, we can assume that no 'cell' would be empty,
        # so we can just split on runs of 2 or more spaces.
        row = re.split(r' {2,}', line)
        table.append(row)
        continue

    if i == 1:
        assert line == ''
        # blank line between header and data
        continue

    # For all other lines, we have to look at runs of 2+ spaces
    # and decide what they 'mean'.

    def replfunc(mo):
        L = len(mo.group(0))

        # Some 'Message' values say:
        # "Rejected at  level. DupKeyID..."
        # i.e., there's a run of 2 spaces *within* a cell value.
        # Deal with this particular case.
        if L == 2:
            (start, end) = mo.span()
            if (
                line[:start].endswith('Rejected at')
                and
                line[end:].startswith('level.')
            ):
                # Replace it with a single space.
                return ' '

        # Otherwise, this run of spaces is equivalent to
        # one or more field-separators.
        # We'll replace it with tabs and then split on tabs.

        if L < 2:
            assert 0
        elif 2 <= L <= 12:
            return '\t'
        elif L == 17:
            return '\t\t'
        elif L == 43:
            return '\t\t\t'
        elif L == 61:
            return '\t\t\t\t\t'
        elif L == 120:
            return '\t'
        elif L == 263:
            return '\t'
        else:
            return f'<{L}>'

    tabbed_line = re.sub(r'\s{2,}', replfunc, line)
    row = tabbed_line.split('\t')
    table.append(row)

# ---------------------------

# The rest is just to display the resulting table nicely.

max_n_fields = max(
    len(row)
    for row in table
)

field_widths = []
for j in range(max_n_fields):
    field_width = max(
        len(row[j])
        for row in table
        if j < len(row)
    )
    field_widths.append(field_width)

for (i, row) in enumerate(table):
    for (field, field_width) in zip(row, field_widths):
        print(field.ljust(field_width), end='|')
    print()

This is the output:

HP TRA ID|CL ID      |IN/EId|No|Loop|Element Name|Freq|STATUS  |Error Severity|Error ID     |Message                                                                                                                                          |Report Source|
13ZI     |20712800032|      |  |    |            |1   |Denied  |Error         |HP_DOSOlderTh|Date of service is older than 12 months                                                                                                          |HP           ||
13ZI     |20712800032|      |1 |    |            |1   |Rejected|Error         |CA16         |Rejected at level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP           ||
13ZI     |20712800032|      |2 |    |            |1   |Rejected|Error         |CA16         |Rejected at level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP           ||
13ZI     |20712800032|      |3 |    |            |1   |Rejected|Error         |CA16         |Rejected at level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP           ||
1P8TY0J25|20712805263|      |  |    |            |1   |Denied  |Error         |HP_DOSOlderTh|Date of service is older than 12 months                                                                                                          |
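The code above only prints the table, while the question asks for a CSV file. As a minimal sketch of that last step, the standard `csv` module could write the rows out. The helper name `write_table_csv` and the miniature `sample_table` are hypothetical, made up for illustration, and `Report.csv` is the file name from the question. The sketch assumes the first row is the header, and that a row may end with an extra empty cell or be shorter than the header, as both happen in the output above.

```python
import csv

def write_table_csv(table, path):
    """Write rows to a CSV file, normalising rows to the header's width."""
    n_cols = len(table[0])  # assumption: the first row is the header
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for row in table:
            # Remove leftover padding around each cell.
            cells = [cell.strip() for cell in row]
            # Drop surplus empty cells at the end (caused by trailing spaces).
            while len(cells) > n_cols and cells[-1] == '':
                cells.pop()
            # Pad short rows so every record has n_cols fields.
            cells += [''] * (n_cols - len(cells))
            # Rows with surplus non-empty cells are written unchanged,
            # so they remain visible for manual review.
            writer.writerow(cells)

# Hypothetical miniature table with the same quirks as the output above.
sample_table = [
    ['HP TRA ID', 'CL ID', 'STATUS', 'Report Source'],
    ['13ZI', '20712800032', 'Denied', 'HP', ''],  # surplus empty cell
    ['1P8TY0J25', '20712805263', 'Denied'],       # shorter than the header
]
write_table_csv(sample_table, 'Report.csv')
```

With the real data, `table` from the code above would take the place of `sample_table`. `csv.writer` quotes any cell that contains a comma, so the 'Message' text should stay in a single column.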