Reading a tab-delimited text file that contains multiple tables, Python/Pandas

Question  Votes: 0  Answers: 2

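I have a tab-delimited text file that contains multiple tables. There are 13 known tables, and each has a known number of columns (the column count differs from table to table). However, each table may hold anywhere from 1 to 10,000 records, and the record count per table is not known in advance. I am trying to read this text file with Python/Pandas and create multiple DataFrames. The file looks like the snippet shown below. A table always starts with %T, the column headers always start with %F, and each record always starts with %R. Note that the first few lines of the file always contain non-tabular information, and the number of such irrelevant lines varies from file to file.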

Random Text[tab]More Random Text[tab]Even More Random Text
And Still More Random Text[tab]Text[tab]Text
%T[tab]TABLE_1_NAME
%F[tab]COL_1.1[tab]COL_1.2[tab]COL_1.3
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%T[tab]TABLE_2_NAME
%F[tab]COL_2.1[tab]COL_2.2[tab]COL_2.3[tab]COL_2.4
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
...and so on...
%E

Any ideas on how to get multiple DataFrames out of this file?

python pandas csv multiple-tables
2 Answers
0
votes

First, split the file into several files that pandas can read. Then load each file:

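import itertools

def splitFile(infilepath):
    """Write each table to its own table_N.tsv file that pandas can read."""
    outfileNum = itertools.count(1)
    outfile = None
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith("%T"):
                # A new table starts: close the previous file, open the next one.
                if outfile is not None:
                    outfile.close()
                outfile = open(f"table_{next(outfileNum)}.tsv", 'w')
            elif line.startswith(("%F", "%R")) and outfile is not None:
                # Drop the leading marker and its tab so header and rows line up.
                outfile.write(line.partition("\t")[2])
            # Anything else (the preamble, the %T name, %E) is not written.
    if outfile is not None:
        outfile.close()
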

Now that you have a set of .tsv files, you can load them all into a list of DataFrames:

import glob
import pandas as pd

dfs = [pd.read_csv(fname, delimiter='\t') for fname in glob.glob("*.tsv")]
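
One caveat with the list above: glob.glob returns paths in arbitrary order, so dfs[0] is not guaranteed to be the first table. The following is a minimal sketch, assuming the files are named table_N.tsv as written by splitFile and that each one starts with a header row. The helpers table_number and load_tables are hypothetical names introduced here for illustration; they are not part of the original answer.

```python
import re
from pathlib import Path

import pandas as pd


def table_number(path):
    """Extract N from a file name such as table_12.tsv (assumed naming)."""
    match = re.search(r"table_(\d+)\.tsv$", path.name)
    return int(match.group(1))


def load_tables(directory="."):
    """Return the DataFrames ordered by table number rather than glob order."""
    paths = sorted(Path(directory).glob("table_*.tsv"), key=table_number)
    return [pd.read_csv(path, sep="\t") for path in paths]
```

Sorting on the integer suffix rather than the file name keeps table_10.tsv after table_2.tsv, which a plain string sort would not do.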


0
votes

I'm assuming [tab] is the tab character \t:

from io import StringIO

import pandas as pd

data = """Random Text\tMore Random Text\tEven More Random Text
And Still More Random Text\tText\tText
%T\tTABLE_1_NAME
%F\tCOL_1.1\tCOL_1.2\tCOL_1.3
%R\tValue\tValue\tValue
%R\tValue\tValue\tValue
%R\tValue\tValue\tValue
%T\tTABLE_2_NAME
%F\tCOL_2.1\tCOL_2.2\tCOL_2.3\tCOL_2.4
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%E"""

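tables = {}
for line in map(str.strip, data.splitlines()):
    if line.startswith("%T"):
        # New table: start a fresh list of lines, keyed by the table name.
        current_table = list()
        tables[line.split(maxsplit=1)[-1]] = current_table
    elif line.startswith(("%F", "%R")):
        # Drop the leading marker; keep the tab-separated header or record.
        current_table.append(line.split(maxsplit=1)[-1])

# Each table is now a block of TSV text that pandas can parse on its own.
for k, v in tables.items():
    df = pd.read_csv(StringIO("\n".join(v)), sep="\t")
    print("TABLE =", k)
    print(df)
    print("-" * 80)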

Prints:

TABLE = TABLE_1_NAME
  COL_1.1  COL_1.2  COL_1.3
0     Value    Value  Value
1     Value    Value  Value
2     Value    Value  Value
--------------------------------------------------------------------------------
TABLE = TABLE_2_NAME
  COL_2.1  COL_2.2  COL_2.3  COL_2.4
0     Value    Value  Value    Value
1     Value    Value  Value    Value
2     Value    Value  Value    Value
3     Value    Value  Value    Value
4     Value    Value  Value    Value
--------------------------------------------------------------------------------
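
To apply the same idea to a file on disk rather than an inline string, something like the sketch below may work. It assumes the file is well formed: every %T line carries a table name, every %F line precedes its %R lines, and every record has as many fields as its header. The function name read_tables is hypothetical, and the choice to parse with csv.reader instead of pd.read_csv is a substitution made here, not the approach from the answer above.

```python
import csv

import pandas as pd


def read_tables(path):
    """Parse a %T/%F/%R file into a dict mapping table name to DataFrame."""
    raw = {}        # table name -> {"columns": [...], "rows": [[...], ...]}
    current = None  # the table currently being filled, if any
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields:
                continue  # skip blank lines
            marker, values = fields[0], fields[1:]
            if marker == "%T":
                current = {"columns": [], "rows": []}
                raw[values[0]] = current
            elif marker == "%F" and current is not None:
                current["columns"] = values
            elif marker == "%R" and current is not None:
                current["rows"].append(values)
            elif marker == "%E":
                break  # end-of-file marker
            # Any other line (the preamble) is ignored.
    return {
        name: pd.DataFrame(table["rows"], columns=table["columns"])
        for name, table in raw.items()
    }
```

Because the rows are passed in as lists of strings, every column comes back with string values; pd.to_numeric can convert individual columns afterwards if needed.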