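I have a tab-delimited text file that contains multiple tables. There are 13 known tables, each with a known number of columns (the column count differs from table to table). However, each table may hold anywhere from 1 to 10,000 records, and the number of records per table is not known in advance. I am trying to read this text file with Python/Pandas and create multiple DataFrames. The file looks like the snippet below. A table always starts with %T, column headers always start with %F, and records always start with %R. Note that the first few lines of the file always contain non-tabular information, and the number of these irrelevant lines varies from file to file.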
Random Text[tab]More Random Text[tab]Even More Random Text
And Still More Random Text[tab]Text[tab]Text
%T[tab]TABLE_1_NAME
%F[tab]COL_1.1[tab]COL_1.2[tab]COL_1.3
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value
%T[tab]TABLE_2_NAME
%F[tab]COL_2.1[tab]COL_2.2[tab]COL_2.3[tab]COL_2.4
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
%R[tab]Value[tab]Value[tab]Value[tab]Value
...and so on...
%E
Any ideas on how to get multiple DataFrames out of this file?
First, split the file into several files that pandas can read. Then load each file:
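import itertools

def splitFile(infilepath):
    """Write each %T table in infilepath to its own table_N.tsv file."""
    outfileNum = itertools.count(1)
    outfile = None
    with open(infilepath) as infile:
        for line in infile:
            if line.startswith("%T"):
                # A new table starts: close the previous file, open the next one.
                if outfile is not None:
                    outfile.close()
                outfile = open(f"table_{next(outfileNum)}.tsv", 'w')
            elif line.startswith(("%F", "%R")) and outfile is not None:
                # Drop the leading marker column so that pandas sees only
                # the header row and the data rows (the %T and %E lines are skipped).
                outfile.write(line.partition("\t")[2])
    if outfile is not None:
        outfile.close()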
Now that you have a bunch of tsv files, you can load them all into a list of DataFrames:
import glob
import pandas as pd
dfs = [pd.read_csv(fname, delimiter='\t') for fname in glob.glob("*.tsv")]
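One thing to watch with the list above: glob.glob does not promise any particular order, and a plain alphabetical sort would put table_10.tsv before table_2.tsv. A minimal sketch of one way around that, assuming the files are named table_N.tsv as in the split step (table_number and dfs_by_number are hypothetical names, not part of the original answer):

```python
import glob
import re

import pandas as pd


def table_number(fname):
    """Return the integer N from a file name like 'table_N.tsv' (hypothetical helper)."""
    match = re.search(r"table_(\d+)\.tsv$", fname)
    if match is None:
        raise ValueError(f"unexpected file name: {fname}")
    return int(match.group(1))


# Sort numerically so that table_2.tsv comes before table_10.tsv.
fnames = sorted(glob.glob("table_*.tsv"), key=table_number)

# Map each table number to its DataFrame instead of relying on list position.
dfs_by_number = {table_number(f): pd.read_csv(f, sep="\t") for f in fnames}
```

With this, dfs_by_number[1] would be the first table in the file, dfs_by_number[2] the second, and so on.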
I'm assuming that [tab] is the tab character \t:
from io import StringIO

import pandas as pd

# The [tab] separators are written as \t escapes so that they survive copy and paste.
data = """Random Text\tMore Random Text\tEven More Random Text
And Still More Random Text\tText\tText
%T\tTABLE_1_NAME
%F\tCOL_1.1\tCOL_1.2\tCOL_1.3
%R\tValue\tValue\tValue
%R\tValue\tValue\tValue
%R\tValue\tValue\tValue
%T\tTABLE_2_NAME
%F\tCOL_2.1\tCOL_2.2\tCOL_2.3\tCOL_2.4
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%R\tValue\tValue\tValue\tValue
%E"""

tables = {}
for line in map(str.strip, data.splitlines()):
    if line.startswith("%T"):
        # Start a new table, keyed by its name.
        current_table = list()
        tables[line.split(maxsplit=1)[-1]] = current_table
    elif line.startswith(("%F", "%R")):
        # Drop the leading marker and keep the tab-separated fields.
        current_table.append(line.split(maxsplit=1)[-1])

# Build one DataFrame per table; the %F line becomes the header row.
for k, v in tables.items():
    df = pd.read_csv(StringIO("\n".join(v)), sep="\t")
    print("TABLE =", k)
    print(df)
    print("-" * 80)
Prints:
TABLE = TABLE_1_NAME
  COL_1.1 COL_1.2 COL_1.3
0   Value   Value   Value
1   Value   Value   Value
2   Value   Value   Value
--------------------------------------------------------------------------------
TABLE = TABLE_2_NAME
  COL_2.1 COL_2.2 COL_2.3 COL_2.4
0   Value   Value   Value   Value
1   Value   Value   Value   Value
2   Value   Value   Value   Value
3   Value   Value   Value   Value
4   Value   Value   Value   Value
--------------------------------------------------------------------------------
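The same idea can be wrapped in a function that reads straight from an open file, so the data never has to be pasted into a string. This is only a sketch under the assumptions stated in the question (markers %T, %F, %R, a closing %E, tab-separated fields); read_tables and the sample content are made up for illustration:

```python
from io import StringIO

import pandas as pd


def read_tables(lines):
    """Parse lines in the %T/%F/%R format into {table name: DataFrame}.

    `lines` can be any iterable of strings, for example an open file.
    """
    raw = {}        # table name -> list of tab-separated rows, header first
    current = None  # row list of the table being read, None before the first %T
    for line in lines:
        line = line.rstrip("\r\n")
        marker, _, rest = line.partition("\t")
        if marker == "%T":
            current = raw.setdefault(rest.strip(), [])
        elif marker in ("%F", "%R") and current is not None:
            current.append(rest)
        elif marker == "%E":
            break  # end-of-data marker
        # Anything else (the free text at the top of the file) is ignored.
    return {
        name: pd.read_csv(StringIO("\n".join(rows)), sep="\t")
        for name, rows in raw.items()
        if rows  # skip tables that have no %F header line
    }


# Small made-up sample in the same format as the question.
sample = (
    "Random Text\tMore Random Text\n"
    "%T\tTABLE_1_NAME\n"
    "%F\tCOL_1.1\tCOL_1.2\n"
    "%R\ta\tb\n"
    "%R\tc\td\n"
    "%T\tTABLE_2_NAME\n"
    "%F\tCOL_2.1\tCOL_2.2\tCOL_2.3\n"
    "%R\te\tf\tg\n"
    "%E\n"
)
tables = read_tables(sample.splitlines())
```

For a real file you would pass the file object instead, e.g. open the file in a with block and call read_tables on it. Splitting on the first tab with partition, rather than on any whitespace, keeps an empty first field from being silently dropped.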