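I am reading data from an xlsx file using the headers in the first row (which holds the index names) and in the first column (which holds the dates on which the indices were obtained, along with their types). The screenshot shows how the data is organized:
I figured out how to build a pandas DataFrame that reads a single index. The result is a DataFrame of this type:
I can't work out how to read all the indices correctly in one go, for example with a loop or, better, a list comprehension.
Here is my solution. It partially works, but I can't see how to iterate correctly over
f'{index_names[1]}_{val}'
so that it covers all the indices rather than just one.
I also can't work out how to change the sheet['C' + str(item)]
entry so that it iterates over all the indices rather than just one.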
characteristic = [100, 200, 300, 400, 500, 600, 700]
index_names = [sheet[1][row].value for row in range(1, sheet.max_row)
               if sheet[1][row].value is not None]
index_list = [pd.DataFrame(
                  {f'{index_names[1]}_{val}': [sheet['C' + str(item)].value
                                               for item in range(1, 26)
                                               if sheet['A' + str(item)].value == val]
                   for val in characteristic},
                  index=['April 12', 'April 20', 'April 29']
              ) for _ in range(39)]
My code probably looks cumbersome and could be simplified.
UPD: if we add
index_list[0].to_dict('tight')
the result is as follows:
{'index': ['April 12', 'April 20', 'April 29'],
'columns': ['Second_index_100',
'Second_index_200',
'Second_index_300',
'Second_index_400',
'Second_index_500',
'Second_index_600',
'Second_index_700'],
'data': [[0.43927605317127927,
-0.24029588928209195,
0.26450969805682434,
0.18810770500537646,
0.26586690176009525,
0.21631310872586834,
0.32927840726651636],
[0.16442875037513777,
0.12442062805633937,
0.06353459713174614,
0.14329091121735923,
0.17469551024592245,
0.20938555077590043,
0.17154589574351475],
[0.4615041268976439,
0.6488484892496023,
0.28007883537118355,
0.5962923255606478,
0.5924116517116391,
0.559117121673802,
0.6458160644845848]],
'index_names': [None],
'column_names': [None]}
Assuming the input looks like this after import (with
df = pd.read_excel('input.xlsx', index_col=0)
):
First_index Second_index Third_index
April 12 NaN NaN NaN
100 1.0 2.0 3.0
200 4.0 5.0 6.0
300 7.0 8.0 9.0
400 10.0 11.0 12.0
500 13.0 14.0 15.0
600 16.0 17.0 18.0
700 19.0 20.0 21.0
April 20 NaN NaN NaN
100 22.0 23.0 24.0
200 25.0 26.0 27.0
300 28.0 29.0 30.0
400 31.0 32.0 33.0
500 34.0 35.0 36.0
600 37.0 38.0 39.0
700 40.0 41.0 42.0
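To experiment without the Excel file, the frame shown above can be rebuilt in memory. This is only a sketch: it assumes the date labels are plain strings and the characteristic values are integers, which may not match what read_excel returns for a real workbook.

```python
import numpy as np
import pandas as pd

# Assumed layout: one all-NaN header row per date, followed by the
# seven characteristic rows for that date.
dates = ['April 12', 'April 20']
characteristic = [100, 200, 300, 400, 500, 600, 700]
columns = ['First_index', 'Second_index', 'Third_index']

labels, rows = [], []
value = 1.0
for date in dates:
    labels.append(date)
    rows.append([np.nan] * len(columns))      # date header row carries no values
    for c in characteristic:
        labels.append(c)
        rows.append([value, value + 1, value + 2])
        value += 3

df = pd.DataFrame(rows, index=labels, columns=columns)
print(df)
```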
You can reshape it with pivot:
# move index back to column (only if not already a column)
# if already a column, use its name in the following code
# instead of "index"
tmp = df.reset_index()
# identify the date header rows (all-NaN) that start each block
# you could also use pd.to_numeric/pd.to_datetime on the "index"
m = tmp['First_index'].isna()
# reshape: tag every value row with its date, then pivot
out = (tmp[~m].assign(idx=tmp['index'].where(m).ffill())
              .pivot(index='idx', columns='index')
              .rename_axis(None)
       )
# flatten the column MultiIndex
out.columns = out.columns.map(lambda x: f'{x[0]}_{x[1]}')
Output:
First_index_100 First_index_200 First_index_300 First_index_400 First_index_500 First_index_600 First_index_700 Second_index_100 Second_index_200 Second_index_300 Second_index_400 Second_index_500 Second_index_600 Second_index_700 Third_index_100 Third_index_200 Third_index_300 Third_index_400 Third_index_500 Third_index_600 Third_index_700
April 12 1.0 4.0 7.0 10.0 13.0 16.0 19.0 2.0 5.0 8.0 11.0 14.0 17.0 20.0 3.0 6.0 9.0 12.0 15.0 18.0 21.0
April 20 22.0 25.0 28.0 31.0 34.0 37.0 40.0 23.0 26.0 29.0 32.0 35.0 38.0 41.0 24.0 27.0 30.0 33.0 36.0 39.0 42.0
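If one DataFrame per index is wanted, as in the question's index_list, the pivoted frame can be split on the top column level before flattening. The sketch below is an assumption-laden variant, not the answer's exact code: it uses a shortened in-memory stand-in for the sheet, detects the date rows with pd.to_numeric (which assumes the date labels are non-numeric strings), and stores the pieces in a hypothetical dict called frames.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: same layout as the sample input,
# shortened to three characteristic values per date.
columns = ['First_index', 'Second_index', 'Third_index']
labels = ['April 12', 100, 200, 300, 'April 20', 100, 200, 300]
data = [[np.nan] * 3,
        [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
        [np.nan] * 3,
        [10.0, 11.0, 12.0], [13.0, 14.0, 15.0], [16.0, 17.0, 18.0]]
df = pd.DataFrame(data, index=labels, columns=columns)

tmp = df.reset_index()

# Labels that cannot be parsed as numbers are treated as date headers.
is_date = pd.to_numeric(tmp['index'], errors='coerce').isna()

# Tag each value row with its date, then pivot; columns stay a MultiIndex.
wide = (tmp[~is_date]
        .assign(idx=tmp['index'].where(is_date).ffill())
        .pivot(index='idx', columns='index')
        .rename_axis(None))

# One DataFrame per original column, labelled "<name>_<value>".
frames = {}
for name in columns:
    part = wide[name].copy()
    part.columns = [f'{name}_{val}' for val in part.columns]
    frames[name] = part

print(frames['Second_index'])
```

Keeping the MultiIndex until the split means each index is selected with wide[name], so no hard-coded column letters such as 'C' are needed.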