I have an Excel file whose header names span the first 3 rows. I want to read it with pandas, but I am struggling with the multi-index header.
                PLAN 2023
                Traffic per channel                  Traffic Share per Channel
month   week    All Traffic   red   green   orange   red   green   orange
jan     1       100           50    30      20       50%   30%     20%
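For "month" and "week", the header name is stored in row 3 only, but for the other columns it is spread across rows 1, 2 and 3. Also, the row numbers are not fixed, so I need to read the data by header name.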
The final expected output should look like this:
month week plan_2023_Traffic_per_channel_All .....plan_2023_Traffic_Share_per_channel_orange
jan 1 100 20%
My script is below. For simplicity, I am only printing one value.
import pandas as pd
# Load the Excel file
df = pd.read_excel('test_3.xlsx', sheet_name='WEEK - 2023', header=None)
# Set the first 3 rows as the header
header = df.iloc[:3,:].fillna(method='ffill', axis=1)
df.columns = pd.MultiIndex.from_arrays(header.values)
df = df.iloc[3:,:]
# Select only the specified columns
df = df.loc[:, ('month', 'week', ('PLAN 2023', 'Traffic per channel', 'red'))]
# Rename the columns to remove the multi-level header
df.columns = ['month', 'week', 'P_traffic_red']
# Print the final data frame
print(df)
Thanks in advance.
You can try:
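# Read the sheet without a header so the 3 header rows arrive as ordinary data
df = pd.read_excel('test_3.xlsx', header=None)
# Forward-fill merged header cells across columns, then join each column's non-empty levels with '_'
cols = df.iloc[:3].ffill(axis=1).apply(lambda x: '_'.join(x.dropna()))
# Drop the header rows and use the joined names as the column labels
df = df.iloc[3:].set_axis(cols, axis=1)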
Output:
>>> df
statMonthName statWeek Plan 2023_Traffic per channel_All Traffic ... Plan 2023_Traffic Share per Chanel_red Plan 2023_Traffic Share per Chanel_green Plan 2023_Traffic Share per Chanel_orange
3 jan 1 100 ... 50% 30% 20%
[1 rows x 9 columns]
>>> df.columns
Index(['statMonthName', 'statWeek',
'Plan 2023_Traffic per channel_All Traffic',
'Plan 2023_Traffic per channel_red',
'Plan 2023_Traffic per channel_green',
'Plan 2023_Traffic per channel_orange',
'Plan 2023_Traffic Share per Chanel_red',
'Plan 2023_Traffic Share per Chanel_green',
'Plan 2023_Traffic Share per Chanel_orange'],
dtype='object')
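Since the question mentions that the header rows are not always at the top, here is a minimal sketch of the same forward-fill-and-join idea that first locates the header. It runs on an in-memory frame that mimics the question's sheet, so no Excel file is needed. The helper `flatten_header`, its `anchor` and `n_levels` parameters, and the `raw` frame are hypothetical names introduced for illustration. The sketch assumes the label "month" appears exactly once in the first column and that the header always spans 3 rows ending on that row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., header=None): merged header
# cells show up as NaN everywhere except their first cell.
raw = pd.DataFrame([
    [np.nan, np.nan, "PLAN 2023", np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, "Traffic per channel", np.nan, np.nan, np.nan,
     "Traffic Share per Channel", np.nan, np.nan],
    ["month", "week", "All Traffic", "red", "green", "orange", "red", "green", "orange"],
    ["jan", 1, 100, 50, 30, 20, "50%", "30%", "20%"],
])


def flatten_header(raw, anchor="month", n_levels=3):
    """Find the last header row via `anchor`, then join the header levels."""
    # Position of the row whose first column holds the anchor label
    # (assumption: the label occurs exactly once in that column).
    last = int(np.flatnonzero((raw.iloc[:, 0] == anchor).to_numpy())[0])
    first = last - n_levels + 1
    # Forward-fill merged cells to the right, as in the answer above.
    header = raw.iloc[first:last + 1].ffill(axis=1)
    # Join the non-empty levels of each column; astype(str) guards against
    # numeric header cells.
    names = header.apply(lambda col: "_".join(col.dropna().astype(str)))
    names = names.str.replace(" ", "_")
    # Everything below the header is data.
    body = raw.iloc[last + 1:].reset_index(drop=True)
    return body.set_axis(names, axis=1)


flat = flatten_header(raw)
print(flat.columns.tolist())
```

With a real workbook, `raw` would come from `pd.read_excel('test_3.xlsx', sheet_name='WEEK - 2023', header=None)`. Replacing spaces with underscores brings the names closer to the format requested in the question, although the original letter case is kept.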