Consider the following multi-indexed pd.DataFrame, which has many missing values.
import numpy as np
import pandas as pd
# Create multi-index
index = pd.MultiIndex.from_tuples(
    [
        ("A", "X", "I"),
        ("A", "X", "I"),
        ("A", "Y", "I"),
        ("A", "Y", "II"),
        ("A", "Y", "I"),
    ],
    names=["level_1", "level_2", "level_3"],
)
# Create dataframe
data = [[1, np.nan], [np.nan, 1], [np.nan, 1], [np.nan, 1], [1, np.nan]]
df = pd.DataFrame(data, index=index, columns=["column1", "column2"])
print(df)
                         column1  column2
level_1 level_2 level_3
A       X       I            1.0      NaN
                I            NaN      1.0
        Y       I            NaN      1.0
                II           NaN      1.0
                I            1.0      NaN
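How can I collapse the rows as far as possible, so that each index combination takes up as few rows as it can? I am looking for the following result: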
                         column1  column2
level_1 level_2 level_3
A       X       I            1.0      1.0
        Y       I            1.0      1.0
                II           NaN      1.0
If aggregating the values for each index combination is acceptable, you can use an aggregation such as mean:
df = df.groupby(level=[0,1,2]).mean()
print(df)
                         column1  column2
level_1 level_2 level_3
A       X       I            1.0      1.0
        Y       I            1.0      1.0
                II           NaN      1.0
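Note that mean collapses each group to a single row because it skips NaN, and that it also averages any group holding more than one real value in a column. Here is a minimal sketch of that behavior; the frame `demo` and its values are made up for illustration and are not the data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the ("A", "X", "I") group holds two real values in column1
index = pd.MultiIndex.from_tuples(
    [("A", "X", "I"), ("A", "X", "I"), ("A", "Y", "II")],
    names=["level_1", "level_2", "level_3"],
)
demo = pd.DataFrame(
    [[1.0, np.nan], [3.0, 1.0], [np.nan, 1.0]],
    index=index,
    columns=["column1", "column2"],
)

# mean() ignores NaN within each group; an all-NaN group column stays NaN
collapsed = demo.groupby(level=[0, 1, 2]).mean()
print(collapsed)
```

In this sketch the two column1 values 1.0 and 3.0 are merged into their average, so the aggregation route only fits when losing the individual values is fine.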
If you want to avoid aggregation:
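# Within each group, push the NaNs to the bottom of every column,
# then drop the rows that are left entirely NaN
f = lambda g: g.apply(lambda col: col.sort_values(key=lambda s: s.isna()))
df = df.groupby(level=[0, 1, 2], group_keys=False).apply(f).dropna(how='all')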
print(df)
                         column1  column2
level_1 level_2 level_3
A       X       I            1.0      1.0
        Y       I            1.0      1.0
                II           NaN      1.0
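The difference from the aggregating approach shows up when a group has several real values in the same column. Below is a rough, self-contained sketch of the same idea, written with a named helper instead of nested lambdas. The helper `push_nans_down`, the frame `demo`, and its values are assumptions made for illustration, and `kind="stable"` is added so the non-NaN values keep their original relative order:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the ("A", "X", "I") group holds two real values in column1
index = pd.MultiIndex.from_tuples(
    [("A", "X", "I"), ("A", "X", "I"), ("A", "X", "I"), ("A", "Y", "II")],
    names=["level_1", "level_2", "level_3"],
)
demo = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 2.0], [3.0, np.nan], [np.nan, 4.0]],
    index=index,
    columns=["column1", "column2"],
)


def push_nans_down(group):
    """Reorder each column independently so that non-NaN values come first."""
    values = {
        col: group[col]
        .sort_values(key=lambda s: s.isna(), kind="stable")
        .to_numpy()
        for col in group.columns
    }
    # Every row of a group carries the same index tuple, so the index is reused as-is
    return pd.DataFrame(values, index=group.index)


result = (
    demo.groupby(level=[0, 1, 2], group_keys=False)
    .apply(push_nans_down)
    .dropna(how="all")  # remove the rows that ended up entirely NaN
)
print(result)
```

Here the ("A", "X", "I") group shrinks from three rows to two and keeps both 1.0 and 3.0 in column1, whereas mean would have replaced them with 2.0.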