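I have a DataFrame with some duplicated rows (duplicated on two columns, t1 and t2), but I want to keep only one row per duplicate group: the one with the lowest value of a score computed from three other columns, n, m and c.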
import pandas as pd
df = pd.DataFrame({
    "t1": [1, 1, 1, 1, 1, 1],
    "t2": [1, 2, 2, 3, 4, 4],
    "x": [1.01, 0.66, 1.01, 0.45, 0.89, 0.64],
    "y": [0.23, 0.31, 0.06, 1.12, 0.70, 0.60],
    "z": [0.06, 1.07, 0.12, 0.20, 0.62, 0.68],
    "n": [6, 6, 7, 6, 7, 7],
    "m": [0.21, 1.19, 0.81, 1.18, 0.28, 0.67],
    "c": [64.4, 64.4, 63.2, 65.6, 63.2, 63.2]
})
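Rows 1 and 2 (by index label) are duplicates of each other, and so are rows 4 and 5. With the score computed as
w = (12/df['n'])*0.4 + (df['m']/0.35)*0.2 + (df['c']/150)*0.4
I want to keep, for each set of duplicates, the row with the lowest w (result shown below).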
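As a quick check of which row wins in each duplicate pair, here is a minimal sketch that evaluates the score on the sample data. It assumes only the columns the formula needs (x, y and z are left out for brevity), and `w_rounded` is a name introduced here for illustration.

```python
import pandas as pd

# Only the key columns and the columns used by the score are included here.
df = pd.DataFrame({
    "t1": [1, 1, 1, 1, 1, 1],
    "t2": [1, 2, 2, 3, 4, 4],
    "n": [6, 6, 7, 6, 7, 7],
    "m": [0.21, 1.19, 0.81, 1.18, 0.28, 0.67],
    "c": [64.4, 64.4, 63.2, 65.6, 63.2, 63.2],
})

# The score as a standalone Series: it shares df's index but is not stored in df.
w = (12 / df["n"]) * 0.4 + (df["m"] / 0.35) * 0.2 + (df["c"] / 150) * 0.4

w_rounded = w.round(3).tolist()
print(w_rounded)
# [1.092, 1.652, 1.317, 1.649, 1.014, 1.237]
```

Within the pair (1, 2), row 2 has the lower score, and within the pair (4, 5), row 4 does, so rows 1 and 5 are the ones to discard.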
I can remove the unwanted rows with the following code, which gives me the desired final df.
# adding a column with temporary values
df['w'] = (12/df['n'])*0.4 + (df['m']/0.35)*0.2 + (df['c']/150)*0.4
# create a df with the duplicated rows
dfd = df[df.duplicated(['t1', 't2'], keep=False) == True]
# initializing a list with rows (indexes) to drop
rows_to_drop = []
# groupby returns a group (g) and df (dfg)
for g, dfg in df.groupby(['t1', 't2']):
    # only groups with two or more rows
    if len(dfg) > 1:
        # get the index of the row with highest w, the one to drop
        idx = dfg[dfg['w'] == dfg['w'].max()].index
        rows_to_drop.append(idx[0])
# drop the rows
df = df.drop(index=rows_to_drop)
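But the code feels cumbersome. For example, I add a temporary column w only to hold the values to compare.
I would appreciate suggestions on how to improve this.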
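One caveat about the loop above, beyond its length: it drops only the single highest-w row of each group, so it leaves exactly one row per group only when no group has more than two rows. The sketch below uses hypothetical data (the `demo` frame and its w values are invented for illustration) with a three-row group, and uses `idxmax` as a shorthand for the "first row with the maximum w" lookup in the loop.

```python
import pandas as pd

# Hypothetical data: group (1, 1) has three rows; w is a precomputed score.
demo = pd.DataFrame({
    "t1": [1, 1, 1, 1],
    "t2": [1, 1, 1, 2],
    "w": [3.0, 1.0, 2.0, 5.0],
})

# Loop-style logic: drop only the highest-w row of each multi-row group.
rows_to_drop = [
    g["w"].idxmax()
    for _, g in demo.groupby(["t1", "t2"])
    if len(g) > 1
]
after_loop = demo.drop(index=rows_to_drop)

# Selecting the minimum instead keeps exactly one row per group.
keep_min = demo.loc[demo.groupby(["t1", "t2"])["w"].idxmin()]

print(after_loop.index.tolist())  # [1, 2, 3]: group (1, 1) still has two rows
print(keep_min.index.tolist())    # [1, 3]
```

On the sample data in this question every duplicate group has exactly two rows, so both approaches agree there.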
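Use groupby.idxmin. With w computed as a standalone Series (as in the formula from the question), group it by t1 and t2, take the index label of the smallest w in each group, and select those rows with df.loc. No temporary column is needed: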
out = df.loc[w.groupby([df['t1'], df['t2']]).idxmin()]
Output:
   t1  t2     x     y     z  n     m     c
0   1   1  1.01  0.23  0.06  6  0.21  64.4
2   1   2  1.01  0.06  0.12  7  0.81  63.2
3   1   3  0.45  1.12  0.20  6  1.18  65.6
4   1   4  0.89  0.70  0.62  7  0.28  63.2
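For reference, here is a self-contained sketch that runs the idxmin approach end to end and compares it with one possible alternative: ordering the rows by w and keeping the first row per (t1, t2) pair with `drop_duplicates`. The alternative is not part of the answer above, and the names `out_idxmin`, `out_sorted` and `order` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "t1": [1, 1, 1, 1, 1, 1],
    "t2": [1, 2, 2, 3, 4, 4],
    "x": [1.01, 0.66, 1.01, 0.45, 0.89, 0.64],
    "y": [0.23, 0.31, 0.06, 1.12, 0.70, 0.60],
    "z": [0.06, 1.07, 0.12, 0.20, 0.62, 0.68],
    "n": [6, 6, 7, 6, 7, 7],
    "m": [0.21, 1.19, 0.81, 1.18, 0.28, 0.67],
    "c": [64.4, 64.4, 63.2, 65.6, 63.2, 63.2],
})

# Score kept as a separate Series, so df never gains a temporary column.
w = (12 / df["n"]) * 0.4 + (df["m"] / 0.35) * 0.2 + (df["c"] / 150) * 0.4

# Approach 1: index label of the minimum w within each (t1, t2) group.
out_idxmin = df.loc[w.groupby([df["t1"], df["t2"]]).idxmin()]

# Approach 2 (alternative): order rows by ascending w with a stable sort,
# keep the first row of each (t1, t2) pair, then restore the original order.
order = w.sort_values(kind="mergesort").index
out_sorted = df.loc[order].drop_duplicates(["t1", "t2"]).sort_index()

print(out_idxmin.index.tolist())  # [0, 2, 3, 4]
print(out_sorted.index.tolist())  # [0, 2, 3, 4]
print("w" in df.columns)          # False
```

Both variants return the same four rows on this data. The idxmin version is shorter, while the sort-based one may be easier to adapt if rows should be ranked by more than one criterion.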