Remove duplicate rows from a DataFrame based on a condition

Question | Votes: 0 | Answers: 1

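I have a DataFrame that contains some duplicate rows (duplicated on two columns, t1 and t2), but I want to keep only one row per set of duplicates: the one with the lowest value of a score computed from three other columns, n, m and c.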

import pandas as pd
df = pd.DataFrame({
   "t1": [1, 1, 1, 1, 1, 1],
   "t2": [1, 2, 2, 3, 4, 4],
   "x": [1.01, 0.66, 1.01, 0.45, 0.89, 0.64],
   "y": [0.23, 0.31, 0.06, 1.12, 0.70, 0.60],
   "z": [0.06, 1.07, 0.12, 0.20, 0.62, 0.68],
   "n": [6, 6, 7, 6, 7, 7],
   "m": [0.21, 1.19, 0.81, 1.18, 0.28, 0.67],
   "c": [64.4, 64.4, 63.2, 65.6, 63.2, 63.2]
})

Rows 1 and 2 (by index label, counting from 0) are duplicates of each other, as are rows 4 and 5. With the score defined as

w = (12/df['n'])*0.4 + (df['m']/0.35)*0.2 + (df['c']/150)*0.4

for each set of duplicates I want to keep the row with the lowest w (result shown below).
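To make the criterion concrete, here is a minimal sketch that computes w as a standalone Series and prints it next to the duplicate keys. It assumes the same data as above; the columns x, y and z are left out only because they do not enter the formula.

```python
import pandas as pd

# Same data as in the question, minus x, y, z (not used by the formula).
df = pd.DataFrame({
    "t1": [1, 1, 1, 1, 1, 1],
    "t2": [1, 2, 2, 3, 4, 4],
    "n": [6, 6, 7, 6, 7, 7],
    "m": [0.21, 1.19, 0.81, 1.18, 0.28, 0.67],
    "c": [64.4, 64.4, 63.2, 65.6, 63.2, 63.2],
})

# Same formula as above, kept as a Series rather than a column of df.
w = (12 / df["n"]) * 0.4 + (df["m"] / 0.35) * 0.2 + (df["c"] / 150) * 0.4

# Show w beside the keys to see which row should win in each group.
print(pd.concat([df[["t1", "t2"]], w.rename("w").round(4)], axis=1))
```

In the (1, 2) group, row 2 has the lower w (about 1.32 versus 1.65 for row 1), and in the (1, 4) group, row 4 has the lower w (about 1.01 versus 1.24 for row 5), so rows 1 and 5 are the ones to drop.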

I can drop the unwanted rows with the following code, which gives me the final df I am after.

# adding a column with temporary values
df['w'] = (12/df['n'])*0.4 + (df['m']/0.35)*0.2 + (df['c']/150)*0.4

# create a df with the duplicated rows
dfd = df[df.duplicated(['t1', 't2'], keep=False)]

# initializing a list with rows (indexes) to drop
rows_to_drop = []

# groupby returns a group (g) and df (dfg)
for g, dfg in df.groupby(['t1', 't2']):
    # only groups with two or more rows
    if len(dfg) > 1:
        # get the index of the row with highest w, the one to drop
        idx = dfg[dfg['w'] == dfg['w'].max()].index
        rows_to_drop.append(idx[0])

# drop the rows
df = df.drop(index=rows_to_drop)

But the code feels cumbersome. For example, I add a temporary column w just to hold the values being compared.

I would appreciate suggestions on how to improve this.

python pandas dataframe duplicates
1 answer
0 votes

You can use groupby.idxmin on the Series w from the question (the standalone Series, not a column added to df):

out = df.loc[w.groupby([df['t1'], df['t2']]).idxmin()]

Output:

   t1  t2     x     y     z  n     m     c
0   1   1  1.01  0.23  0.06  6  0.21  64.4
2   1   2  1.01  0.06  0.12  7  0.81  63.2
3   1   3  0.45  1.12  0.20  6  1.18  65.6
4   1   4  0.89  0.70  0.62  7  0.28  63.2
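For completeness, here is a self-contained sketch of the same idea wrapped in a small function, followed by an alternative that sorts by w and uses drop_duplicates. The name keep_lowest is hypothetical and chosen only for illustration; the sort-based variant is a different technique from the one in the answer, shown here as a cross-check.

```python
import pandas as pd

def keep_lowest(df, keys, score):
    """Keep, for each combination of `keys`, the row with the smallest `score`.

    Hypothetical helper for illustration; `score` must be a Series that
    shares df's index.
    """
    # idxmin gives, per group, the index label of the minimum score
    # (the first occurrence if several rows tie).
    winners = score.groupby([df[k] for k in keys]).idxmin()
    return df.loc[winners]

df = pd.DataFrame({
    "t1": [1, 1, 1, 1, 1, 1],
    "t2": [1, 2, 2, 3, 4, 4],
    "x": [1.01, 0.66, 1.01, 0.45, 0.89, 0.64],
    "y": [0.23, 0.31, 0.06, 1.12, 0.70, 0.60],
    "z": [0.06, 1.07, 0.12, 0.20, 0.62, 0.68],
    "n": [6, 6, 7, 6, 7, 7],
    "m": [0.21, 1.19, 0.81, 1.18, 0.28, 0.67],
    "c": [64.4, 64.4, 63.2, 65.6, 63.2, 63.2],
})
w = (12 / df["n"]) * 0.4 + (df["m"] / 0.35) * 0.2 + (df["c"] / 150) * 0.4

out = keep_lowest(df, ["t1", "t2"], w)
print(out.index.tolist())  # [0, 2, 3, 4]

# Alternative: order the rows by w, keep the first row per key,
# then restore the original row order.
alt = (
    df.loc[w.sort_values().index]
    .drop_duplicates(subset=["t1", "t2"])
    .sort_index()
)
print(alt.index.tolist())  # [0, 2, 3, 4]
```

Neither version adds a temporary column to df. Both keep exactly one row per (t1, t2) group however many duplicates it has, whereas the loop in the question drops only one row per group and would leave extra rows behind if a key appeared three or more times.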