Pandas: conditional second-smallest value

Question

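How can I add the second-smallest value for each item to df, where only the values in that item's selected columns that are greater than the value in 'Col7' are considered?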

import pandas as pd
my_dict={'Item1':['Col1','Col3','Col6'],
'Item2':['Col2','Col4','Col6','Col8'],
'Item3':['Col1','Col3','Col6']
        }
df=pd.DataFrame({
            'Col0':['Item1','Item2','Item3'],
            'Col1':[20,25,28],
            'Col2':[89,15,35],
            'Col3':[36,30,96],
            'Col4':[40,108,13],
            'Col5':[55,2,9],
            'Col6':[35,38,27],
            'Col7':[30,20,39],
            })

The result should be:

df=pd.DataFrame({
            'Col0':['Item1','Item2','Item3'],
            'Col1':[20,25,28],
            'Col2':[89,15,35],
            'Col3':[36,30,96],
            'Col4':[40,108,13],
            'Col5':[55,2,9],
            'Col6':[35,38,27],
            'Col7':[30,20,39],
            'second min':[36,108,'NaN']
            })

Answer 1

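You can do this by iterating over the dictionary items, selecting the columns listed for each item, keeping only the values greater than the one in 'Col7', and then taking the second-smallest of the values that remain: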

import pandas as pd
import numpy as np

my_dict = {
    'Item1': ['Col1', 'Col3', 'Col6'],
    'Item2': ['Col2', 'Col4', 'Col6', 'Col8'],
    'Item3': ['Col1', 'Col3', 'Col6']
}

df = pd.DataFrame({
    'Col0': ['Item1', 'Item2', 'Item3'],
    'Col1': [20, 25, 28],
    'Col2': [89, 15, 35],
    'Col3': [36, 30, 96],
    'Col4': [40, 108, 13],
    'Col5': [55, 2, 9],
    'Col6': [35, 38, 27],
    'Col7': [30, 20, 39],
})

second_min_values = []

for item, cols in my_dict.items():
    selected_cols = [col for col in cols if col in df.columns and col != 'Col7']
    selected_values = df.loc[df['Col0'] == item, selected_cols].values.flatten()
    selected_values = [val for val in selected_values if val > df.loc[df['Col0'] == item, 'Col7'].values[0]]
    if len(selected_values) < 2:
        second_min_values.append(np.nan)  # fewer than two qualifying values
    else:
        second_min_values.append(np.partition(selected_values, 1)[1])

df['second min'] = second_min_values

print(df)
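
The loop above pairs results with rows by dictionary order, which only lines up when my_dict lists the items in the same order as df. As a minimal sketch, assuming the same my_dict and df as in the question, a row-wise helper can look up each row's columns through its 'Col0' value instead. The name second_min_above and its parameters are hypothetical, not part of pandas:

```python
import numpy as np
import pandas as pd

my_dict = {
    'Item1': ['Col1', 'Col3', 'Col6'],
    'Item2': ['Col2', 'Col4', 'Col6', 'Col8'],
    'Item3': ['Col1', 'Col3', 'Col6'],
}

df = pd.DataFrame({
    'Col0': ['Item1', 'Item2', 'Item3'],
    'Col1': [20, 25, 28],
    'Col2': [89, 15, 35],
    'Col3': [36, 30, 96],
    'Col4': [40, 108, 13],
    'Col5': [55, 2, 9],
    'Col6': [35, 38, 27],
    'Col7': [30, 20, 39],
})


def second_min_above(row, mapping=my_dict, threshold_col='Col7'):
    """Hypothetical helper: second-smallest mapped value above the threshold."""
    # Keep only mapped columns that actually exist in the row (Col8 does not).
    cols = [c for c in mapping.get(row['Col0'], []) if c in row.index]
    values = np.array([row[c] for c in cols], dtype=float)
    # Keep only values strictly greater than this row's threshold.
    values = values[values > row[threshold_col]]
    if values.size < 2:
        return np.nan  # not enough qualifying values
    return np.partition(values, 1)[1]


df['second min'] = df.apply(second_min_above, axis=1)
print(df)
```

Because the lookup is keyed on 'Col0', this version does not depend on row order, and returning np.nan keeps the new column numeric.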

Answer 2

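I would use a custom function with groupby.apply, and numpy.partition to get the second-smallest value (the include_groups argument requires pandas 2.2 or later):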

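# Assumes pd, np, df and my_dict as defined in the question
def get_nth(g, N=2):
    # Columns mapped to this item; names missing from df (e.g. Col8) become NaN
    tmp = g.reindex(columns=my_dict.get(g.name))
    # Mask values not greater than Col7, then take the N-th smallest per row
    return pd.Series(np.partition(tmp.where(tmp.gt(g['Col7'], axis=0)),
                                  N-1, axis=1)[:, N-1], index=g.index)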

df['second min'] = (df.groupby('Col0', group_keys=False)
                      .apply(get_nth, include_groups=False)
                    )

Output:

    Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7  second min
0  Item1    20    89    36    40    55    35    30        36.0
1  Item2    25    15    30   108     2    38    20       108.0
2  Item3    28    35    96    13     9    27    39         NaN
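
The NaN for Item3 comes from how NumPy orders NaN: sorting and partitioning place NaN after every real number, so when fewer than N values survive the mask, position N-1 holds NaN. A small sketch of that behaviour, using an array that mirrors the masked values for the three items (padded to a common width purely for illustration):

```python
import numpy as np

# Each row mirrors one item's selected values after masking:
# entries that are not greater than Col7 (or come from the missing Col8) are NaN.
masked = np.array([
    [np.nan, 36.0, 35.0, np.nan],    # Item1
    [np.nan, 108.0, 38.0, np.nan],   # Item2
    [np.nan, 96.0, np.nan, np.nan],  # Item3
])

# Partitioning on kth=1 puts each row's second-smallest value in column 1;
# NaN is treated as larger than any number, so it is pushed to the end.
second = np.partition(masked, 1, axis=1)[:, 1]
print(second)
```

Here second holds 36, 108 and NaN, matching the 'second min' column above.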
