Removing outliers based on column variables or a MultiIndex in a DataFrame

Question (votes: 0, answers: 1)

This is yet another IQR outlier question. I have a DataFrame that looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df

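I want to find and remove the outliers within each condition (e.g. Spring/Placebo, Spring/Drug, and so on). Not the whole row, just the individual cell. And I want to do this for each of the 'red', 'yellow', and 'green' columns.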

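Is there a way to do this without splitting the DataFrame into a pile of sub-DataFrames, one for each combination of conditions? I'm not sure whether it would be easier with 'Season' and 'Treatment' as columns or as index levels. I'm fine with either.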

I've tried a few things with .iloc and .loc, but I can't seem to get them to work.

python-3.x pandas dataframe multi-index outliers
1 answer (votes: 0)

If you need to replace the outliers with missing values, use:

import numpy as np
import pandas as pd

np.random.seed(2020)  # fix the seed so the random data is reproducible
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]

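# Per-group 5th and 95th percentiles, broadcast back to the original rows.
# Note: these are percentile cutoffs, not the 1.5 * IQR rule.
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)

# Value columns to check (everything except the grouping keys)
c = df.columns.difference(['Season','Treatment'])
# True where a cell falls outside its own group's percentile range
mask = df[c].lt(df1) | df[c].gt(df2)
# Replace only the flagged cells with NaN; the rest of each row is kept
df[c] = df[c].mask(mask)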

print (df)
    Season Treatment   red  yellow  green
0   Spring   Placebo   NaN     NaN   67.0
1   Spring   Placebo  67.0    91.0    3.0
2   Spring   Placebo  71.0    56.0   29.0
3   Spring   Placebo  48.0    32.0   24.0
4   Spring   Placebo  74.0     9.0   51.0
..     ...       ...   ...     ...    ...
95    Fall      Drug  90.0    35.0   55.0
96    Fall      Drug  40.0    55.0   90.0
97    Fall      Drug   NaN    54.0    NaN
98    Fall      Drug  28.0    50.0   74.0
99    Fall      Drug   NaN    73.0   11.0

[100 rows x 5 columns]
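
The question mentions IQR, while the code above uses the 5th and 95th percentiles as cutoffs. If the usual IQR rule is what you want, the same groupby/transform/mask pattern should carry over. Below is a minimal sketch, assuming fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR; the helper name `mask_iqr_outliers`, its parameters, and the small `demo` frame are made up for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd


def mask_iqr_outliers(df, group_cols, value_cols, k=1.5):
    """Return a copy of df with per-group IQR outliers in value_cols set to NaN.

    Hypothetical helper: a cell counts as an outlier when it lies below
    Q1 - k * IQR or above Q3 + k * IQR of its own group.
    """
    out = df.copy()
    g = out.groupby(group_cols)[value_cols]

    # Per-group quartiles, broadcast back to the shape of the value columns
    q1 = g.transform(lambda s: s.quantile(0.25))
    q3 = g.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    lower = q1 - k * iqr
    upper = q3 + k * iqr

    vals = out[value_cols]
    outlier = vals.lt(lower) | vals.gt(upper)

    # Blank out only the offending cells, never the whole row
    out[value_cols] = vals.mask(outlier)
    return out


# Tiny illustrative frame: one obvious outlier (100) in the Spring/Placebo group
demo = pd.DataFrame({
    "Season": ["Spring"] * 6 + ["Fall"] * 6,
    "Treatment": ["Placebo"] * 6 + ["Drug"] * 6,
    "red": [10, 11, 12, 13, 14, 100, 50, 51, 52, 53, 54, 55],
})

cleaned = mask_iqr_outliers(demo, ["Season", "Treatment"], ["red"])
print(cleaned)
```

If 'Season' and 'Treatment' are index levels rather than columns, passing the level names as `group_cols` should behave the same way, since `groupby` accepts index level names as keys.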