这是另一个IQR异常值问题。我有一个看起来像这样的数据框:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
我想找到并删除每种情况的异常值(例如,春季安慰剂,春季药物等)。不是整行,而是单元格。并希望针对每个“红色”,“黄色”,“绿色”列执行此操作。
是否有办法做到,而又不将数据框分解为一堆又一堆的子数据框,而所有条件都单独分解?我不确定如果将“季节”和“处理”作为列或索引来处理,是否会更容易。无论哪种方式我都很好。
我已经尝试过使用.iloc和.loc进行一些操作,但似乎无法使其正常工作。
如果需要通过缺失值替换离群值,请使用:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]