I have a DataFrame that looks like this:
oph_single_positive_outlier oph_single_negative_outlier oph_1pos_1neg_outlier oph_multipos_at_once_outlier
bucket
2023-01-01 00:05:00 1 1 1 1
2023-01-01 00:06:00 2 2 10 10
2023-01-01 00:07:00 3 3 3 10
2023-01-01 00:08:00 10 2 2 10
2023-01-01 00:09:00 4 4 4 4
2023-01-01 00:10:00 5 5 5 5
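For anyone who wants to reproduce this, the frame above could be rebuilt roughly as follows. The `data` dict here is reconstructed from the printed table and is an assumption, since its original definition is not shown; `example_df` is just an illustrative name.

```python
import pandas as pd

# Assumption: `data` is rebuilt from the printed table; the original dict is not shown.
data = {
    "bucket": pd.date_range("2023-01-01 00:05", periods=6, freq="min"),
    "oph_single_positive_outlier": [1, 2, 3, 10, 4, 5],
    "oph_single_negative_outlier": [1, 2, 3, 2, 4, 5],
    "oph_1pos_1neg_outlier": [1, 10, 3, 2, 4, 5],
    "oph_multipos_at_once_outlier": [1, 10, 10, 10, 4, 5],
}

# Use the timestamp column as the index, as in the printed frame.
example_df = pd.DataFrame(data).set_index("bucket")
print(example_df)
```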
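I would like to know whether there is a way to remove the outliers. For example, in oph_single_positive_outlier the outlier is the value 10.
In oph_single_negative_outlier, the fourth row should be removed.
In oph_1pos_1neg_outlier, the second row (10) and the fourth row (2) should be removed.
Basically, every row where the value decreases or increases rapidly should be removed.
Afterwards, I plot each column.
I tried it with this code: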
orig_df = pd.DataFrame(data)
# Set the 'bucket' column as the index
orig_df.set_index('bucket', inplace=True)
print("orignal DataFrame")
display(orig_df)
# Remove invalid data
orig_df = orig_df.dropna()
outliers_single_df = pd.DataFrame()
column = "oph_single_negative_outlier"
columns = orig_df.columns
df = orig_df.pop(columns[0]).to_frame()
df_len_at_start = df.shape[0]
while True:
    df['DifferencePumping'] = pd.to_timedelta(df[columns[0]].diff(), unit="h")
    df['DifferencePumpingShifted'] = df['DifferencePumping'].shift(-1)
    df['bucket'] = df.index
    df['DifferenceBucket'] = df["bucket"].diff()
    df['DifferenceBucketShifted'] = df['DifferenceBucket'].shift(-1)
    df["decreasing"] = df["DifferencePumping"] < pd.Timedelta(0)
    df["rapid_increase"] = df["DifferenceBucketShifted"] < df["DifferencePumpingShifted"] - pd.Timedelta(seconds=6)
    df["potential_outliers"] = df.decreasing | df.rapid_increase
    try:
        outliers_single_df = pd.concat([outliers_single_df, df.loc[df.potential_outliers].head(1)])
        df = df.drop(df.loc[df.potential_outliers].index[0])
    except IndexError as index_error:
        df = df.drop(columns=['DifferencePumping', 'DifferencePumpingShifted', "bucket", "DifferenceBucket", "DifferenceBucketShifted", "decreasing", "rapid_increase", "potential_outliers"])
        print("Filtered all outliers.")
        break
But this is the output I get:
Filtered all outliers.
oph_single_positive_outlier
bucket
2023-01-01 00:08:00 10
Removed outliers:
oph_single_positive_outlier DifferencePumping DifferencePumpingShifted bucket DifferenceBucket DifferenceBucketShifted decreasing rapid_increase potential_outliers
bucket
2023-01-01 00:05:00 1 NaT 0 days 01:00:00 2023-01-01 00:05:00 NaT 0 days 00:01:00 False True True
2023-01-01 00:06:00 2 NaT 0 days 01:00:00 2023-01-01 00:06:00 NaT 0 days 00:01:00 False True True
2023-01-01 00:07:00 3 NaT 0 days 07:00:00 2023-01-01 00:07:00 NaT 0 days 00:01:00 False True True
2023-01-01 00:09:00 4 -1 days +18:00:00 0 days 01:00:00 2023-01-01 00:09:00 0 days 00:01:00 0 days 00:01:00 True True True
2023-01-01 00:10:00 5 -1 days +19:00:00 NaT 2023-01-01 00:10:00 0 days 00:02:00 NaT True False True
Original df
oph_single_negative_outlier oph_1pos_1neg_outlier oph_multipos_at_once_outlier
bucket
2023-01-01 00:05:00 1 1 1
2023-01-01 00:06:00 2 10 10
2023-01-01 00:07:00 3 3 10
2023-01-01 00:08:00 2 2 10
2023-01-01 00:09:00 4 4 4
2023-01-01 00:10:00 5 5 5
cleaned df
oph_single_positive_outlier
bucket
2023-01-01 00:08:00 10
Deleted 5 rows.
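Is there a specific reason you want to use the rapid increase/decrease approach rather than the IQR method to remove the outliers? I think the result is the same for this small dataset: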
import pandas as pd

# `data` is the same input used in the question
df = pd.DataFrame(data)
df['bucket'] = pd.to_datetime(df['bucket'])
df.set_index('bucket', inplace=True)

# IQR method
def remove_outliers_iqr(df, column_name, multiplier=1.5):
    # Keep only the rows whose value lies within [q1 - k*iqr, q3 + k*iqr]
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]

columns_to_process = df.columns
# Each pass drops whole rows, so later columns are filtered on the already reduced frame
for column in columns_to_process:
    df = remove_outliers_iqr(df, column)
print(df)
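If the decrease/rapid-increase rule from the question is what you actually need, one possible sketch is a per-column forward pass that compares each value with the last value it kept. This is a different technique from the IQR filter: IQR only looks at the distribution of values, not their order, so a dip such as the 2 in oph_single_negative_outlier would not be flagged by it on that column alone. The helper name `drop_jumps`, the `max_rate_per_min` threshold of 1 unit per minute, and the assumption that the first observation of each column is valid are all mine, not from the question; the sample frame is rebuilt from the printed table.

```python
import pandas as pd

def drop_jumps(series, max_rate_per_min=1.0):
    """Keep values that neither decrease nor rise faster than
    max_rate_per_min relative to the last kept value (hypothetical helper)."""
    kept = []
    last_time = last_value = None
    for time, value in series.dropna().items():
        if last_time is None:
            # Assumption: the first observation is trusted.
            kept.append(time)
            last_time, last_value = time, value
            continue
        elapsed_min = (time - last_time).total_seconds() / 60
        step = value - last_value
        # Accept only non-negative steps within the allowed growth.
        if 0 <= step <= max_rate_per_min * elapsed_min:
            kept.append(time)
            last_time, last_value = time, value
    return series.loc[kept]

# Sample data reconstructed from the table in the question (assumption).
index = pd.date_range("2023-01-01 00:05", periods=6, freq="min", name="bucket")
sample = pd.DataFrame(
    {
        "oph_single_positive_outlier": [1, 2, 3, 10, 4, 5],
        "oph_single_negative_outlier": [1, 2, 3, 2, 4, 5],
        "oph_1pos_1neg_outlier": [1, 10, 3, 2, 4, 5],
        "oph_multipos_at_once_outlier": [1, 10, 10, 10, 4, 5],
    },
    index=index,
)

# Clean every column independently, so each one can be plotted on its own.
cleaned = {name: drop_jumps(sample[name]) for name in sample.columns}
for name, series in cleaned.items():
    print(name, series.tolist())
```

Because each column is cleaned separately, the results have different lengths; that keeps an outlier in one column from deleting valid readings in another, which seems to fit the one-plot-per-column goal. The threshold would need tuning for real data.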