Detecting outliers in a DataFrame


I have a DataFrame that looks like this:

                     oph_single_positive_outlier  oph_single_negative_outlier  oph_1pos_1neg_outlier  oph_multipos_at_once_outlier
bucket                                                                                                                            
2023-01-01 00:05:00                            1                            1                      1                             1
2023-01-01 00:06:00                            2                            2                     10                            10
2023-01-01 00:07:00                            3                            3                      3                            10
2023-01-01 00:08:00                           10                            2                      2                            10
2023-01-01 00:09:00                            4                            4                      4                             4
2023-01-01 00:10:00                            5                            5                      5                             5
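
The post never shows how `data` is defined. To reproduce the examples, one possible reconstruction from the table above is sketched here; the dict layout and the use of `pd.date_range` are assumptions, and only the values come from the table.

```python
import pandas as pd

# Reconstructed from the table in the question; the original `data` is not shown.
data = {
    "bucket": pd.date_range("2023-01-01 00:05:00", periods=6, freq="min"),
    "oph_single_positive_outlier": [1, 2, 3, 10, 4, 5],
    "oph_single_negative_outlier": [1, 2, 3, 2, 4, 5],
    "oph_1pos_1neg_outlier": [1, 10, 3, 2, 4, 5],
    "oph_multipos_at_once_outlier": [1, 10, 10, 10, 4, 5],
}

# Use the timestamps as the index, as the question's code does.
df = pd.DataFrame(data).set_index("bucket")
print(df)
```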

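I would like to know whether there is a way to remove the outliers. For example, in oph_single_positive_outlier the outlier is the value 10. In oph_single_negative_outlier, the fourth row should be removed. In oph_1pos_1neg_outlier, the second row (10) and the fourth row (2) should be removed. Basically, every row where the value decreases or increases too rapidly should be removed. Afterwards, I want to plot a chart for each column.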

I tried it with this code:

import pandas as pd

orig_df = pd.DataFrame(data)

# Set the 'bucket' column as the index
orig_df.set_index('bucket', inplace=True)
print("original DataFrame")
display(orig_df)

# Remove invalid data
orig_df = orig_df.dropna()
outliers_single_df = pd.DataFrame()

column= "oph_single_negative_outlier"
columns = orig_df.columns
df = orig_df.pop(columns[0]).to_frame()
df_len_at_start = df.shape[0]
while (True):    
    df['DifferencePumping'] = pd.to_timedelta(df[columns[0]].diff(), unit="h")
    df['DifferencePumpingShifted'] = df['DifferencePumping'].shift(-1)
    df['bucket'] = df.index

    df['DifferenceBucket'] = df["bucket"].diff()
    df['DifferenceBucketShifted'] = df['DifferenceBucket'].shift(-1)
    
    df["decreasing"] = df["DifferencePumping"] < pd.Timedelta(0)
    df["rapid_increase"] = df["DifferenceBucketShifted"] < df["DifferencePumpingShifted"] - pd.Timedelta(seconds=6)

    df["potential_outliers"] = df.decreasing | df.rapid_increase   

    try:
        outliers_single_df = pd.concat([outliers_single_df, df.loc[df.potential_outliers].head(1)])
        df = df.drop(df.loc[df.potential_outliers].index[0])
        
    except IndexError as index_error:
        df = df.drop(columns=['DifferencePumping', 'DifferencePumpingShifted', "bucket", "DifferenceBucket", "DifferenceBucketShifted", "decreasing", "rapid_increase", "potential_outliers"])

        print("Filtered all outliers.")
        break

But this is the output I get:

Filtered all outliers.
                     oph_single_positive_outlier
bucket                                          
2023-01-01 00:08:00                           10

Removed outliers:
                     oph_single_positive_outlier DifferencePumping DifferencePumpingShifted              bucket DifferenceBucket DifferenceBucketShifted  decreasing  rapid_increase  potential_outliers
bucket                                                                                                                                                                                                  
2023-01-01 00:05:00                            1               NaT          0 days 01:00:00 2023-01-01 00:05:00              NaT         0 days 00:01:00       False            True                True
2023-01-01 00:06:00                            2               NaT          0 days 01:00:00 2023-01-01 00:06:00              NaT         0 days 00:01:00       False            True                True
2023-01-01 00:07:00                            3               NaT          0 days 07:00:00 2023-01-01 00:07:00              NaT         0 days 00:01:00       False            True                True
2023-01-01 00:09:00                            4 -1 days +18:00:00          0 days 01:00:00 2023-01-01 00:09:00  0 days 00:01:00         0 days 00:01:00        True            True                True
2023-01-01 00:10:00                            5 -1 days +19:00:00                      NaT 2023-01-01 00:10:00  0 days 00:02:00                     NaT        True           False                True

Original df
                     oph_single_negative_outlier  oph_1pos_1neg_outlier  oph_multipos_at_once_outlier
bucket                                                                                               
2023-01-01 00:05:00                            1                      1                             1
2023-01-01 00:06:00                            2                     10                            10
2023-01-01 00:07:00                            3                      3                            10
2023-01-01 00:08:00                            2                      2                            10
2023-01-01 00:09:00                            4                      4                             4
2023-01-01 00:10:00                            5                      5                             5

cleaned df
                     oph_single_positive_outlier
bucket                                          
2023-01-01 00:08:00                           10

Deleted 5 rows.
python pandas dataframe numpy outliers
1 Answer

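Is there a specific reason you want to use the rapid increase/decrease approach instead of the IQR method to remove the outliers? I think the result is the same for this small dataset: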

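import pandas as pd

# `data` is the input from the question (not shown there); 'bucket' holds the timestamps.
df = pd.DataFrame(data)

df['bucket'] = pd.to_datetime(df['bucket'])
df.set_index('bucket', inplace=True)

# IQR method
def remove_outliers_iqr(df, column_name, multiplier=1.5):
    """Return only the rows whose value in column_name lies within the IQR fences."""
    q1 = df[column_name].quantile(0.25)
    q3 = df[column_name].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    # The mask drops whole rows, so a row removed for one column is lost for all columns.
    return df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]

columns_to_process = df.columns

# Filter column by column; each pass only sees the rows that survived the previous one.
for column in columns_to_process:
    df = remove_outliers_iqr(df, column)

print(df)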
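
For comparison, the rule described in the question (drop values that decrease or rise too quickly) can also be sketched directly, one column at a time. This is only an illustration under stated assumptions: the function name `drop_rate_outliers`, the parameter `max_rate_per_minute`, and the limit of 1 unit per minute are hypothetical choices made to fit the sample data, and the sample values are retyped from the question's table. Each value is compared with the last kept value rather than with its direct predecessor, so a spike does not also flag the normal value that follows it. The sketch assumes the first value of each column is trustworthy.

```python
import pandas as pd

# Sample values retyped from the question's table (the original `data` is not shown).
df = pd.DataFrame(
    {
        "oph_single_positive_outlier": [1, 2, 3, 10, 4, 5],
        "oph_single_negative_outlier": [1, 2, 3, 2, 4, 5],
        "oph_1pos_1neg_outlier": [1, 10, 3, 2, 4, 5],
        "oph_multipos_at_once_outlier": [1, 10, 10, 10, 4, 5],
    },
    index=pd.date_range("2023-01-01 00:05:00", periods=6, freq="min", name="bucket"),
)


def drop_rate_outliers(series, max_rate_per_minute=1.0):
    """Keep values that do not decrease and rise at most max_rate_per_minute.

    Greedy single pass: every value is checked against the last kept value.
    Assumes the first value is not itself an outlier.
    """
    keep = []
    last_time, last_value = None, None
    for time, value in series.items():
        if last_value is None:
            ok = True  # first observation is accepted as the reference point
        else:
            minutes = (time - last_time).total_seconds() / 60
            rise = value - last_value
            ok = 0 <= rise <= max_rate_per_minute * minutes
        keep.append(ok)
        if ok:
            last_time, last_value = time, value
    return series.loc[keep]


# Clean each column on its own, since the outliers sit in different rows per column.
cleaned = {name: drop_rate_outliers(df[name]) for name in df.columns}
for name, col in cleaned.items():
    print(name, col.tolist())
```

With the assumed limit, this keeps 1, 2, 3, 4, 5 in the first two columns and 1, 3, 4, 5 in oph_1pos_1neg_outlier, which matches the removals described in the question. Because every column is returned as its own Series, each one can be plotted separately without losing rows that are only outliers in another column.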