Why does replacing outliers with np.nan remove all non-zero data in the column?

Question

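I am trying to replace the outliers (rather than the entire column) with NaN values, so that I can then fill every NaN with the median. So far I have tried two approaches: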

import numpy as np
import pandas as pd

def outlier_thresholds(dataframe, col_name, low_q = 0.05, up_q= 0.95):
    q1, q3 = np.nanpercentile(dataframe[col_name], [low_q, up_q])
    iqr = q3 - q1
    lower_thres = q1 - (1.5 * iqr)
    upper_thres = q3 + (1.5 * iqr)
    return lower_thres, upper_thres

def check_outlier(dataframe, col_name):
    lower_thres, upper_thres = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > upper_thres) | 
                 (dataframe[col_name] < lower_thres)].any(axis=None):
        return True
    return False

outlier_list = []
for col in num_list:
    if check_outlier(df, col):
        outlier_list.append(col)

def remove_outlier(dataframe, col_name):
    lower_thres, upper_thres = outlier_thresholds(dataframe, col_name)
    df_without = dataframe[~((dataframe[col_name] < lower_thres) | (dataframe[col_name] > upper_thres))]
    return df_without

for col in outlier_list:
    new_df = remove_outlier(df, col)

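The remove_outlier function returns a new dataset with the outliers removed. From the plot I can see that the column still contains values. However, I did not write that code; my own code does not work, and I would like to understand why.

The drop_outliers function appears to replace all of the data in the column with NaN, so I end up with an empty plot/column, because every remaining value is either NaN or 0.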

def drop_outliers(dataframe, col_name):
    lower_thres, upper_thres = outlier_thresholds(dataframe, col_name)
    dataframe.loc[(dataframe[col_name] < lower_thres) | (dataframe[col_name] > upper_thres), dataframe[col_name]] = np.nan

for col in outlier_list:
    drop_outliers(df, col)

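I have checked the duplicates of this question, so I know I could take a different route to solve the problem, but I still have not figured out what is wrong with my code. Could I get some help?

Edit: sample data: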

sample_df = {0: 0.0,
 1: 0.0,
 2: 0.0,
 3: 0.0,
 4: 0.0,
 5: 0.0,
 6: 0.0,
 7: 32.0,
 8: 0.0,
 9: 0.0,
 10: 0.0,
 11: 0.0,
 12: 0.0,
 13: 0.0,
 14: 0.0,
 15: 0.0,
 16: 0.0,
 17: 0.0,
 18: 0.0,
 19: 0.0,
 20: 0.0,
 21: 0.0,
 22: 0.0,
 23: 0.0,
 24: 668.0,
 25: 0.0,
 26: 486.0,
 27: 0.0,
 28: 0.0,
 29: 0.0}

Answer 1

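You are passing quantiles where NumPy expects percentiles.

The NumPy documentation for np.nanpercentile() says:

q : array_like of float

Percentile or sequence of percentiles to compute, which must be between 0 and 100 inclusive.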

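If you pass 0.05 and 0.95, NumPy computes the 0.05th and 0.95th percentiles. Since more than 1% of the values are zero, both q1 and q3 are zero, which means your IQR is zero as well. The lower and upper thresholds then both collapse to zero, so every non-zero value is treated as an outlier.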
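To make the effect concrete, here is a minimal sketch that runs both parameter choices on the 30 sample values from the question. The helper name iqr_thresholds and the variable names are made up for this illustration; only np.nanpercentile comes from the original code.

```python
import numpy as np

# The 30 sample values from the question: 27 zeros plus 32, 668 and 486.
values = np.array([0.0] * 7 + [32.0] + [0.0] * 16 + [668.0, 0.0, 486.0] + [0.0] * 3)

def iqr_thresholds(data, low, high):
    """Hypothetical helper: return the (lower, upper) IQR fences.

    `low` and `high` are handed to np.nanpercentile unchanged,
    so they are interpreted on a 0-100 scale.
    """
    q1, q3 = np.nanpercentile(data, [low, high])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def count_flagged(data, lower, upper):
    """Count the values that fall outside the fences."""
    return int(((data < lower) | (data > upper)).sum())

# 0.05 and 0.95 are read as the 0.05th and 0.95th percentiles.
bad_lower, bad_upper = iqr_thresholds(values, 0.05, 0.95)
flagged_bad = count_flagged(values, bad_lower, bad_upper)

# 5 and 95 are read as the 5th and 95th percentiles.
ok_lower, ok_upper = iqr_thresholds(values, 5, 95)
flagged_ok = count_flagged(values, ok_lower, ok_upper)

print("0.05/0.95:", bad_lower, bad_upper, flagged_bad)
print("5/95:     ", ok_lower, ok_upper, flagged_ok)
```

With this sample, the first call yields fences of 0 and 0 and flags all three non-zero values. The second call yields an upper fence of roughly 704, so none of the values is flagged.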

You can fix this either by using

low_q = 5, up_q= 95

or by switching to np.nanquantile(), which takes quantiles between 0 and 1.
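As a rough sketch of the np.nanquantile() route, the question's helper could be adapted as below. Several things here are assumptions for illustration and are not in the original post: the helper name outliers_to_nan, the column name "x", the choice to return a copy, and the demo data, where 5000.0 is a made-up extreme value. The sketch also passes the column label col_name to .loc, whereas drop_outliers in the question passes dataframe[col_name] (the column's values) as the column indexer, which is probably not what was intended.

```python
import numpy as np
import pandas as pd

def outlier_thresholds(dataframe, col_name, low_q=0.05, up_q=0.95):
    # np.nanquantile takes fractions in [0, 1], so 0.05 and 0.95 mean 5% and 95%.
    q1, q3 = np.nanquantile(dataframe[col_name], [low_q, up_q])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers_to_nan(dataframe, col_name):
    """Hypothetical helper: return a copy with outliers in `col_name` set to NaN."""
    lower, upper = outlier_thresholds(dataframe, col_name)
    result = dataframe.copy()
    mask = (result[col_name] < lower) | (result[col_name] > upper)
    # Select rows with the boolean mask and the column by its label.
    result.loc[mask, col_name] = np.nan
    return result

# Made-up demo data: 27 zeros, two moderate values and one extreme value.
demo = pd.DataFrame({"x": [0.0] * 27 + [32.0, 486.0, 5000.0]})

cleaned = outliers_to_nan(demo, "x")

# Fill the NaN values with the median of the remaining data, as the question intends.
filled = cleaned.fillna({"x": cleaned["x"].median()})

print(cleaned["x"].isna().sum(), "value(s) replaced with NaN")
```

Returning a copy keeps the original frame intact, which makes it easier to compare the data before and after; the in-place style from the question works too once the thresholds are computed on the correct scale.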
