Identifying outliers in a Python DataFrame

Question

I am trying to identify outliers in my clustering model using the standard deviation.

# calculate summary statistics
rfm_mean, rfm_std = mean(rfm), std(rfm)

# define the cut-off bounds (3 standard deviations from the mean)
cut_off = rfm_std * 3
lower, upper = rfm_mean - cut_off, rfm_mean + cut_off

# identify outliers
outliers = [x for x in rfm if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))

I'm not sure why I'm getting this traceback error:

Invalid comparison between dtype=float64 and str

Any help with this would be greatly appreciated.

Thanks in advance for your support!

Answer 1

You cannot compare a float64 with a string. This is probably happening here:

outliers = [x for x in rfm if x < lower or x > upper]

Use DataFrame.astype(dtype, copy=True, errors='raise') to convert to the correct type before applying the comparison operators.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
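As a minimal sketch of that suggestion, the example below converts string columns to float64 and then applies the 3-standard-deviation rule column by column. The sample data and the names rfm_num, mask and outlier_rows are assumptions made for illustration; the asker's real rfm is not shown. Note also that iterating over a DataFrame yields its column labels, so if rfm is a DataFrame, the list comprehension in the question compares label strings against the bounds rather than the values. A vectorized comparison avoids this.

```python
import pandas as pd

# Hypothetical sample data: 'monetary' was read in as strings
rfm = pd.DataFrame({
    "recency": [10, 12, 11, 9, 13] * 4 + [300],
    "monetary": ["100", "110", "95", "105", "98"] * 4 + ["101"],
})

# Convert every column to float64; errors='raise' (the default) fails loudly on bad values
rfm_num = rfm.astype("float64")

# Per-column mean, standard deviation and 3-sigma bounds
rfm_mean, rfm_std = rfm_num.mean(), rfm_num.std()
lower, upper = rfm_mean - 3 * rfm_std, rfm_mean + 3 * rfm_std

# Boolean mask: True where a value lies outside its column's bounds
mask = (rfm_num < lower) | (rfm_num > upper)

# Keep the rows that contain at least one out-of-bounds value
outlier_rows = rfm_num[mask.any(axis=1)]
print("Identified outliers: %d" % len(outlier_rows))
```

If some values cannot be parsed as numbers, astype raises an error; pd.to_numeric(series, errors='coerce') is an alternative that turns such values into NaN instead.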

Answer 2

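I'm not sure whether this will help you, but hopefully you can take something from it.

I ran into a lot of problems trying to find outliers in specific columns of an NFL DataFrame I was investigating, so I took this code (provided during a session with one of our coding instructors) and put it into a simple function that can be used in a loop to identify them.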

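def outliers(df, col):
    # Quartiles and interquartile range of the column
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Rows whose value falls outside the 1.5 * IQR fences
    points_outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
    return points_outliers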
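To show what the IQR rule flags, here is a small self-contained sketch on made-up numbers. The function is repeated under the hypothetical name iqr_outliers so the block runs on its own, and the demo DataFrame is invented for illustration; it is not the NFL data.

```python
import pandas as pd

def iqr_outliers(df, col):
    # Same 1.5 * IQR rule as the function above
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]

# Hypothetical scores with one obviously extreme value
demo = pd.DataFrame({"PF": [20, 22, 21, 23, 24, 22, 21, 60]})

result = iqr_outliers(demo, "PF")
print(result)  # only the row with PF == 60 falls outside the fences
```

Here Q1 is 21 and Q3 is 23.25, so the fences are 17.625 and 26.625, and 60 is the only value outside them.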

I was also interested in box plots for them, so I looped over the columns with the following:

import matplotlib.pyplot as plt
import seaborn as sns

# Put the columns I wanted to review in a list
cols_list = ['PF', 'PA', 'TotalOffYds', 'TotalYdsAllowed', 'NetYds']

# Loop through the list
for col in cols_list:
    outlier_data = outliers(NFL_FULL_df, col)   # Use function created earlier
    if not outlier_data.empty:
        print(f"Outliers in {col}:")
        print(outlier_data)

        # Create a box plot for the column
        plt.figure(figsize=(12, 8))
        sns.boxplot(x=NFL_FULL_df[col])
        plt.title(f"Box Plot of {col}")
        plt.show()
    else:
        print(f"No significant outliers found in {col}.")

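I hope some of this helps. I also came across this post, which may be useful because it returns, for each column, the data points whose Z-score exceeds a defined threshold.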
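The linked post's code is not reproduced here, so the following is only a rough sketch of the idea as described: for each column, return the data points whose absolute Z-score exceeds a threshold. The function name zscore_outliers, the threshold of 3.0, and the demo data are all assumptions made for illustration.

```python
import pandas as pd

def zscore_outliers(df, cols, threshold=3.0):
    """Return a dict mapping each column to the values whose |z| exceeds threshold."""
    found = {}
    for col in cols:
        # Z-score: distance from the column mean in units of standard deviation
        z = (df[col] - df[col].mean()) / df[col].std()
        found[col] = df.loc[z.abs() > threshold, col]
    return found

# Hypothetical data: 'PF' has one extreme value, 'PA' has none
demo = pd.DataFrame({
    "PF": [20, 22, 21, 23, 24] * 4 + [90],
    "PA": [18, 19, 20, 21, 22] * 4 + [20],
})

flagged = zscore_outliers(demo, ["PF", "PA"])
for col, values in flagged.items():
    print(col, values.tolist())
```

Unlike the IQR rule, the Z-score uses the mean and standard deviation, which are themselves pulled by extreme values, so on small samples it can miss outliers that the IQR rule catches.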
