How to distinguish failure rates with different build volumes for outlier detection

Question

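Suppose a company manufactures product A. Below are the number of products built and the number of products that failed over a period of time. I need to find the anomalies in this data.

I calculated the failure rate and thought of running anomaly detection on it. But in this case the highest failure rate, 40,000 per million, occurs where the build volume is low, so it may be misleading. How can we address this? What other approaches are there for this problem?

If 20 out of 100 products fail, and 200 out of 1000 products fail, the computed failure rate is the same 20% in both cases. How do we distinguish between the two?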
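The arithmetic behind this concern can be sketched as follows. The function name `failures_per_million` is hypothetical and chosen only for illustration; the counts are the two cases mentioned above.

```python
def failures_per_million(failed: int, built: int) -> float:
    """Failure rate scaled to failures per one million units built."""
    return failed * 1_000_000 / built


# The two cases from the question: very different build volumes.
small_build = failures_per_million(20, 100)
large_build = failures_per_million(200, 1000)

# Both cases collapse to the same rate, so the rate alone hides the build volume.
print(small_build, large_build, small_build == large_build)
```

The rate discards the denominator, which is why a rate computed from a small build looks just as trustworthy as one computed from a large build.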

Answer 1

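Although the dataset is very small for any model, I would still like to force a linear regression model here. You may reject this assumption, but I am giving it a try.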

Observe that product_produced and product_failed have a strong linear relationship.

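import statsmodels.api as sm

# df is the DataFrame from the question, with columns product_produced and product_failed
X = sm.add_constant(df["product_produced"])  # add an intercept term
model = sm.OLS(df["product_failed"], X).fit()  # regress failures on units produced
print(model.summary())  # print the model summary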

Now, to find the anomalous months in the data, we compute the residuals, i.e. actual - predicted.

df["res"] = model.resid  # residuals: actual - predicted
# standard deviation of the residuals (with an intercept in the model, their mean is ~0)
std_res = df["res"].std()
# threshold for anomalies ~ 2 * std_res (can be tweaked as needed)
thres = 2 * std_res
df["is_anomaly"] = df["res"].abs() > thres
# output anomalies
print(f"Anomalies:\n{df[df['is_anomaly']]}")
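For readers who do not have the original data at hand, here is a minimal, self-contained sketch of the same residual-thresholding idea. It rests on several assumptions: the monthly numbers are invented for illustration (a flat 2% failure rate with extra failures injected in one month), the helper `fit_line` is hypothetical, and the least-squares fit is done by hand with the standard library rather than with statsmodels.

```python
from statistics import mean, stdev

# Hypothetical monthly data, invented for illustration (not the question's data).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
produced = [1000 * k for k in range(1, 13)]
failed = [20 * k for k in range(1, 13)]  # a flat 2% failure rate...
failed[5] += 60                          # ...plus extra failures injected in Jun


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(produced, failed)

# Residual = actual failures - failures predicted from the build volume.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(produced, failed)]

# 2 * sample standard deviation (ddof=1, the same default as pandas .std()).
threshold = 2 * stdev(residuals)

anomalies = [m for m, r in zip(months, residuals) if abs(r) > threshold]
print(anomalies)
```

Because the prediction depends on how many units were built, a month is flagged for having more failures than its build volume would suggest, not for having a high raw rate. The 2-standard-deviation cutoff is a tunable choice, as noted above.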

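If we plot res against product_failed (annotated by month), we get the following plot:

[Figure: res versus product_failed, with each point labeled by month]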
