How to find and correct outliers in highly fluctuating data points


Suppose I have an array {5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90}. How can I split it into {30,46,70,99,90} and {5,7,8,9,10,1,3,4,12,13,15,16,18,19}?

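Ultimately, my goal is to correct all the outliers and restore the array to {5,6,7,8,9,10,1,2,3,4,11,12,13,14,15,16,17,18,19,20}. (As you can see, the array as a whole does not rise or fall consistently; it only rises or falls in segments.)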

Any hints or pointers would be much appreciated.

outliers
1 Answer

Since you didn't specify a programming language or tool, I'll answer in Python.

Using the standard deviation:

import numpy as np

s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])
s.mean(), s.std() # (24.45, 28.178848450566605)

num_std_dev = 2 # tweak this for your use case
lower_bound = s.mean() - num_std_dev*s.std()
upper_bound = s.mean() + num_std_dev*s.std()

filtered_within_bounds = s[(s > lower_bound) & (s < upper_bound)]
# [ 5, 30,  7,  8,  9, 10,  1, 46,  3,  4, 70, 12, 13, 14, 15, 16, 18, 19]

filtered_outside_bounds = s[(s <= lower_bound) | (s >= upper_bound)]
# [99, 90]
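To see what "tweak this for your use case" means in practice, here is a minimal sketch that reruns the same rule for a few multipliers. The helper name `outliers_by_std` is hypothetical, introduced only for this illustration.

```python
import numpy as np

s = np.array([5, 30, 7, 8, 9, 10, 1, 46, 3, 4, 70, 12, 13, 14, 15, 16, 99, 18, 19, 90])

def outliers_by_std(values, num_std_dev):
    """Return values at or beyond num_std_dev standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    lower = mean - num_std_dev * std
    upper = mean + num_std_dev * std
    return values[(values <= lower) | (values >= upper)]

for k in (1, 2, 3):
    print(k, outliers_by_std(s, k))
# 1 [70 99 90]
# 2 [99 90]
# 3 []
```

Because the large values inflate both the mean and the standard deviation, none of these multipliers flags 30 or 46 on this data.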

Or using quantiles (the interquartile range, IQR):

import numpy as np

s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])

lower_q = np.quantile(s, 0.25)
upper_q = np.quantile(s, 0.75)
iqr = upper_q - lower_q

lower_bound = lower_q - 1.5*iqr
upper_bound = upper_q + 1.5*iqr

filtered_within_bounds = s[(s > lower_bound) & (s < upper_bound)]
# [ 5, 30,  7,  8,  9, 10,  1,  3,  4, 12, 13, 14, 15, 16, 18, 19]

filtered_outside_bounds = s[(s <= lower_bound) | (s >= upper_bound)]
# [46, 70, 99, 90]

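Then, if you want to clip the values to those bounds instead of dropping them, you can use np.clip (the output below uses the IQR bounds computed above):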

s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])

clipped_s = np.clip(s, a_min=lower_bound, a_max=upper_bound)
# [ 5.  , 30.  ,  7.  ,  8.  ,  9.  , 10.  ,  1.  , 42.75,  3.  ,
#   4.  , 42.75, 12.  , 13.  , 14.  , 15.  , 16.  , 42.75, 18.  ,
#  19.  , 42.75]
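Clipping pins every outlier to the same bound (42.75 here), which does not bring the values back in line with their neighbours. One possible alternative, not part of the answer above, is to replace each flagged value by linear interpolation between the nearest unflagged values. This is only a sketch under that assumption; the names `is_outlier` and `repaired` are illustrative.

```python
import numpy as np

s = np.array([5, 30, 7, 8, 9, 10, 1, 46, 3, 4, 70, 12, 13, 14, 15, 16, 99, 18, 19, 90],
             dtype=float)

# Flag outliers with the same 1.5 * IQR rule as above.
lower_q, upper_q = np.quantile(s, [0.25, 0.75])
iqr = upper_q - lower_q
is_outlier = (s <= lower_q - 1.5 * iqr) | (s >= upper_q + 1.5 * iqr)

# Replace each flagged value by interpolating over the positions of the kept values.
idx = np.arange(s.size)
repaired = s.copy()
repaired[is_outlier] = np.interp(idx[is_outlier], idx[~is_outlier], s[~is_outlier])

print(repaired.tolist())
# [5.0, 30.0, 7.0, 8.0, 9.0, 10.0, 1.0, 2.0, 3.0, 4.0, 8.0, 12.0, 13.0, 14.0,
#  15.0, 16.0, 17.0, 18.0, 19.0, 19.0]
```

This does not reproduce the target array from the question exactly: 30 lies inside the IQR bounds and is left alone, 70 becomes 8 (the midpoint of 4 and 12) rather than 11, and the trailing 90 becomes 19 because np.interp holds the last kept value at the edge.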

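In the end, you can treat your sample however you see fit, but flagging outliers with the standard deviation or the IQR is a standard, defensible way to decide which values can safely be ignored.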
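Both rules above use a single pair of bounds for the whole array, so a value such as 30 that is only unusual relative to its neighbours is not flagged. Since the question notes that the series rises and falls in segments, a local rule may fit better. The following is a sketch of a different technique, a rolling-median (Hampel-style) filter, and not the method from the answer above. The window size and threshold are assumed values, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

s = np.array([5, 30, 7, 8, 9, 10, 1, 46, 3, 4, 70, 12, 13, 14, 15, 16, 99, 18, 19, 90],
             dtype=float)

half_window = 2   # assumption: compare each point with 2 neighbours on each side
n_sigmas = 3.0    # assumption: flag deviations above 3 robust standard deviations
k = 1.4826        # scales the MAD to a standard-deviation estimate for normal data

# Pad by reflection so every point, including the ends, has a full window.
padded = np.pad(s, half_window, mode="reflect")
windows = sliding_window_view(padded, 2 * half_window + 1)  # shape (len(s), 5)

# Local median and local median absolute deviation (MAD) per window.
local_median = np.median(windows, axis=1)
local_mad = np.median(np.abs(windows - local_median[:, None]), axis=1)

is_outlier = np.abs(s - local_median) > n_sigmas * k * local_mad

print(s[is_outlier])   # [30. 46. 70. 99. 90.]
print(s[~is_outlier])
```

With these assumed settings the flagged values match the split asked for in the question. Other data may need a different window or threshold, and the resulting mask can then drive whichever replacement step you prefer.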
