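Suppose I have an array {5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90}. How can I separate {30,46,70,99,90} from {5,7,8,9,10,1,3,4,12,13,14,15,16,18,19}?
In fact, my ultimate goal is to correct all the outliers and restore the array to {5,6,7,8,9,10,1,2,3,4,11,12,13,14,15,16,17,18,19,20}. (As you can see, the array as a whole does not rise or fall consistently; it only rises or falls in segments.)
Any hints or pointers would be greatly appreciated.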
Since you didn't specify a programming language or tool, I'll answer in Python.
Using standard deviation:
import numpy as np
s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])
s.mean(), s.std() # (24.45, 28.178848450566605)
num_std_dev = 2 # tweak this for your use case
lower_bound = s.mean() - num_std_dev*s.std()
upper_bound = s.mean() + num_std_dev*s.std()
filtered_within_bounds = s[(s > lower_bound) & (s < upper_bound)]
# [ 5, 30, 7, 8, 9, 10, 1, 46, 3, 4, 70, 12, 13, 14, 15, 16, 18, 19]
filtered_outside_bounds = s[(s <= lower_bound) | (s >= upper_bound)]
# [99, 90]
Or using quantiles (the interquartile range, IQR):
import numpy as np
s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])
lower_q = np.quantile(s, 0.25)
upper_q = np.quantile(s, 0.75)
iqr = upper_q - lower_q
lower_bound = lower_q - 1.5*iqr
upper_bound = upper_q + 1.5*iqr
filtered_within_bounds = s[(s > lower_bound) & (s < upper_bound)]
# [ 5, 30, 7, 8, 9, 10, 1, 3, 4, 12, 13, 14, 15, 16, 18, 19]
filtered_outside_bounds = s[(s <= lower_bound) | (s >= upper_bound)]
# [46, 70, 99, 90]
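Note that neither set of global bounds flags 30, because 30 is only unusual relative to its neighbours (5 and 7), not relative to the array as a whole. A possible workaround, sketched here as an assumption rather than a standard recipe, is to compare each value with the median of a small window around it. The names half_width and threshold are hypothetical tuning knobs, and the value 15 was picked by hand for this particular array.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])

half_width = 2   # hypothetical: 2 neighbours on each side (window of 5)
threshold = 15   # hypothetical: hand-picked for this data, tweak for yours

# Mirror the ends so that every element gets a full window
padded = np.pad(s, half_width, mode="reflect")
windows = sliding_window_view(padded, 2 * half_width + 1)  # shape (20, 5)
local_median = np.median(windows, axis=1)

# Flag values that sit far away from the median of their neighbourhood
is_outlier = np.abs(s - local_median) > threshold
outliers = s[is_outlier]
# [30, 46, 70, 99, 90]
inliers = s[~is_outlier]
# [ 5,  7,  8,  9, 10,  1,  3,  4, 12, 13, 14, 15, 16, 18, 19]
```

This happens to reproduce the split asked for in the question, but it depends on the window size and threshold, so treat it as a starting point rather than a general rule.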
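Then, if you want to clip the values to those bounds, you can use np.clip. The example below reuses lower_bound and upper_bound from the IQR example: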
s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90])
clipped_s = np.clip(s, a_min=lower_bound, a_max=upper_bound)
# [ 5. , 30. , 7. , 8. , 9. , 10. , 1. , 42.75, 3. ,
# 4. , 42.75, 12. , 13. , 14. , 15. , 16. , 42.75, 18. ,
# 19. , 42.75]
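Clipping pins every outlier to the same bound (42.75), which is not the same as the "restored" array from the question. If the goal is to fill each outlier in from its neighbours, one option is linear interpolation over the flagged positions with np.interp. This is a minimal sketch assuming the IQR mask from above; any other boolean outlier mask could be substituted, and the name repaired is just for illustration.

```python
import numpy as np

s = np.array([5,30,7,8,9,10,1,46,3,4,70,12,13,14,15,16,99,18,19,90], dtype=float)

# Same IQR rule as above, expressed as a boolean mask
q1, q3 = np.quantile(s, [0.25, 0.75])
iqr = q3 - q1
is_outlier = (s <= q1 - 1.5 * iqr) | (s >= q3 + 1.5 * iqr)

# Replace each flagged value by interpolating between the nearest unflagged values
idx = np.arange(s.size)
repaired = s.copy()
repaired[is_outlier] = np.interp(idx[is_outlier], idx[~is_outlier], s[~is_outlier])
# [ 5., 30.,  7.,  8.,  9., 10.,  1.,  2.,  3.,  4.,  8., 12., 13.,
#  14., 15., 16., 17., 18., 19., 19.]
```

This gets close to the target but not all the way: 30 survives because the IQR bounds never flag it, 70 becomes 8 rather than 11 because interpolation knows nothing about the array restarting at 1, and the trailing 90 becomes 19 because np.interp holds the last known value instead of extrapolating.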
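At the end of the day, you can treat your sample however you like, but standard deviation and IQR are two widely used, well-understood criteria for deciding which values can safely be ignored.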