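I am trying to find outliers in my own way. How? Plot the histogram and look for isolated edges (bins) that have a small count and whose neighbours are either zero-count bins or the boundary of the histogram. Usually they sit at the far ends of the histogram. These are probably outliers. Detect them and drop them. What kind of data is it? Time series from the field. Occasionally, when a sensor fails to transmit data in time, the data logger stores strange numbers (sensor data is around 50-100; outliers might be -10000 or 1000). They are transient, show up a few times in a year of data, and make up less than 1% of the total samples.
My code: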
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# obtained result is
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])
# Repeat the last count in vals. Why: vals always has one entry fewer than edges
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace all zero counts with NaN so that these rows are never flagged.
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Identify the isolated edges by looking at the number of samples, say, < 50
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# plot histogram
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()
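Current output:
This is not the correct output. Why? There is only one isolated edge, the one at the very start at value 0. However, my code also flags the edges at 43 and 46 as isolated, merely because their counts are small.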
vedf =
edges vals IsolatedEdge?
0 0.000000 38.0 True
1 2.911165 NaN False
2 5.822330 NaN False
3 8.733495 NaN False
4 11.644660 NaN False
5 14.555825 NaN False
6 17.466990 NaN False
7 20.378155 NaN False
8 23.289320 NaN False
9 26.200485 NaN False
10 29.111650 NaN False
11 32.022815 NaN False
12 34.933980 NaN False
13 37.845145 NaN False
14 40.756310 NaN False
15 43.667475 1.0 True
16 46.578640 11.0 True
17 49.489805 126664.0 False
18 52.400970 13853.0 False
19 55.312135 4536.0 False
20 58.223300 4536.0 False
Expected output:
vedf =
edges vals IsolatedEdge?
0 0.000000 38.0 True
1 2.911165 NaN False
2 5.822330 NaN False
3 8.733495 NaN False
4 11.644660 NaN False
5 14.555825 NaN False
6 17.466990 NaN False
7 20.378155 NaN False
8 23.289320 NaN False
9 26.200485 NaN False
10 29.111650 NaN False
11 32.022815 NaN False
12 34.933980 NaN False
13 37.845145 NaN False
14 40.756310 NaN False
15 43.667475 1.0 False
16 46.578640 11.0 False
17 49.489805 126664.0 False
18 52.400970 13853.0 False
19 55.312135 4536.0 False
20 58.223300 4536.0 False
Once I know that a particular edge is isolated, I can drop all the samples that fall into it.
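That removal step could look roughly like the sketch below. It assumes the isolated bins have already been identified as a boolean mask with one entry per bin; the helper name `drop_isolated_samples`, the toy `samples` series, and the hand-written mask are hypothetical and only there for illustration. It also assumes the series contains no NaN values and that `edges` comes from `np.histogram` on the same data.

```python
import numpy as np
import pandas as pd


def drop_isolated_samples(series, edges, is_isolated):
    """Return `series` without the samples that fall into isolated bins.

    `edges` is expected to have len(is_isolated) + 1 entries,
    as returned by np.histogram.
    """
    edges = np.asarray(edges, dtype=float)
    is_isolated = np.asarray(is_isolated, dtype=bool)
    # Map every sample to the index of the bin that contains it.
    bin_idx = np.searchsorted(edges, series.to_numpy(), side="right") - 1
    # np.histogram closes the last bin on the right, so the maximum value
    # must be clipped back into the last bin.
    bin_idx = np.clip(bin_idx, 0, len(is_isolated) - 1)
    # Keep only the samples whose bin is not flagged.
    return series[~is_isolated[bin_idx]]


# Toy data: one stray reading at 0 and five plausible sensor readings.
samples = pd.Series([0.0, 50.0, 51.0, 52.5, 50.5, 58.0])
counts, edges = np.histogram(samples, bins=20)

# Hypothetical mask: pretend only the first bin was found to be isolated.
is_isolated = np.zeros(len(counts), dtype=bool)
is_isolated[0] = True

cleaned = drop_isolated_samples(samples, edges, is_isolated)
print(cleaned.tolist())
```

Using `np.searchsorted` on the histogram edges keeps the sample-to-bin mapping consistent with the half-open bins that `np.histogram` uses, which a manual comparison against `edges` can easily get wrong at the last bin.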
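The following approach uses a for loop. For each bin, it checks whether the bin satisfies 3 conditions: (1) the count of the current bin is > 0 and < 50, (2) the bin to its left is empty (or there is no left bin), and (3) the bin to its right is empty (or there is no right bin). If all of these conditions hold, it marks the current bin as isolated.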
import matplotlib.pyplot as plt
import pandas as pd

# vals, edges = np.histogram(df['column'], bins=20)
# obtained result is
vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
edges = [0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
         20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
         40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233]
plt.stem(edges[:-1], vals)
is_isolated = []
for bin_idx in range(len(vals)):
    has_left_bin = bin_idx > 0
    has_right_bin = bin_idx < len(vals) - 1
    # A missing neighbour (first or last bin) counts as empty.
    left_empty = not has_left_bin or vals[bin_idx - 1] == 0
    right_empty = not has_right_bin or vals[bin_idx + 1] == 0
    # Isolated: a small non-zero count with empty neighbours on both sides.
    is_isolated.append(0 < vals[bin_idx] < 50 and left_empty and right_empty)
vdef = pd.DataFrame({'vals': vals, 'edges': edges[:-1], 'is_isolated': is_isolated})
vdef
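The same three conditions can also be checked without an explicit loop. The following is a minimal sketch, not part of the original answer: it pads the counts with a zero on each side so that a missing neighbour is treated as an empty one. The function name `isolated_bins` and the `max_count` parameter are made up for this example; the counts are the ones from the question.

```python
import numpy as np


def isolated_bins(vals, max_count=50):
    """Flag bins with 0 < count < max_count whose neighbours are both empty."""
    vals = np.asarray(vals)
    # Pad with a zero on each side: a missing neighbour counts as empty.
    padded = np.concatenate(([0], vals, [0]))
    left_empty = padded[:-2] == 0   # count of the bin to the left
    right_empty = padded[2:] == 0   # count of the bin to the right
    return (vals > 0) & (vals < max_count) & left_empty & right_empty


vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
mask = isolated_bins(vals)
print(np.flatnonzero(mask))  # indices of the bins flagged as isolated
```

For these counts only the first bin is flagged: the bins holding 1 and 11 samples are small, but each has a non-empty neighbour, so they are left alone.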