Python: detect isolated edges in a histogram to find outliers in time series data


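I am trying to find outliers in my own way. How? Plot a histogram and look for isolated edges (bins) that have a small count and whose neighbouring bins have zero counts. Usually these sit at the far ends of the histogram. They are probably outliers, so I want to detect and discard them. What kind of data is it? Time series from the field. Sometimes, when a sensor fails to transmit data in time, the data logger stores odd numbers (sensor data is around 50-100, while outliers may be -10000 or 1000). These values are transient, may occur a few times in a year of data, and make up less than 1% of the total samples.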

My code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# The result obtained is:
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])

# Repeat the last count once more, because vals always has one entry fewer than edges
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace all zero counts with NaN so that these rows are not flagged
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Identify the isolated edges by their count, say, < 50
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# Plot the histogram
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()

Current output:

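This is not the correct output. Why? There is only one isolated edge, at the very start, at value 0. Yet my code also flags the bins at 43 and 46 as isolated edges, merely because their counts are low.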

vedf = 

      edges     vals    IsolatedEdge?
0   0.000000    38.0    True
1   2.911165    NaN     False
2   5.822330    NaN     False
3   8.733495    NaN     False
4   11.644660   NaN     False
5   14.555825   NaN     False
6   17.466990   NaN     False
7   20.378155   NaN     False
8   23.289320   NaN     False
9   26.200485   NaN     False
10  29.111650   NaN     False
11  32.022815   NaN     False
12  34.933980   NaN     False
13  37.845145   NaN     False
14  40.756310   NaN     False
15  43.667475   1.0     True
16  46.578640   11.0    True
17  49.489805   126664.0    False
18  52.400970   13853.0     False
19  55.312135   4536.0  False
20  58.223300   4536.0  False

Expected output:

vedf = 

      edges     vals    IsolatedEdge?
0   0.000000    38.0    True
1   2.911165    NaN     False
2   5.822330    NaN     False
3   8.733495    NaN     False
4   11.644660   NaN     False
5   14.555825   NaN     False
6   17.466990   NaN     False
7   20.378155   NaN     False
8   23.289320   NaN     False
9   26.200485   NaN     False
10  29.111650   NaN     False
11  32.022815   NaN     False
12  34.933980   NaN     False
13  37.845145   NaN     False
14  40.756310   NaN     False
15  43.667475   1.0     False
16  46.578640   11.0    False
17  49.489805   126664.0    False
18  52.400970   13853.0     False
19  55.312135   4536.0  False
20  58.223300   4536.0  False

Once I know that a particular edge is isolated, I can drop all the samples that fall in that bin.
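A minimal sketch of that removal step, assuming one boolean flag per bin is already available. The sample values, the `is_isolated` array, and the `cleaned` name are made up for illustration; `np.digitize` on the interior edges is used here to map each sample back to its histogram bin.

```python
import numpy as np
import pandas as pd

# Hypothetical samples: mostly normal readings plus two stray values near 0.
df = pd.DataFrame({"column": [0.1, 0.2, 50.3, 51.0, 52.7, 55.9, 57.4, 58.2]})

counts, edges = np.histogram(df["column"], bins=4)

# Assumed flags, one per bin (here only the first bin is treated as isolated).
is_isolated = np.array([True, False, False, False])

# Map every sample to its bin index. Using only the interior edges makes the
# maximum value land in the last bin, as np.histogram does.
bin_idx = np.digitize(df["column"], edges[1:-1])

# Keep only the samples whose bin is not flagged.
cleaned = df[~is_isolated[bin_idx]]
print(cleaned)
```

The flags are hard-coded here only to keep the example focused on the removal; in practice they would come from whatever isolation check is used.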

python pandas dataframe numpy histogram
1 Answer

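This approach uses a for loop. For each bin, it checks whether the bin meets 3 conditions: (1) the current bin's count is > 0 and < 50, (2) the bin to its left is empty (or there is no left bin), and (3) the bin to its right is also empty (or there is no right bin). If all of these conditions are met, it marks the current bin as isolated.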

import matplotlib.pyplot as plt
import pandas as pd

# vals, edges = np.histogram(df['column'], bins=20)
# The result obtained is:
vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
edges = [0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
         20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
         40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233]

plt.stem(edges[:-1], vals)
plt.show()

is_isolated = []
for bin_idx in range(len(vals)):
    has_left_bin = bin_idx > 0
    has_right_bin = bin_idx < len(vals) - 1

    # A missing neighbour (at either end of the histogram) counts as empty.
    left_empty = not has_left_bin or vals[bin_idx - 1] == 0
    right_empty = not has_right_bin or vals[bin_idx + 1] == 0

    # Isolated: a small non-zero count with empty bins on both sides.
    is_isolated.append(0 < vals[bin_idx] < 50 and left_empty and right_empty)

vdef = pd.DataFrame({'vals': vals, 'edges': edges[:-1], 'is_isolated': is_isolated})
print(vdef)
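The same three conditions can also be written without an explicit loop. The following is a sketch, not part of the original answer: the function name `isolated_bins` and the `max_count` parameter are hypothetical, and padding the counts with one empty bin on each side is just one way to treat a missing neighbour as empty.

```python
import numpy as np

def isolated_bins(vals, max_count=50):
    """Flag bins with 0 < count < max_count whose neighbours are empty or absent."""
    vals = np.asarray(vals)
    # Pad with one empty bin on each side so the end bins see an empty neighbour.
    padded = np.pad(vals, 1, constant_values=0)
    left_empty = padded[:-2] == 0   # count of the bin to the left of each bin
    right_empty = padded[2:] == 0   # count of the bin to the right of each bin
    return (vals > 0) & (vals < max_count) & left_empty & right_empty

vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
flags = isolated_bins(vals)
print(np.flatnonzero(flags))  # indices of the isolated bins -> [0]
```

With the counts from the question, only the first bin is flagged; the bins holding 1 and 11 samples are not, because each has a non-empty neighbour.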
