How to merge histogram bins (edges and counts) by a bin-count condition?


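The Problem

I have a histogram of data that I would like to manipulate. More specifically, I want to merge bins whose counts are less than a given threshold. This might be clearer with an example.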

import numpy as np

np.random.seed(327)

data = np.random.normal(loc=50, scale=10, size=100).astype(int)
edges = np.arange(0, 101, 10).astype(int)
counts, edges = np.histogram(data, edges)

# print("\n .. {} DATA:\n{}\n".format(data.shape, data))
# print("\n .. {} EDGES:\n{}\n".format(edges.shape, edges))
# print("\n .. {} COUNTS:\n{}\n".format(counts.shape, counts))

If uncommented, the print commands above output the following:

 .. (100,) DATA:
[67 46 47 32 59 61 49 46 45 72 67 51 41 37 44 56 38 61 45 45 42 39 49 55
 32 35 52 40 55 34 52 51 39 55 50 62 47 43 48 39 53 54 75 38 53 44 46 39
 50 49 31 46 55 64 64 52 41 34 32 33 58 65 38 64 37 47 58 43 49 49 50 57
 71 44 41 39 47 51 47 63 55 52 43 43 49 65 48 43 44 38 64 49 62 41 40 67
 47 55 57 54]


 .. (11,) EDGES:
[  0  10  20  30  40  50  60  70  80  90 100]


 .. (10,) COUNTS:
[ 0  0  0 19 38 26 14  3  0  0]

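Notice that counts suggests that data contains a single peak. Suppose I choose a bin threshold threshold=5, such that any bin containing fewer than 5 counts (0, ..., 4 counts; not including 5) is merged with the next bin. Here, "next" is taken to be in the direction towards the central peak.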
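To make the rule concrete, here is a minimal sketch that flags the bins under the threshold and records which side of the peak each one lies on. It uses the counts printed above; the names `peak`, `below`, `left_of_peak` and `right_of_peak` are illustrative and not part of the original post.

```python
import numpy as np

counts = np.array([0, 0, 0, 19, 38, 26, 14, 3, 0, 0])  # counts printed above
threshold = 5

peak = int(counts.argmax())                 # index of the tallest bin
below = np.flatnonzero(counts < threshold)  # bins holding 0..4 counts

left_of_peak = below[below < peak]    # these would merge rightward, toward the peak
right_of_peak = below[below > peak]   # these would merge leftward, toward the peak

print(peak, left_of_peak, right_of_peak)
```

For this data the peak is bin 4 (the 40 to 50 bin), bins 0 to 2 lie below the threshold on its left, and bins 7 to 9 on its right.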

Desired Output

With my desired merging algorithm, I would obtain the following output:

edges = [30, 40, 50, 60, 80]
counts = [19, 38, 26, 17]

Attempt at Solution

Below is my incorrect attempt at solving this problem:

def agglomerate_bins(edges, counts, threshold):
    condition = (counts >= threshold)
    indices = {}
    indices['all'] = condition
    indices['above'] = np.where(condition == True)[0]
    indices['below'] = np.where(condition != True)[0]
    # merge left-side bins rightward
    left_edges = [edges[0]]
    left_counts = []
    ileft, istop = indices['below'][0], indices['above'][0]
    while ileft < istop:
        cc = counts[ileft]
        while cc < threshold:
            ileft += 1
            cc += counts[ileft]
        ee = edges[ileft]
        left_edges.append(ee)
        left_counts.append(cc)
        ileft += 1
    # merge right-side bins leftward
    right_edges, right_counts = [], []
    iright, istop = indices['below'][-1], indices['above'][-1]
    while iright > istop:
        cc = counts[iright]
        while cc < threshold:
            iright -= 1
            cc += counts[iright]
        ee = edges[iright]
        right_edges.append(ee)
        right_counts.append(cc)
        iright -= 1
    # group modified bins with bins above threshold
    middle_edges = edges[indices['above']].tolist()
    middle_counts = edges[indices['above']].tolist()
    mod_edges = np.array(left_edges + middle_edges + right_edges[::-1])
    mod_counts = np.array(left_counts + middle_counts + right_counts[::-1])
    return mod_edges, mod_counts

mod_edges, mod_counts = agglomerate_bins(edges, counts, threshold=5)

# print("\n .. {} MODIFIED EDGES:\n{}\n".format(mod_edges.shape, mod_edges))
# print("\n .. {} MODIFIED COUNTS:\n{}\n".format(mod_counts.shape, mod_counts))

If uncommented, the print commands above output the following:

 .. (7,) MODIFIED EDGES:
[ 0 30 30 40 50 60 60]


 .. (6,) MODIFIED COUNTS:
[19 30 40 50 60 17]
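The output above is not a valid histogram. The edges 30 and 60 each appear twice, and the middle counts 30, 40, 50, 60 are edge values, because `middle_counts` is built from `edges` rather than `counts`. A small consistency check along the following lines can catch this kind of mistake. The helper name `is_valid_merge` is hypothetical, and the three conditions are only the basic ones (one more edge than counts, strictly increasing edges, total count preserved).

```python
import numpy as np

def is_valid_merge(mod_edges, mod_counts, total):
    """Hypothetical helper: basic consistency checks for merged bins."""
    mod_edges = np.asarray(mod_edges)
    mod_counts = np.asarray(mod_counts)
    one_more_edge = mod_edges.size == mod_counts.size + 1   # n bins need n + 1 edges
    increasing = bool(np.all(np.diff(mod_edges) > 0))       # no repeated or reversed edges
    total_kept = int(mod_counts.sum()) == total             # merging must not change the total
    return one_more_edge and increasing and total_kept

# Output of the attempt above: repeated edges and an inflated total.
print(is_valid_merge([0, 30, 30, 40, 50, 60, 60], [19, 30, 40, 50, 60, 17], 100))  # False

# Desired output from the question.
print(is_valid_merge([30, 40, 50, 60, 80], [19, 38, 26, 17], 100))  # True
```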

python-3.x numpy histogram nested-loops binning
1 Answer
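I think a solution involves iterating through the counts and edges to combine counts and eliminate the "unused" edges. This catches [...,1,2,3,...] => [...,6,...]. counts and edges are converted to lists so that unwanted items can be popped easily; popping is not efficient with np.arrays.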

import numpy as np

np.random.seed(327)

data = np.random.normal(loc=50, scale=10, size=100).astype(int)
edges = np.arange(0, 101, 10).astype(int)
counts, edges = np.histogram(data, edges)

def combine_edges( counts, edges, threshold ):
    max_ix = counts.argmax()
    c_list = list( counts )   # Lists can be popped from
    e_list = list( edges )    # Lists can be popped from

    def eliminate_left( ix ):
        # Sum the count and eliminate the edge relevant to ix
        # Before the peak (max_ix)
        nonlocal max_ix
        max_ix -= 1         # max_ix will change too.
        c_list[ix+1]+=c_list[ix]
        c_list.pop(ix)
        e_list.pop(ix+1)

    def eliminate_right( ix ):
        # Sum the count and eliminate the edge relevant to ix
        # after the peak (max_ix)
        c_list[ix-1]+=c_list[ix]
        c_list.pop(ix)
        e_list.pop(ix)

    def first_lt():
        # Find the first ix less than the threshold
        for ix, ct in enumerate( c_list[:max_ix] ):
            if ct < threshold:
                return ix  # if ct < threshold return the index and exit the function
        # The function only reaches here if no ct values are less than the threshold
        return -1  # If zero items < threshold return -1

    def last_lt():
        # Find the last ix less than the threshold
        for ix, ct in zip( range(len(c_list)-1, max_ix, -1), c_list[::-1]):
            # ix reduces from len(c_list)-1, c_list is accessed in reverse order.
            if ct < threshold:
                return ix
        return -1  # If no items < threshold return -1

    cont = True
    while cont:
        # Each iteration removes any counts less than threshold
        # before the peak. This process would combine e.g. counts of [...,1,2,3,...] into [..., 6, ...]
        ix = first_lt()
        if ix < 0:
            cont = False   # If first_lt returns -1 stop while loop
        else:
            eliminate_left( ix )

    cont = True
    while cont:
        ix = last_lt()
        if ix < 0:
            cont = False   # If last_lt returns -1 stop while loop
        else:
            eliminate_right( ix )

    return np.array( c_list ), np.array( e_list )

c, e = combine_edges( counts, edges, 5)

print( c, '\n', e )
# [19 38 26 17]
#  [  0  40  50  60 100]

cts, edgs = np.histogram(data, e)

print( cts, '\n', edgs )
# [19 38 26 17]
#  [  0  40  50  60 100]
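As a variation on the pop-based approach above, the same merging can be sketched as a single pass from each end toward the peak, carrying sub-threshold counts forward until the running sum reaches the threshold. This is only a sketch under that assumption, not the answer's method. The function name `merge_toward_peak` and its local names are hypothetical, and it has been traced by hand only on the counts from the question.

```python
import numpy as np

def merge_toward_peak(counts, edges, threshold):
    """Sketch: fold bins below threshold into the neighbour nearer the peak."""
    counts = [int(c) for c in counts]
    edges = list(edges)
    peak = int(np.argmax(counts))

    # Left side: walk rightward, emitting a bin once the carried sum is large enough.
    left_counts, left_edges, carry_left = [], [edges[0]], 0
    for i in range(peak):
        carry_left += counts[i]
        if carry_left >= threshold:
            left_counts.append(carry_left)
            left_edges.append(edges[i + 1])   # right edge of the emitted bin
            carry_left = 0

    # Right side: walk leftward from the last bin, mirroring the left pass.
    right_counts, right_edges, carry_right = [], [edges[-1]], 0
    for i in range(len(counts) - 1, peak, -1):
        carry_right += counts[i]
        if carry_right >= threshold:
            right_counts.append(carry_right)
            right_edges.append(edges[i])      # left edge of the emitted bin
            carry_right = 0

    # Whatever is still carried on either side is absorbed by the peak bin.
    peak_count = carry_left + counts[peak] + carry_right
    new_counts = left_counts + [peak_count] + right_counts[::-1]
    new_edges = left_edges + right_edges[::-1]
    return np.array(new_counts), np.array(new_edges)

counts = np.array([0, 0, 0, 19, 38, 26, 14, 3, 0, 0])  # counts from the question
edges = np.arange(0, 101, 10)
new_counts, new_edges = merge_toward_peak(counts, edges, threshold=5)
print(new_counts)  # [19 38 26 17]
print(new_edges)   # [  0  40  50  60 100]
```

On this data the sketch gives the same counts and edges as `combine_edges`. It avoids repeated `pop` calls, at the cost of handling the peak bin as a special case.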
