Return the data indices for all bins whose count is greater than a threshold

Question

I am trying to find the indices of the data points that fall inside a given bin, for data binned like this:

import numpy as np

x=np.random.random(1000)
y=np.random.random(1000)
# The bins are not evenly spaced and not the same number in x and y.
xedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

# np.histogram2d returns (counts, xedges, yedges); keep only the counts in h
h, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])

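I want to find the indices of the points contained in each bin whose count is greater than some threshold (and then plot them, etc.). So each bin with a count above the threshold is a "cluster", and I want to know all the data points (x, y) in that cluster.

I have written in pseudocode how I think this would work.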

thres = 5
mask = (h > thres)

for i in mask:
    # for each bin with count > thres
    # get bin edges for x and y directions

    # find (leftEdge <= x < rightEdge) and (leftEdge <= y < rightEdge)

    # return indices for each True in mask

plt.plot(x[indices], y[indices])

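I tried reading the documentation for some functions, such as scipy.stats.binned_statistic_2d and pandas.DataFrame.groupby, but I could not figure out how to apply them to my data. For binned_statistic_2d, the documentation asks for a parameter values:

The data on which the statistic will be computed. This must be the same shape as x, or a set of sequences, each the same shape as x.

I don't know what to pass in for the data I want computed.

Thanks for any help you can provide on this.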

Answer 1

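If I understand correctly, you want to build a mask over the original points that indicates whether each point belongs to a bin containing more than 5 points.

np.histogram2d returns the count for each bin, but it does not indicate which point went into which bin, so the mask has to be built separately.

You can build such a mask by iterating over every bin that satisfies the condition and adding the indices of all points that fall inside that bin to the mask.

To visualize the result of np.histogram2d, plt.pcolormesh can be used. Drawing the mesh with h > 5 shows all True values in the highest color of the colormap (red) and all False values in the lowest color (blue).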

from matplotlib import pyplot as plt
import numpy as np

x = np.random.uniform(0, 2, 500)
y = np.random.uniform(0, 1, x.shape)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

hist, _xedges, _yedges = np.histogram2d(x, y, bins=[xedges, yedges])

h = hist.T  # np.histogram2d transposes x and y, therefore, transpose the resulting array
thres = 5
desired = h > thres
plt.pcolormesh(xedges, yedges, desired, cmap='coolwarm', ec='white', lw=2)

mask = np.zeros_like(x, dtype=bool)  # start with mask all False (np.bool was removed in NumPy 1.24; the builtin bool works everywhere)
for i in range(len(xedges) - 1):
    for j in range(len(yedges) - 1):
        if desired[j, i]:
            # print(f'x from {xedges[i]} to {xedges[i + 1]} y from {yedges[j]} to {yedges[j + 1]}')
            mask = np.logical_or(mask, (x >= xedges[i]) & (x < xedges[i + 1]) & (y >= yedges[j]) & (y < yedges[j + 1]))
            # plt.scatter(np.random.uniform(xedges[i], xedges[i+1], 100), np.random.uniform(yedges[j], yedges[j+1], 100),
            #             marker='o', color='g', alpha=0.3)
plt.scatter(x, y, marker='o', color='gold', label='initial points')
plt.scatter(x[mask], y[mask], marker='.', color='green', label='filtered points')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
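
As a possible alternative to the double loop above, the bin of each point can be looked up directly with np.digitize, and the counts array can then be indexed per point. This is only a sketch under the same example data (with a seeded generator so it is reproducible); the names ix, iy, inside and indices are introduced here for illustration and are not part of the original answer. One caveat: np.histogram2d counts a point lying exactly on the last edge in the last bin, while np.digitize (like the loop above) treats it as outside.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
y = rng.uniform(0, 1, 500)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])
thres = 5

# hist has shape (len(xedges) - 1, len(yedges) - 1): first axis is x, second is y
hist, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])

# Bin index of every point along each axis.
# After subtracting 1, -1 means "below the first edge" and
# len(edges) - 1 means "at or above the last edge".
ix = np.digitize(x, xedges) - 1
iy = np.digitize(y, yedges) - 1

# Points that fall inside the binned region on both axes
inside = (ix >= 0) & (ix < len(xedges) - 1) & (iy >= 0) & (iy < len(yedges) - 1)

# A point is kept if the bin it falls into holds more than thres points
mask = np.zeros(x.shape, dtype=bool)
mask[inside] = hist[ix[inside], iy[inside]] > thres

# Integer indices into x and y, if indices are preferred over a boolean mask
indices = np.flatnonzero(mask)
```

Both x[mask] and x[indices] select the same points, so either can be passed to plt.scatter as in the code above.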

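Note that in the given example the edges do not cover the full range of the points. Points outside the given edges are not taken into account. To include those points, just extend the edges.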
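
One way to extend the edges is sketched below, assuming it is acceptable to add a single outer bin on each side that reaches the data minimum and maximum. The helper extend_edges is a hypothetical name introduced for this illustration, and the data is the same example data with a seeded generator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
y = rng.uniform(0, 1, 500)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])


def extend_edges(edges, data):
    """Return edges widened so that they span the full range of data."""
    lo = min(edges[0], data.min())
    hi = max(edges[-1], data.max())
    # np.unique sorts and drops duplicates, so the edges are returned
    # unchanged when they already cover the data
    return np.unique(np.concatenate(([lo], edges, [hi])))


xedges_full = extend_edges(xedges, x)
yedges_full = extend_edges(yedges, y)

# With the widened edges every point lands in some bin
hist_full, _, _ = np.histogram2d(x, y, bins=[xedges_full, yedges_full])
```

With these widened edges the counts add up to the number of points. When building the mask with the loop above, keep in mind that np.histogram2d closes the last bin on the right, so the point sitting exactly at the maximum would need <= instead of < in the last bin to be selected.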
