So I have a lot of data in a single, flat array that is grouped into irregularly sized chunks. The sizes of those chunks are given in another array. What I need to do is rearrange the chunks based on a third index array (think fancy indexing).
The chunks are always >= 3 long, usually 4, but technically unbounded, so padding out to a max length and masking isn't feasible. Also, due to technical reasons I only have access to numpy, so nothing like scipy or pandas.
Just to be easier to read, the data in this example is easily grouped. In the real data, the numbers can be anything and do not follow this pattern.
[EDIT] Updated with less confusing data
data = np.array([1,2,3,4, 11,12,13, 21,22,23,24, 31,32,33,34, 41,42,43, 51,52,53,54])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])
The expected output in this case would be:
np.array([1,2,3,4, 51,52,53,54, 41,42,43, 51,52,53,54, 21,22,23,24, 11,12,13])
Since the real data can be millions long, I'm hoping for some kind of numpy magic that can do this without python loops.
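For reference, the operation being asked for can be spelled out as a straightforward loop-based baseline (this sketch is not part of the question; it is exactly the kind of per-chunk Python loop the question hopes to avoid):

```python
import numpy as np

data = np.array([1,2,3,4, 11,12,13, 21,22,23,24, 31,32,33,34, 41,42,43, 51,52,53,54])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])

# Starting offset of each chunk within the flat data array
starts = np.concatenate(([0], np.cumsum(chunkSizes)[:-1]))

# Gather each chunk in the requested order, then glue them back together
out = np.concatenate([data[starts[i]:starts[i] + chunkSizes[i]] for i in newOrder])
print(out)
```

The loop runs once per chunk rather than once per element, so it is already tolerable for moderate sizes, but the answers below avoid the Python-level iteration entirely.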
Approach #1
Here's a vectorized one based on creating a regular 2D array and masking -
def chunk_rearrange(data, chunkSizes, newOrder):
    m = chunkSizes[:,None] > np.arange(chunkSizes.max())
    d1 = np.empty(m.shape, dtype=data.dtype)
    d1[m] = data
    return d1[newOrder][m[newOrder]]
Output for the given sample -

In [4]: chunk_rearrange(data, chunkSizes, newOrder)
Out[4]: array([ 1,  2,  3,  4, 51, 52, 53, 54, 41, 42, 43, 51, 52, 53, 54, 21, 22, 23, 24, 11, 12, 13])
Approach #2
Another vectorized one based on cumsum, meant for those really ragged chunk sizes -
def chunk_rearrange_cumsum(data, chunkSizes, newOrder):
# Setup ID array that will hold specific values at those interval starts,
# such that a final cumsum would lead us to the indices which when indexed
# by the input array gives us the re-arranged o/p
idar = np.ones(len(data), dtype=int)
# New chunk lengths
newlens = chunkSizes[newOrder]
# Original chunk intervals
c = np.r_[0,chunkSizes[:-1].cumsum()]
# Indices from original order that form the interval starts in new arrangement
d1 = c[newOrder]
# Starts of chunks in new arrangement where those from d1 are to be assigned
c2 = np.r_[0,newlens[:-1].cumsum()]
# Offset required for the starts in new arrangement for final cumsum to work
diffs = np.diff(d1)+1-np.diff(c2)
idar[c2[1:]] = diffs
idar[0] = d1[0]
# Final cumsum and indexing leads to desired new arrangement
out = data[idar.cumsum()]
return out
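As a quick sanity check, running the cumsum-based function on the sample arrays from the question reproduces the expected output (the function body is repeated here only so the snippet is self-contained):

```python
import numpy as np

def chunk_rearrange_cumsum(data, chunkSizes, newOrder):
    # ID array whose cumsum yields the gather indices for the new arrangement
    idar = np.ones(len(data), dtype=int)
    newlens = chunkSizes[newOrder]
    c = np.r_[0, chunkSizes[:-1].cumsum()]
    d1 = c[newOrder]
    c2 = np.r_[0, newlens[:-1].cumsum()]
    idar[c2[1:]] = np.diff(d1) + 1 - np.diff(c2)
    idar[0] = d1[0]
    return data[idar.cumsum()]

data = np.array([1,2,3,4, 11,12,13, 21,22,23,24, 31,32,33,34, 41,42,43, 51,52,53,54])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])

out = chunk_rearrange_cumsum(data, chunkSizes, newOrder)
print(out)
```

Everything here is O(n) array work with no per-element Python loop, which is why it scales to the millions-long arrays mentioned in the question.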
You can use np.split to create views into your data array corresponding to the chunkSizes, if you build up the split indices with np.cumsum. You can then reorder the views according to the newOrder indices using fancy indexing. This should be fairly efficient since the data is only copied to the new array when you call np.concatenate on the reordered views:
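The original answer's code did not survive extraction, so the following is a reconstruction of the approach it describes (np.split on the cumulative sizes, fancy indexing via an object array to reorder the views, then a single concatenate):

```python
import numpy as np

data = np.array([1,2,3,4, 11,12,13, 21,22,23,24, 31,32,33,34, 41,42,43, 51,52,53,54])
chunkSizes = np.array([4, 3, 4, 4, 3, 4])
newOrder = np.array([0, 5, 4, 5, 2, 1])

# np.split at the cumulative chunk sizes yields a list of views into data
# (the final split point is implicit, hence the [:-1])
views = np.split(data, np.cumsum(chunkSizes)[:-1])

# Wrapping the ragged list of views in an object array allows fancy indexing
reordered = np.array(views, dtype=object)[newOrder]

# The data is only copied here, when the reordered views are concatenated
out = np.concatenate(reordered)
print(out)
```

Note that the object-array wrapper relies on the chunks being ragged; a plain Python list comprehension over newOrder would work just as well if that feels fragile.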