Numpy parallel indexing using np.ix_

Question

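I have a Python script in which I make heavy use of numpy. I recently learned more about numpy's inherent parallelization, and that one should avoid large for-loops with numpy and rely on its implicit parallel indexing instead.

My script contains one very big for-loop (which in fact contains another for-loop). The loop splits the data into a training set and a test set, manipulates each set independently in some way, fits a model, and repeats. This is a k-fold split with n repetitions.

However, the script is very slow and seems to leak memory when n and k are large. I assume this is also due to the giant for-loop. I would therefore like to get rid of the for-loop and do the splits in a parallel fashion. So far I have not succeeded. Here is what happens in my script:

This is the common setup: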

import numpy as np
from sklearn.model_selection import RepeatedKFold
import ...
#Some data is imported, functions defined, variables pre-allocated.

number_conditions = 10 #Usually bigger.
#Square placeholder array; in the real script this contains real data.
output = np.zeros((number_conditions, number_conditions))
outer_k = 5
outer_reps = 1 # usually much higher
outer_rkf = RepeatedKFold(n_splits=outer_k, n_repeats=outer_reps)


array_for_indices = list(range(number_conditions))

This is the inefficient for-loop version:

for train_indices, test_indices in outer_rkf.split(array_for_indices):

    ixgrid_train = np.ix_(train_indices, train_indices)
    output_train = output[ixgrid_train]

    ixgrid_test = np.ix_(test_indices, test_indices)
    output_test = output[ixgrid_test]

    ...

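This is followed by some heavy data manipulation and another nested for-loop, and the whole thing repeats (outer_k * outer_reps) times.

I wanted to parallelize this and tried the following: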

outer_split_1, outer_split_2, outer_split_3, outer_split_4, outer_split_5 = outer_rkf.split(array_for_indices)

index = np.array([outer_split_1, outer_split_2, outer_split_3, outer_split_4, outer_split_5])

all_train_inds = index[:, 0]
all_test_inds = index[:, 1]

ixgrid_train = np.ix_(all_train_inds, all_train_inds)
output_train = output[ixgrid_train]

But the last line gives me

IndexError: arrays used as indices must be of integer (or boolean) type

That makes sense to me: in the for-loop version, ixgrid_train is a tuple of two int arrays. In the latter version, it is a tuple of two object arrays, which in turn contain arrays.
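To make that difference concrete, here is a minimal sketch. The index values are made up for illustration and are not taken from the script above. It compares what np.ix_ returns for a single split with what it returns when several splits of different lengths are packed into one object array.

```python
import numpy as np

# One split: a plain integer index array (illustrative values only).
train_indices = np.array([0, 2, 3])
grid = np.ix_(train_indices, train_indices)
shapes = [g.shape for g in grid]        # a column vector and a row vector
kinds = [g.dtype.kind for g in grid]    # integer dtype, so indexing works
print(shapes, kinds)

# Several splits of different lengths stored in one object array.
all_splits = np.empty(2, dtype=object)
all_splits[0] = np.array([0, 2, 3])
all_splits[1] = np.array([1, 4])
obj_grid = np.ix_(all_splits, all_splits)
obj_dtypes = [g.dtype for g in obj_grid]  # object dtype, not integer
print(obj_dtypes)

# Using the object-dtype grid as an index is what triggers the error.
try:
    np.ones((5, 5))[obj_grid]
    error_message = None
except IndexError as exc:
    error_message = str(exc)
print(error_message)
```

np.ix_ only reshapes its arguments so that they broadcast against each other; it does not look inside the elements, so the nested arrays end up as opaque objects that cannot serve as indices.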

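I tried making index a tuple or a list instead of an array, and tried indexing index in different ways (which suggests that index itself could probably be improved). However, I never got what I wanted: output_train containing the desired data not just for one split but for all splits, assigned in one go.

The root of my problem is that I do not know how to fundamentally add a dimension to what I am doing. I think I need to turn the dimension I iterate over in the loop version into a dimension of the object I operate on, but I do not know how to do that in a way that yields actual parallelism. The tuples in particular confuse me.

How can I achieve what I want: create all indices for all folds at once, then select the data for every fold at once, so that it can be processed in parallel later on?

I highly appreciate your input! :)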

Answer 1

Does this illustrate your problem:

In [186]: idx=np.array([np.array([1,2,3]), np.array([4,5])], dtype=object)
In [187]: idx                                                                            
Out[187]: array([array([1, 2, 3]), array([4, 5])], dtype=object)
In [188]: np.ix_(idx,idx)                                                                
Out[188]: 
(array([[array([1, 2, 3])],
        [array([4, 5])]], dtype=object),
 array([[array([1, 2, 3]), array([4, 5])]], dtype=object))
In [189]: np.ones((10,10),int)[_]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-189-fa4d7e0befc5> in <module>
----> 1 np.ones((10,10),int)[_]

IndexError: arrays used as indices must be of integer (or boolean) type

The iterative case is

In [190]: [np.arange(100).reshape(10,10)[np.ix_(row,row)] for row in idx]
Out[190]: 
[array([[11, 12, 13],
        [21, 22, 23],
        [31, 32, 33]]),
 array([[44, 45],
        [54, 55]])]

Indexing blocks of different sizes (which may also overlap) means that this cannot be converted into a single n-d indexing operation.

You don't want the union of the blocks:

In [192]: np.arange(100).reshape(10,10)[np.ix_([1,2,3,4,5],[1,2,3,4,5])]                 
Out[192]: 
array([[11, 12, 13, 14, 15],
       [21, 22, 23, 24, 25],
       [31, 32, 33, 34, 35],
       [41, 42, 43, 44, 45],
       [51, 52, 53, 54, 55]])

Nor do you want the flattened elements:

In [193]: np.hstack([i.ravel() for i in _190])                                           
Out[193]: array([11, 12, 13, 21, 22, 23, 31, 32, 33, 44, 45, 54, 55])
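
The limitation above comes from the blocks having different sizes. In the special case where every fold has the same size (for example 10 conditions split into 5 folds), the index arrays can be stacked into regular integer arrays, and one broadcasted indexing operation may then work. The following is only a sketch under that assumption: splits is a hypothetical stand-in for list(outer_rkf.split(array_for_indices)), built here with plain NumPy so that the example is self-contained.

```python
import numpy as np

number_conditions = 10
n_folds = 5
output = np.arange(number_conditions ** 2).reshape(number_conditions, number_conditions)

# Hypothetical stand-in for the (train, test) pairs of a k-fold splitter.
# Every fold has the same size because 10 is divisible by 5.
rng = np.random.default_rng(0)
test_folds = rng.permutation(number_conditions).reshape(n_folds, -1)
splits = [(np.setdiff1d(np.arange(number_conditions), test), np.sort(test))
          for test in test_folds]

# Equal lengths allow stacking into regular 2-d integer arrays.
all_train = np.stack([train for train, _ in splits])  # shape (5, 8)
all_test = np.stack([test for _, test in splits])     # shape (5, 2)

# Broadcasting a (folds, n, 1) index against a (folds, 1, n) index does
# for every fold what np.ix_(row, row) does for a single fold.
output_train = output[all_train[:, :, None], all_train[:, None, :]]  # (5, 8, 8)
output_test = output[all_test[:, :, None], all_test[:, None, :]]     # (5, 2, 2)

print(output_train.shape, output_test.shape)
```

Here output_train[i] equals output[np.ix_(train, train)] for the i-th split. Note that this copies all blocks into memory at once, and it does not by itself make the later model fitting run in parallel. If the fold sizes differ, np.stack fails and the list comprehension shown earlier remains the straightforward option.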
