一种快速有效的方法来测试一个集合是否是其他n个集合的子集?

问题描述 投票:0回答:1

我有以下问题:我有一个物品清单

[O_0,..., O_n]

,其中每一项由二进制幂表示(o_0由2 ^ 0表示,...,o_n由2 ^ n表示)。我已经构造了这些元素的组合的列表(每个组合由项的二进制表示形式的总和表示)。例如,我有

combs = [3, 9, 15, ......].

鉴于这些项的新组合为C_1,我想测试combs中是否包含C_1的任何元素。我想到的一种高效,快捷的方法是为每个元素c_i from combs计算,测试c_i & C_1 == c_i是否意味着该元素正确。速度很快,因为我正在按位进行。我的问题是,除了拥有1个元素C_1之外,我还有很多元素C_1, ..., C_k,并且我必须针对每个条件测试上述条件。所以我想知道是否有比我提到的方法更快的方法来测试所有元素的条件(这实际上与测试一个集合是否是另一个集合的子集是相同的问题,这就是为什么我从中选择二进制表示形式的原因将问题转化为二进制的开始)。

performance binary set subset combinations
1个回答
0
投票

[我对问题的理解:给定kY的集合和mX的集合,我们希望找到S的子集Y,使得对于所有[ C0],存在y in S st x in Xx的子集。

我将假设集合由零和一个表示包含的y-矢量表示。这是设置:

n

import pandas as pd # for drop_duplicates & benchmarks import numpy as np np.random.seed(0) n = 100 # 100 "atomic" elements m = 1000 # small sets k = 1000 # large sets X = pd.DataFrame(np.random.randint(0, 2, size=(m, n))).drop_duplicates().values Y = pd.DataFrame(np.random.randint(0, 2, size=(k, n))).drop_duplicates().values # For each row y in Y, we would like to check if there exists a row x in X # s.t. x represents a subset of Y def naive(Y, X): # O(k^2 + m^2) for i, y in enumerate(Y): for x in X: found_subset = False if (x <= y).all(): yield i found_subset = True if found_subset: break def naive_as_array(Y, X): return np.array(list(naive(Y, X))) 函数在可能满足包含关系的所有对对中进行迭代,并在适当情况下发生短路。运行时为naive,其中O(m + k)

作为替代,我们可以考虑以下递归算法一次处理每个元素(从m = len(combs)1):

n

def contains(Y, X): """ Y : k x n indicator array specifying sets X : m x n indicator array specifying sets output: subset Z of [0..k-1] s.t. i in Z iff there exists x in X s.t. # x is a subset of Y[i]. Z is represented by a 1D numpy array. """ k, n = Y.shape assert Y.shape[1] == X.shape[1] detected = np.zeros(k, dtype=np.bool) inds = np.arange(k) # utility to account for sets that already have a subset detected def account_for_detected(Y, inds): mask = ~detected[inds] Y = Y[mask] inds = inds[mask] return Y, inds # inductively reduce Y.shape[1] (==X.shape[1]) def f(Y, X, inds): if Y.shape[0] == 0 or X.shape[0] == 0: # collection Y is empty; inculsions are impossible return # avoid redundant comparisons by dropping sets y in Y # if it is already known that y contains some element of X Y, inds = account_for_detected(Y, inds) if Y.shape[1] == 1: # Y and X are collections of singletons Y = np.ravel(Y) X = np.ravel(X) X_vals = np.zeros(2, dtype=np.int) X_vals[X] = 1 if X_vals[0] > 0: detected[inds] = True elif X_vals[1] > 0: detected[inds[Y==1]] = True return else: # make a recursive call Ymask = Y[:,0] == 0 Xmask = X[:,0] == 0 # if x in X is a subset of y in Y, x[0] <= y[0] f(Y[Ymask,1:], X[Xmask,1:], inds[Ymask]) # by now, detected is updated in the outer scope # process the remaining Y's f(Y[~Ymask,1:], X[:,1:], inds[~Ymask]) # done # make call at root: f(Y, X, inds) # return indices return np.where(detected)[0] d之间的步骤1,我们将集合n分为YY0,其中Y1包含Y0中不包含元素Y的集合,并且dY1中包含确实包含元素Y的集合。同样,我们定义dX0。一个主要的观察结果是,X1中的集合不能作为X1中的集合的子集出现。因此,我们可以减少递归调用中的比较次数。

时间:

Y0

%timeit contains(Y, X) %timeit naive_as_array(Y, X) 10 loops, best of 3: 185 ms per loop 1 loop, best of 3: 2.39 s per loop

© www.soinside.com 2019 - 2024. All rights reserved.