How to sample rows of a DataFrame so that each group follows a fixed distribution?

Question

I have a DataFrame c with a column a.

import numpy as np
import pandas as pd

a = np.random.randint(0, 10, size=100)
c = pd.DataFrame(a, columns=['a'])

I want to group the rows of c at random so that each group has 5 rows, 1 of which has a < 3.

For example:

[1,2,3,2,10]  <-- good group 
[1,1,3,4,6]  <-- good group
[2,4,7,3,7] <-- bad group

If I run out of rows that meet this criterion (e.g. no rows with a < 1 are left), the rest of the DataFrame should be ignored.

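Currently I do this by creating a new column group_id, splitting c by the condition, and then sampling repeatedly from the two parts until I run out of candidates.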

c['group_id'] = None
c_w_small_a = c[c.a < 3].copy()
c_w_large_a = c[c.a >= 3].copy()
group_id = 0
while len(c_w_small_a) >= 1 and len(c_w_large_a) >= 4:
    small_idx = c_w_small_a.sample(1, replace=False).index
    large_idx = c_w_large_a.sample(4, replace=False).index
    c.loc[small_idx, 'group_id'] = group_id
    c.loc[large_idx, 'group_id'] = group_id
    # remove the sampled rows so they cannot be drawn again
    c_w_small_a = c_w_small_a.drop(small_idx)
    c_w_large_a = c_w_large_a.drop(large_idx)
    group_id += 1

c = c[c.group_id.notna()]  # filter rows without id
c_groups = c.groupby('group_id')

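The problem with this approach is that I cannot generalize it to more complex conditions where the subsets overlap, for example:

At most two rows with a > 2 and at least 1 row with a == 3.

I don't know how to code this so that I end up with the maximum number of groups. For example, if rows with a == 3 are very scarce, I don't want the a > 2 slots to be filled with 3s, even though a 3 satisfies that condition.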

Answer 1

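I'm not sure, but I think the problem you describe is NP-complete, so I suggest coming up with a heuristic that finds a satisfactory solution. You could write a greedy heuristic that looks something like this: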

def is_satisfying(group):
    group = np.asarray(group)
    # reject groups with more than two values > 2 or without any value == 3
    if (np.sum(group > 2) > 2) or (np.sum(group == 3) < 1):
        return False
    else:
        return True

Then, to build a group, you could write something like this:

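group = []
while len(group) != 4:
    # df is the DataFrame being sampled (c in the question); draw one value
    group.append(df['a'].sample(n=1).iloc[0])
    # drop the new value again if the group violates the constraints
    if not is_satisfying(np.array(group)):
        group = group[:-1]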

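To keep track of the elements that have already been added to a group, you can use some data structure that lets you filter the DataFrame before sampling.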
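One way to put these pieces together is sketched below. This is only a rough illustration under a few assumptions, not a definitive implementation: the function name `build_groups`, the boolean mask `used`, and the `seed` parameter are hypothetical names introduced here, and the group size of 5 is taken from the question. Seeding each group with a row where a == 3 and preferring other values for the remaining slots is an added heuristic aimed at the scarcity concern in the question; like any greedy approach, it does not guarantee the maximum number of groups.

```python
import numpy as np
import pandas as pd


def is_satisfying(values):
    """Constraint from the question: at most two values > 2, at least one == 3."""
    values = np.asarray(values)
    return bool(np.sum(values > 2) <= 2 and np.sum(values == 3) >= 1)


def build_groups(df, group_size=5, seed=0):
    """Greedy sketch: returns a Series of group ids (NaN for unused rows)."""
    rng = np.random.default_rng(seed)
    used = pd.Series(False, index=df.index)       # rows already placed in a group
    group_id = pd.Series(np.nan, index=df.index)
    next_id = 0
    while True:
        pool = df.loc[~used, 'a']                 # filter before sampling
        threes = pool[pool == 3].index.to_numpy()
        if len(threes) == 0:
            break                                 # no mandatory 3 left
        members = [rng.choice(threes)]            # seed the group with one 3
        while len(members) < group_size:
            rest = pool.drop(members)
            # keep only rows that leave the partial group valid
            ok = [i for i in rest.index
                  if is_satisfying(df.loc[members + [i], 'a'])]
            # heuristic (assumption): save scarce 3s for seeding later groups
            spare = [i for i in ok if df.at[i, 'a'] != 3]
            ok = spare or ok
            if not ok:
                break
            members.append(rng.choice(ok))
        if len(members) < group_size:
            break                                 # cannot complete another group
        used.loc[members] = True
        group_id.loc[members] = next_id
        next_id += 1
    return group_id


# Usage on data shaped like the question's DataFrame c
c = pd.DataFrame({'a': np.random.default_rng(1).integers(0, 10, size=100)})
c['group_id'] = build_groups(c)
c_groups = c.dropna(subset=['group_id']).groupby('group_id')
```

Checking every remaining candidate against `is_satisfying` avoids the retry loop of pure rejection sampling, at the cost of one pass over the unused rows per pick, which should be fine for a DataFrame of this size.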
