我有一个超过 100 个点的数据集,需要将其分为一定数量的组。每个组都有最大人数,但并非组中的所有位置都需要填补。
example = [7,10,8,2,6,19,34,2,5,45,23,9]
group_count = 5
group_size = 3
根据这些规则将数据集分组,一种可能性如下。 (随机未优化)
group1=[2,19,45]
group2=[5,8,23]
group3=[2,10]
group4=[7,9,34]
group5=[6]
数据集分为 5 组,每组最多 3 个点。每组中的点按顺序依次增加。
评估基于每组中相邻数字对之间的绝对差。评估
group1=[2,19,45]
时,前两个点之间的绝对差为 19-2=17
,第二对点之间的绝对差为 45-19=26
如果我们对所有组执行相同的操作,则最近的两个点是距
9-7=2
的 group4
,最远的点是距 45-19=26
的 group1
。你也可以论证它是 group5
6
因为没有其他对,所以它有无限的分离。目标是优化分选方法以最大化最小分离值。
为了改进这个解决方案,我们可以从
9
中取出group4
并将其添加到group5
中
group1=[2,19,45]
group2=[5,8,23]
group3=[2,10]
group4=[7,34]
group5=[6,9]
现在两点之间新的最小间距已从 2 增加到 3。这是解决方案的改进。 目标是优化分选方法以最大化最小分离值。
这个问题过去已经通过暴力手动排序来解决(不理想!)。希望对此进行改进,但我不确定该看什么方法。
这可以通过 ILP 实现(正如我所演示的,通过 PuLP;matlab 有等效项)。由于您对“无限距离”的评论,最小组分配大小已设置为 2。为您的样本数据生成的解决方案将最小间隙从 3 改进到 5。
import pulp
values = [7, 10, 8, 2, 6, 19, 34, 2, 5, 45, 23, 9]
group_count = 5
group_size = 3
values_size = len(values)
values = tuple(sorted(values))
# The values are not unique, so they need to be given indices
value_idx = range(values_size)
group_idx = range(group_count)
assigns = pulp.LpVariable.matrix(
name='assign_g%d_v%02d', indices=(group_idx, value_idx),
cat=pulp.LpBinary,
)
lowest_diff = pulp.LpVariable(name='diff', cat=pulp.LpContinuous)
prob = pulp.LpProblem(name='sort_binning', sense=pulp.LpMaximize)
prob.setObjective(lowest_diff)
for group, row in zip(group_idx, assigns):
n_assigns = pulp.lpSum(row)
# Due to OP's "unitary group means infinite distance", mandate at least
# two assignments per group
prob.addConstraint(
name=f'group_{group}_min', constraint=n_assigns >= 2,
)
prob.addConstraint(
name=f'group_{group}_max', constraint=n_assigns <= group_size,
)
for value in value_idx:
prob.addConstraint(
name=f'excl_v{value}', constraint=pulp.lpSum(
assigns[group][value] for group in group_idx
) == 1,
)
for start_idx, start_value in zip(value_idx[:-1], values[:-1]):
for end_idx, end_value in zip(value_idx[start_idx+1:], values[start_idx+1:]):
# big-M coefficient based on highest value in sub-sequence
M = 2*(end_idx - start_idx + 1)*end_value
for group, row in zip(group_idx, assigns):
(
start_assign, *inner_excludes, end_assign,
) = row[start_idx: end_idx+1]
# Constrain lowest_diff <= end_value - start_value IFF
# start_assign == 1
# end_assign == 1
# sum(inner_excludes) == 0
prob.addConstraint(
name=f'diff_g{group}_{start_idx:02}_{end_idx:02}',
constraint=lowest_diff <= end_value - start_value
+ M*(
pulp.lpSum(inner_excludes)
+ (1 - start_assign)
+ (1 - end_assign)
)
)
print(prob)
prob.solve()
assert prob.status == pulp.LpStatusOptimal
for group, row in zip(group_idx, assigns):
print(f'Group {group}: ', ', '.join(
str(value)
for value, assign in zip(values, row)
if assign.value() > 0.5
))
...
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 381
Time (CPU seconds): 0.06
Time (Wallclock seconds): 0.07
Option for printingOptions changed from normal to all
Total time (CPU seconds): 0.06 (Wallclock seconds): 0.07
Group 0: 6, 45
Group 1: 5, 10, 19
Group 2: 8, 23, 34
Group 3: 2, 9
Group 4: 2, 7