将值列表分组/排序为 n 组设定大小,以最大化每个值之间的差异

问题描述 投票:0回答:1

我有一个超过 100 个点的数据集,需要将其分为一定数量的组。每个组都有最大人数,但并非组中的所有位置都需要填补。

example = [7,10,8,2,6,19,34,2,5,45,23,9]
group_count = 5
group_size = 3

根据这些规则将数据集分组,一种可能性如下。 (随机未优化)

group1=[2,19,45]
group2=[5,8,23]
group3=[2,10]
group4=[7,9,34]
group5=[6]

数据集分为 5 组,每组最多 3 个点。每组中的点按顺序依次增加。

评估基于每组中相邻数字对之间的绝对差。评估

group1=[2,19,45]
时,前两个点之间的绝对差为
19-2=17
,第二对点之间的绝对差为
45-19=26

如果我们对所有组执行相同的操作,则最近的两个点是距

9-7=2
group4
,最远的点是距
45-19=26
group1
。你也可以论证它是
group5
6
因为没有其他对,所以它有无限的分离。目标是优化分选方法以最大化最小分离值。

为了改进这个解决方案,我们可以从

9
中取出
group4
并将其添加到
group5

group1=[2,19,45]
group2=[5,8,23]
group3=[2,10]
group4=[7,34]
group5=[6,9]

现在两点之间新的最小间距已从 2 增加到 3。这是解决方案的改进。 目标是优化分选方法以最大化最小分离值。

这个问题过去已经通过暴力手动排序来解决(不理想!)。希望对此进行改进,但我不确定该看什么方法。

matlab sorting optimization grouping mathematical-optimization
1个回答
0
投票

这可以通过 ILP 实现(正如我所演示的,通过 PuLP;matlab 有等效项)。由于您对“无限距离”的评论,最小组分配大小已设置为 2。为您的样本数据生成的解决方案将最小间隙从 3 改进到 5。

import pulp

values = [7, 10, 8, 2, 6, 19, 34, 2, 5, 45, 23, 9]
group_count = 5
group_size = 3
values_size = len(values)
values = tuple(sorted(values))

# The values are not unique, so they need to be given indices
value_idx = range(values_size)
group_idx = range(group_count)

assigns = pulp.LpVariable.matrix(
    name='assign_g%d_v%02d', indices=(group_idx, value_idx),
    cat=pulp.LpBinary,
)

lowest_diff = pulp.LpVariable(name='diff', cat=pulp.LpContinuous)
prob = pulp.LpProblem(name='sort_binning', sense=pulp.LpMaximize)
prob.setObjective(lowest_diff)

for group, row in zip(group_idx, assigns):
    n_assigns = pulp.lpSum(row)
    # Due to OP's "unitary group means infinite distance", mandate at least
    # two assignments per group
    prob.addConstraint(
        name=f'group_{group}_min', constraint=n_assigns >= 2,
    )
    prob.addConstraint(
        name=f'group_{group}_max', constraint=n_assigns <= group_size,
    )

for value in value_idx:
    prob.addConstraint(
        name=f'excl_v{value}', constraint=pulp.lpSum(
            assigns[group][value] for group in group_idx
        ) == 1,
    )

for start_idx, start_value in zip(value_idx[:-1], values[:-1]):
    for end_idx, end_value in zip(value_idx[start_idx+1:], values[start_idx+1:]):
        # big-M coefficient based on highest value in sub-sequence
        M = 2*(end_idx - start_idx + 1)*end_value

        for group, row in zip(group_idx, assigns):
            (
                start_assign, *inner_excludes, end_assign,
            ) = row[start_idx: end_idx+1]

            # Constrain lowest_diff <= end_value - start_value IFF
            #    start_assign == 1
            #    end_assign == 1
            #    sum(inner_excludes) == 0
            prob.addConstraint(
                name=f'diff_g{group}_{start_idx:02}_{end_idx:02}',
                constraint=lowest_diff <= end_value - start_value
                + M*(
                    pulp.lpSum(inner_excludes)
                    + (1 - start_assign)
                    + (1 - end_assign)
                )
            )

print(prob)
prob.solve()
assert prob.status == pulp.LpStatusOptimal

for group, row in zip(group_idx, assigns):
    print(f'Group {group}: ', ', '.join(
        str(value)
        for value, assign in zip(values, row)
        if assign.value() > 0.5
    ))
...
Result - Optimal solution found

Objective value:                5.00000000
Enumerated nodes:               0
Total iterations:               381
Time (CPU seconds):             0.06
Time (Wallclock seconds):       0.07

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.06   (Wallclock seconds):       0.07

Group 0:  6, 45
Group 1:  5, 10, 19
Group 2:  8, 23, 34
Group 3:  2, 9
Group 4:  2, 7
© www.soinside.com 2019 - 2024. All rights reserved.