从列表的 n 个二进制数中创建 MAX k 个子列表，以便每个子列表具有尽可能多的匹配位

Question

正如标题所示，我正在尝试创建一个算法，该算法接受长度为 n 的列表并创建 <=k sublists so that each sublist has as much matching bits at same positions as possible. (each sublist should have the largest possible common prefix)

输入的二进制数始终有 4 个字节。您可以将它们视为 IP 地址的二进制表示形式。另请记住，二进制数已经排序。

应该很简单，但我总是搞砸

示例：

k = 3
n = 6
binary_strings = [
    "00000000000000000011011000010000",
    "00000000001000000110000001100111",
    "00000000010110000110001111010111",
    "00000000010111010101010001010000",
    "00000011110111101101011111010000",
    "00000011111110000000111100001000",
]

该输入的输出应该是

00000011110111101101011111010000
00000011111110000000111100001000

00000000010111010101010001010000
00000000010110000110001111010111

00000000000000000011011000010000
00000000001000000110000001100111

k = 2
n = 4
binary_strings = [
    "00101010101010000000000000101010",
    "10000000000000010000000100000000",
    "11000000101010000000000100000000",
    "11000000110001001100010000000000",
]

该示例的输出：

00101010101010000000000000101010

10000000000000010000000100000000
11000000101010000000000100000000
11000000110001001100010000000000

Answer 1

我实际上会提供一个比我之前建议的更有效的解决方案。虽然写了递归函数，然后添加缓存层来memoize就可以了。

这将基于 heap 的思想，它是通过 heapq 算法在 Python 标准库中实现的。

我们的想法是，我们将拥有一堆

(cost, partial_solution)

对。想法是这样的：

while we have not reached the end:
    take the cheapest partial solution in the heap.
    if it reached the end:
        We have our answer!
    else if we have not reached this point before:
       mark that we've reached it to avoid redundant work on worse solutions.
       put extensions to this partial solution back in the heap.

这实际上是A*搜索算法的一个版本，具有启发式

。

这是代码。

import heapq

def group_strings (k, binary_strings):
    # Sanity checks.
    if k < 1:
        if k == 0 and len(binary_strings) == 0:
            return []
        else:
            return None
    elif 0 == len(binary_strings):
        return []


    # Calculate the shared bits each shares with the next one.
    shared_bits = []
    for i in range(len(binary_strings) - 1):
        max_bits = min(len(binary_strings[i]), len(binary_strings[i+1]))
        shared_bits.append(max_bits)
        for j in range(max_bits):
            if binary_strings[i][j] != binary_strings[i+1][j]:
                shared_bits[i] = j
                break

    # A partial solution will look like:
    #
    #   (
    #     start index of last group,
    #     length of last group,
    #     shared bits of last group,
    #     number of groups,
    #     cost of all previous groups,
    #     (previous partial solution or none)
    #   )
    #
    # We can extend a partial solution in two ways.
    #
    # 1. Add one more to the current group.
    #
    # 2. Start a new group. We won't do this after k groups.

    # Start with the first partial solution that we will processed.
    heap = [(
        1/2**len(binary_strings[0]),
        (
            0,    # Start at beginning.
            1,    # 1 element
                  # Shares all of its bits with itself.
            len(binary_strings[0]),
            1,    # Only this group
            0,    # No cost of previous groups.
            None, # No previous group.
        )
    )]

    # We have so far processed nothing.
    seen = set()

    # We will return from inside of this loop when we find our answer.
    while True:
        # Get the cheapest partial solution.
        cost, partial = heapq.heappop(heap)

        # Break the partial solution into its fields.
        i, l, bits, groups, prev_cost, previous = partial

        # Did we reach the end?
        if i + l == len(binary_strings):
            # We just need to extract this solution in a nice form.
            answer = [binary_strings[i:i+l]]
            while previous is not None:
                i, l, bits, groups, prev_cost, previous = previous
                answer.append(binary_strings[i:i+l])

            # Return groups in the same order as binary_strings.
            return list(reversed(answer))

        # This will not lead to a better solution if we previously found
        # something that cost no more which:
        #
        #   - also included the first i+l binary_strings
        #   - with the same number of groups
        #   - the last group had the same number of bits.
        key = (i+l, groups, bits)

        if key not in seen:
            seen.add(key)
            # 1. Try adding one more to the last group.

            # Our last binary_string is at i+l-1.
            # Bits will be the smaller of the current value and
            # what that shares with the next binary string.
            next_bits = min(bits, shared_bits[i+l-1])
            heapq.heappush(
                heap,
                (
                    # The new cost is prev_cost + size/2^next_bits.
                    prev_cost + (l+1) / 2**next_bits,
                    # Mostly the same partial solution that we just had.
                    (
                        i,
                        l+1,        # The new group is one longer
                        next_bits,  # This is how many bits it shares
                        groups,
                        prev_cost,
                        previous,
                    )
                )
            )

            if groups < k:
                # We could instead start a new group of length 1 at i+l.
                heapq.heappush(
                    heap,
                    (
                        # cost is now the prev_cost.
                        cost + 1/2**len(binary_strings[i+l]),
                        (
                            i+l, # Starts where last group ended.
                            1,   # Only one element.
                                 # Shares all bits with itself.
                            len(binary_strings[i+l]),
                            groups + 1,
                            cost,
                            partial,
                        ),
                    )
                )

这是您的测试用例。

import pprint
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(group_strings(3, [
    "00000000000000000011011000010000",
    "00000000001000000110000001100111",
    "00000000010110000110001111010111",
    "00000000010111010101010001010000",
    "00000011110111101101011111010000",
    "00000011111110000000111100001000"]))

pp.pprint(group_strings(2, [
   "00101010101010000000000000101010",
   "10000000000000010000000100000000",
   "11000000101010000000000100000000",
   "11000000110001001100010000000000",
]))

对我来说这给了：

[   ['00000000000000000011011000010000', '00000000001000000110000001100111'],
    ['00000000010110000110001111010111', '00000000010111010101010001010000'],
    ['00000011110111101101011111010000', '00000011111110000000111100001000']]
[   ['00101010101010000000000000101010'],
    [   '10000000000000010000000100000000',
        '11000000101010000000000100000000',
        '11000000110001001100010000000000']]

这看起来正是您想要的。

顺便说一句，部分解决方案的整个想法指向先前的部分解决方案，这是我经常做的事情。它基于链表的思想，并且通常需要像我所做的那样的解码步骤。但它的好处是不需要为每个部分解决方案复制大量数据。

从列表的 n 个二进制数中创建 MAX k 个子列表，以便每个子列表具有尽可能多的匹配位

问题描述投票：0回答：1

1个回答

最新问题

从列表的 n 个二进制数中创建 MAX k 个子列表，以便每个子列表具有尽可能多的匹配位

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1