正如标题所示,我正在尝试创建一个算法,该算法接受长度为 n 的列表并创建 <=k sublists so that each sublist has as much matching bits at same positions as possible. (each sublist should have the largest possible common prefix)
输入的二进制数始终有 4 个字节。您可以将它们视为 IP 地址的二进制表示形式。另请记住,二进制数已经排序。
应该很简单,但我总是搞砸
示例:
k = 3
n = 6
binary_strings = [
"00000000000000000011011000010000",
"00000000001000000110000001100111",
"00000000010110000110001111010111",
"00000000010111010101010001010000",
"00000011110111101101011111010000",
"00000011111110000000111100001000",
]
该输入的输出应该是
00000011110111101101011111010000
00000011111110000000111100001000
00000000010111010101010001010000
00000000010110000110001111010111
00000000000000000011011000010000
00000000001000000110000001100111
k = 2
n = 4
binary_strings = [
"00101010101010000000000000101010",
"10000000000000010000000100000000",
"11000000101010000000000100000000",
"11000000110001001100010000000000",
]
该示例的输出:
00101010101010000000000000101010
10000000000000010000000100000000
11000000101010000000000100000000
11000000110001001100010000000000
我实际上会提供一个比我之前建议的更有效的解决方案。虽然写了递归函数,然后添加缓存层来memoize就可以了。
这将基于 heap 的思想,它是通过 heapq 算法在 Python 标准库中实现的。
我们的想法是,我们将拥有一堆
(cost, partial_solution)
对。想法是这样的:
while we have not reached the end:
take the cheapest partial solution in the heap.
if it reached the end:
We have our answer!
else if we have not reached this point before:
mark that we've reached it to avoid redundant work on worse solutions.
put extensions to this partial solution back in the heap.
这实际上是A*搜索算法的一个版本,具有启发式
0
。
这是代码。
import heapq
def group_strings (k, binary_strings):
# Sanity checks.
if k < 1:
if k == 0 and len(binary_strings) == 0:
return []
else:
return None
elif 0 == len(binary_strings):
return []
# Calculate the shared bits each shares with the next one.
shared_bits = []
for i in range(len(binary_strings) - 1):
max_bits = min(len(binary_strings[i]), len(binary_strings[i+1]))
shared_bits.append(max_bits)
for j in range(max_bits):
if binary_strings[i][j] != binary_strings[i+1][j]:
shared_bits[i] = j
break
# A partial solution will look like:
#
# (
# start index of last group,
# length of last group,
# shared bits of last group,
# number of groups,
# cost of all previous groups,
# (previous partial solution or none)
# )
#
# We can extend a partial solution in two ways.
#
# 1. Add one more to the current group.
#
# 2. Start a new group. We won't do this after k groups.
# Start with the first partial solution that we will processed.
heap = [(
1/2**len(binary_strings[0]),
(
0, # Start at beginning.
1, # 1 element
# Shares all of its bits with itself.
len(binary_strings[0]),
1, # Only this group
0, # No cost of previous groups.
None, # No previous group.
)
)]
# We have so far processed nothing.
seen = set()
# We will return from inside of this loop when we find our answer.
while True:
# Get the cheapest partial solution.
cost, partial = heapq.heappop(heap)
# Break the partial solution into its fields.
i, l, bits, groups, prev_cost, previous = partial
# Did we reach the end?
if i + l == len(binary_strings):
# We just need to extract this solution in a nice form.
answer = [binary_strings[i:i+l]]
while previous is not None:
i, l, bits, groups, prev_cost, previous = previous
answer.append(binary_strings[i:i+l])
# Return groups in the same order as binary_strings.
return list(reversed(answer))
# This will not lead to a better solution if we previously found
# something that cost no more which:
#
# - also included the first i+l binary_strings
# - with the same number of groups
# - the last group had the same number of bits.
key = (i+l, groups, bits)
if key not in seen:
seen.add(key)
# 1. Try adding one more to the last group.
# Our last binary_string is at i+l-1.
# Bits will be the smaller of the current value and
# what that shares with the next binary string.
next_bits = min(bits, shared_bits[i+l-1])
heapq.heappush(
heap,
(
# The new cost is prev_cost + size/2^next_bits.
prev_cost + (l+1) / 2**next_bits,
# Mostly the same partial solution that we just had.
(
i,
l+1, # The new group is one longer
next_bits, # This is how many bits it shares
groups,
prev_cost,
previous,
)
)
)
if groups < k:
# We could instead start a new group of length 1 at i+l.
heapq.heappush(
heap,
(
# cost is now the prev_cost.
cost + 1/2**len(binary_strings[i+l]),
(
i+l, # Starts where last group ended.
1, # Only one element.
# Shares all bits with itself.
len(binary_strings[i+l]),
groups + 1,
cost,
partial,
),
)
)
这是您的测试用例。
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(group_strings(3, [
"00000000000000000011011000010000",
"00000000001000000110000001100111",
"00000000010110000110001111010111",
"00000000010111010101010001010000",
"00000011110111101101011111010000",
"00000011111110000000111100001000"]))
pp.pprint(group_strings(2, [
"00101010101010000000000000101010",
"10000000000000010000000100000000",
"11000000101010000000000100000000",
"11000000110001001100010000000000",
]))
对我来说这给了:
[ ['00000000000000000011011000010000', '00000000001000000110000001100111'],
['00000000010110000110001111010111', '00000000010111010101010001010000'],
['00000011110111101101011111010000', '00000011111110000000111100001000']]
[ ['00101010101010000000000000101010'],
[ '10000000000000010000000100000000',
'11000000101010000000000100000000',
'11000000110001001100010000000000']]
这看起来正是您想要的。
顺便说一句,部分解决方案的整个想法指向先前的部分解决方案,这是我经常做的事情。它基于链表的思想,并且通常需要像我所做的那样的解码步骤。但它的好处是不需要为每个部分解决方案复制大量数据。