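I have a binary file that contains integers at known positions. I want to search it for a specific integer or byte sequence. Since a bytes object behaves much like a string, my first thought was that converting to integers would be more efficient.
Should I convert each byte sequence to an integer before comparing, or compare the bytes directly?
How does the number of bytes used for each int affect this? For example, does using 3 bytes incur the overhead of masking off the odd byte?
To find out, I wrote a small script. The slowest part is generating the random numbers.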
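To make the two options concrete, here is a minimal sketch of what I mean. The data and the names `data`, `target` and `offset` are made up for illustration; it assumes unsigned integers, a fixed width of 3 bytes and big-endian order. The two checks only agree when both sides use the same width and byte order.

```python
# Hypothetical example data: three 3-byte big-endian integers packed back to back.
WIDTH = 3
BYTEORDER = "big"
data = b"".join(n.to_bytes(WIDTH, BYTEORDER) for n in (1, 70_000, 16_777_215))

target = 70_000
offset = 1 * WIDTH  # known position of the second integer

field = data[offset:offset + WIDTH]

# Option 1: convert the field read from the file to an int, then compare ints.
match_as_int = int.from_bytes(field, BYTEORDER) == target

# Option 2: convert the target to bytes, then compare byte sequences.
match_as_bytes = field == target.to_bytes(WIDTH, BYTEORDER)

print(match_as_int, match_as_bytes)  # True True
```

Note that `int.from_bytes` accepts a byte sequence of any length, so no explicit masking is needed in the Python code for the 3-byte case; whether the interpreter pays for it internally is exactly what I want to measure.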
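import random
import time

endianness = "big"
bytes_given_to_int = 3  # width in bytes of each serialized integer

def some_nums(top_end: int):
    # Ascending values plus a little random noise, so that some pairs compare equal.
    return [x + random.randint(0, 42) for x in range(top_end)]

top_end = 9_999_999
# Two lists of byte sequences and two lists of ints, compared pairwise below.
bs = [x.to_bytes(bytes_given_to_int, endianness) for x in some_nums(top_end)]
bs2 = [x.to_bytes(bytes_given_to_int, endianness) for x in some_nums(top_end)]
cs = some_nums(top_end)
cs2 = some_nums(top_end)

# Per-run timings (in seconds) for each comparison strategy.
iicmps = []
bbcmps = []
bicmps = []
ibcmps = []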
for _ in range(10):
    # bytes compared with bytes
    t0 = time.time()
    rs1 = [1 for i in range(top_end) if bs[i] == bs2[i]]
    # Each sum should be roughly equal. Using it here helps instruct the interpreter/compiler
    # not to optimise away the arbitrary task we give it.
    print(sum(rs1))
    bbcmps.append(time.time() - t0)

    # int compared with int
    t0 = time.time()
    rs2 = [1 for i in range(top_end) if cs[i] == cs2[i]]
    print(sum(rs2))
    iicmps.append(time.time() - t0)

    # bytes converted to int, then compared with int
    t0 = time.time()
    rs3 = [1 for i in range(top_end) if int.from_bytes(bs[i], endianness) == cs[i]]
    print(sum(rs3))
    bicmps.append(time.time() - t0)

    # int converted to bytes, then compared with bytes
    t0 = time.time()
    rs4 = [1 for i in range(top_end) if cs[i].to_bytes(bytes_given_to_int, endianness) == bs[i]]
    print(sum(rs4))
    ibcmps.append(time.time() - t0)

print("Comparing byte sequences took {:.2f}s".format(sum(bbcmps) / len(bbcmps)))
print("Comparing ints took {:.2f}s".format(sum(iicmps) / len(iicmps)))
print("Comparing byte sequences (converted to ints) to ints took {:.2f}s".format(sum(bicmps) / len(bicmps)))
print("Comparing ints (converted to byte sequences) to byte sequences took {:.2f}s".format(sum(ibcmps) / len(ibcmps)))
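Comparing byte sequences took 1.54s
Comparing ints took 1.50s
Comparing byte sequences (converted to ints) to ints took 3.53s
Comparing ints (converted to byte sequences) to byte sequences took 3.52s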
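Well, this suggests that byte sequences are first-class objects, as we might have guessed, even if strings are just a veneer over lists: comparing two 3-byte sequences directly costs about the same as comparing two ints, while converting one side first more than doubles the time.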
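Going by the 3-byte timings alone, the per-item conversion is the expensive part, so one way to search would be to convert the target once and compare bytes from then on. A rough sketch under that assumption follows; `find_int`, its parameters and the sample data are hypothetical names of mine, not part of the script above, and it assumes unsigned values.

```python
def find_int(data: bytes, positions, target: int, width: int = 3, byteorder: str = "big"):
    """Return the positions whose `width`-byte field encodes `target`."""
    # Convert the target once, outside the loop, instead of converting every field.
    needle = target.to_bytes(width, byteorder)
    return [pos for pos in positions if data[pos:pos + width] == needle]


# Illustrative data: four 3-byte big-endian integers packed back to back.
values = [5, 42, 7, 42]
data = b"".join(v.to_bytes(3, "big") for v in values)
positions = range(0, len(data), 3)

print(find_int(data, positions, 42))  # [3, 9]
```

Whether this is actually the faster choice depends on the width and the machine, so it is worth timing before relying on it.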
What about other word sizes? Let's try 4.
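Comparing byte sequences took 5.89s
Comparing ints took 1.31s
Comparing byte sequences (converted to ints) to ints took 3.29s
Comparing ints (converted to byte sequences) to byte sequences took 3.20s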
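Hmm, interesting. Almost everything I inferred from the 3-byte case turns out to be wrong: with 4 bytes, comparing byte sequences directly is about 4.5 times slower than comparing ints, and slower even than converting first. I guess we need to run this script ourselves to find out which approach performs best in our own environment.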
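If you want to rerun the measurement for several widths in one go, something along these lines might help. It is only a sketch, not the script that produced the numbers above: `benchmark` and its parameters are names I made up, it uses `timeit` and a much smaller list by default, and it reports the best of several runs rather than the mean, so absolute timings will not match.

```python
import random
import timeit


def benchmark(width: int, count: int = 100_000, repeat: int = 5, byteorder: str = "big"):
    """Time four comparison strategies for `width`-byte unsigned integers.

    Returns a dict mapping a strategy label to its best time in seconds.
    """
    if count + 42 > 2 ** (8 * width):
        raise ValueError("values would not fit in the requested width")

    def some_nums():
        # Same shape of test data as the original script: index plus small noise.
        return [x + random.randint(0, 42) for x in range(count)]

    ints_a, ints_b = some_nums(), some_nums()
    bytes_a = [n.to_bytes(width, byteorder) for n in some_nums()]
    bytes_b = [n.to_bytes(width, byteorder) for n in some_nums()]

    strategies = {
        "bytes == bytes": lambda: sum(x == y for x, y in zip(bytes_a, bytes_b)),
        "int == int": lambda: sum(x == y for x, y in zip(ints_a, ints_b)),
        "from_bytes(bytes) == int": lambda: sum(
            int.from_bytes(x, byteorder) == y for x, y in zip(bytes_a, ints_a)
        ),
        "int.to_bytes() == bytes": lambda: sum(
            x.to_bytes(width, byteorder) == y for x, y in zip(ints_a, bytes_a)
        ),
    }
    # timeit.repeat accepts a callable; number=1 runs each strategy once per repeat.
    return {
        label: min(timeit.repeat(fn, number=1, repeat=repeat))
        for label, fn in strategies.items()
    }


if __name__ == "__main__":
    for width in (3, 4, 8):
        print(f"width={width}")
        for label, seconds in benchmark(width).items():
            print(f"  {label}: {seconds:.4f}s")
```

Taking the minimum over repeats is a deliberate choice here to reduce noise from other processes; switch to a mean if you want figures comparable with the ones I reported.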