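I have to analyze a large amount of data with Python 3 (the PyPy implementation). I perform some operations on fairly large floats and must check whether the results are close enough to integers.
For example, say I generate random pairs of numbers and check whether they form Pythagorean triples (that is, whether they are the legs of a right triangle whose sides are all integers):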
from math import hypot
from pprint import pprint
from random import randrange
from time import time
def gen_rand_tuples(start, stop, amount):
    '''
    Generates random integer pairs and converts them to tuples of floats.
    '''
    for _ in range(amount):
        yield (float(randrange(start, stop)), float(randrange(start, stop)))
t0 = time()
## Results are those pairs that result in integer hypotenuses, or
## at least very close ones, to within 1e-12.
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := hypot(*t)) - int(h)) < 1e-12]
print('Results found:')
pprint(results)
print('finished in:', round(time() - t0, 2), 'seconds.')
Running it, I get:
Python 3.9.17 (a61d7152b989, Aug 13 2023, 10:27:46)
[PyPy 7.3.12 with GCC 13.2.1 20230728 (Red Hat 13.2.1-1)] on linux
Type "help", "copyright", "credits" or "license()" for more information.
>>>
===== RESTART: /home/user/Downloads/pythagorean_test_floats.py ====
Results found:
[(2176124225.0, 2742331476.0),
(342847595.0, 3794647043.0),
(36.0, 2983807908.0),
(791324089.0, 2122279232.0)]
finished in: 2.64 seconds.
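Interesting: it runs fast, processing 10 million data points in a little over 2 seconds, and I even found some matches. The hypotenuses certainly appear to be integers: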
>>> pprint([hypot(*x) for x in results])
[3500842551.0, 3810103759.0, 2983807908.0, 2265008378.0]
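But that is not the case. Checking the results with the arbitrary-precision decimal module shows that they are not actually close enough to integers: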
>>> from decimal import Decimal
>>> pprint([(x[0]*x[0] + x[1]*x[1]).sqrt() for x in (tuple(map(Decimal, x)) for x in results)])
[Decimal('3500842551.000000228516418075'),
Decimal('3810103758.999999710375341513'),
Decimal('2983807908.000000217172157183'),
Decimal('2265008377.999999748566051441')]
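So I think the problem is that the numbers are large enough to fall in the range where Python floats lack precision, which produces false positives.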
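To put a number on that, here is a small sketch (assuming Python 3.9+, where `math.ulp` is available) that measures the spacing between adjacent floats near one of the hypotenuses reported above. The helper name `float_spacing` is made up for illustration.

```python
import math

def float_spacing(x):
    """Return the gap between x and the next representable float (one ulp)."""
    return math.ulp(x)

# One of the float hypotenuses from the run above.
h = 3500842551.0
spacing = float_spacing(h)

# h lies between 2**31 and 2**32, so the spacing is 2**(31 - 52) = 2**-21,
# which is about 4.77e-07.
print(spacing)

# The requested tolerance is far finer than anything a float can
# distinguish at this magnitude.
print(spacing > 1e-12)
```

Near 3.5e9 the spacing is 2**-21, roughly 4.8e-7, which is the same order as the deviations the Decimal check reports. A float result at this magnitude therefore cannot resolve a 1e-12 tolerance.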
Now, we can simply change the program to use arbitrary-precision decimals everywhere:
from decimal import Decimal
from pprint import pprint
from random import randrange
from time import time
def dec_hypot(x, y):
    return (x*x + y*y).sqrt()
def gen_rand_tuples(start, stop, amount):
    '''
    Generates random integer pairs and converts them to tuples of decimals.
    '''
    for _ in range(amount):
        yield (Decimal(randrange(start, stop)), Decimal(randrange(start, stop)))
t0 = time()
## Results are those pairs that result in integer hypotenuses, or
## at least very close ones, to within 1e-12.
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dec_hypot(*t)) - h.to_integral_value()) < Decimal(1e-12)]
print('Results found:')
pprint(results)
print('finished in:', round(time() - t0, 2), 'seconds.')
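Now we get no false positives, but performance takes a big hit. What used to take a little over 2 seconds now takes more than 100. It seems decimals are not JIT-friendly: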
====== RESTART: /home/user/Downloads/pythagorean_test_dec.py ======
Results found:
[]
finished in: 113.82 seconds.
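I found this answer to the question "CPython and PyPy Decimal operation performance", which suggests using double-double numbers as a faster, JIT-friendly alternative to decimals that gives better precision than built-in floats. So I pip-installed the third-party doubledouble module and changed the program accordingly: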
from doubledouble import DoubleDouble
from decimal import Decimal
from pprint import pprint
from random import randrange
from time import time
def dd_hypot(x, y):
    return (x*x + y*y).sqrt()
def gen_rand_tuples(start, stop, amount):
    for _ in range(amount):
        yield (DoubleDouble(randrange(start, stop)), DoubleDouble(randrange(start, stop)))
t0 = time()
print('Results found:')
results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dd_hypot(*t)) - int(h)) < DoubleDouble(1e-12)]
pprint(results)
print('finished in:', round(time() - t0, 2), 'seconds.')
But I get this error:
======= RESTART: /home/user/Downloads/pythagorean_test_dd.py ======
Results found:
Traceback (most recent call last):
  File "/home/user/Downloads/pythagorean_test_dd.py", line 24, in <module>
    results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dd_hypot(*t)) - int(h)) < DoubleDouble(1e-12)]
  File "/home/user/Downloads/pythagorean_test_dd.py", line 24, in <listcomp>
    results = [t for t in gen_rand_tuples(1, 2**32, 10_000_000) if abs((h := dd_hypot(*t)) - int(h)) < DoubleDouble(1e-12)]
TypeError: int() argument must be a string, a bytes-like object or a number, not 'DoubleDouble'
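I think the problem is that the module does not define a conversion to int or a round-to-nearest-integer method. The best I could write is an extremely contrived "int" function that rounds a double-double to the nearest integer by round-tripping through strings and decimals and back to DoubleDouble: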
def contrived_int(dd):
    rounded_str = (Decimal(dd.x) + Decimal(dd.y)).to_integral_value()
    hi = float(rounded_str)
    lo = float(Decimal(rounded_str) - Decimal(hi))
    return DoubleDouble(hi, lo)
But it is very roundabout, defeats the purpose of avoiding decimals, and makes the program even slower than the all-decimal version.
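So I ask: is there a fast way to round a double-double number directly to the nearest integer, without the intermediate step of going through decimals or strings?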
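To make the goal concrete, here is a rough sketch of the kind of rounding I have in mind, written against plain `(hi, lo)` float pairs rather than the doubledouble API. It assumes the pair is finite and normalized (`|lo|` at most half an ulp of `hi`), which is what the `x` and `y` attributes used in `contrived_int` appear to hold. The names `dd_nint` and `dd_distance_to_int` are hypothetical, and none of this has been checked against the library itself.

```python
import math

def dd_nint(hi, lo):
    """Round the double-double value hi + lo to the nearest integer.

    Returns the result as a (hi, lo) pair of floats. Assumes a finite,
    normalized input pair.
    """
    r = float(round(hi))
    if r == hi:
        # The high word is already an integer, so any fraction lives
        # entirely in the low word: round that one instead.
        rl = float(round(lo))
        # Renormalize with a quick two-sum so that s + e == r + rl exactly.
        s = r + rl
        e = rl - (s - r)
        return s, e
    # The high word has a fractional part, so the low word is far too
    # small to matter unless hi sits exactly on a .5 tie.
    if abs(r - hi) == 0.5:
        if lo > 0.0:
            r = float(math.floor(hi)) + 1.0  # just above the tie: round up
        elif lo < 0.0:
            r = float(math.floor(hi))        # just below the tie: round down
    return r, 0.0

def dd_distance_to_int(hi, lo):
    """Return |value - nearest integer| as an ordinary float."""
    n_hi, n_lo = dd_nint(hi, lo)
    return abs((hi - n_hi) + (lo - n_lo))

# Hypothetical usage with one of the false positives above, split by hand
# into a high word and the low-order remainder shown by Decimal.
print(dd_distance_to_int(3500842551.0, 2.28516418075e-07) < 1e-12)
```

With a real `DoubleDouble` value `h`, the call would presumably look like `dd_distance_to_int(h.x, h.y)`, but that depends on how the library stores its two words.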
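This is not an answer to the question you asked directly, but here is at least one way to check whether an integer of any size is a perfect square (I am sure there are faster ways, but this should always work and has logarithmic complexity):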
def is_square(n):
    low = 0
    high = 1
    while high * high <= n:
        low = high
        high *= 2
    while low < high:
        mid = (low + high) >> 1
        if mid * mid == n:
            return True
        if mid * mid > n:
            high = mid
        else:
            low = mid + 1
    return False
This just performs a binary search.
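As one possibly faster variant of the same idea, the standard library's `math.isqrt` (Python 3.8+) computes an exact integer square root, so the pairs could be kept as integers and tested without any floating-point step. This is only a sketch: the function names `is_square_isqrt` and `is_pythagorean_pair` are made up here, and I have not timed it under PyPy.

```python
from math import isqrt

def is_square_isqrt(n):
    """Return True if the integer n is a perfect square."""
    if n < 0:
        return False
    root = isqrt(n)  # exact floor of the square root, for ints of any size
    return root * root == n

def is_pythagorean_pair(a, b):
    """Return True if integer legs a and b give an integer hypotenuse."""
    return is_square_isqrt(a * a + b * b)

print(is_pythagorean_pair(3, 4))
# One of the float false positives from the question, checked exactly.
print(is_pythagorean_pair(36, 2983807908))
```

Because everything stays in integer arithmetic, the result is exact and no tolerance such as 1e-12 is needed.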