Correctly rounded double-precision division

Question (2 votes, 1 answer)

I am using the algorithm below for double-precision division and am trying to round it correctly in a software emulation of floating point. Let a be the dividend and b the divisor.

All operations are performed in the Q2.62 fixed-point format.
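For concreteness, here is a minimal sketch (my own illustration, not the asker's code) of how a normal double's significand maps into Q2.62: the 53-bit significand with its implicit bit included represents a value in [1,2), so it only needs to be shifted up by 10 bits to obtain 2 integer bits and 62 fraction bits.

#include <stdint.h>
#include <string.h>

/* Illustrative only: convert the significand of a normal double into Q2.62
   (2 integer bits, 62 fraction bits). The 53-bit significand m, implicit bit
   made explicit, represents the real value m * 2^-52 in [1,2), so in Q2.62
   it is simply m << (62 - 52). */
static uint64_t significand_to_q2_62 (double a)
{
    uint64_t bits;
    memcpy (&bits, &a, sizeof bits);              /* reinterpret as raw bits */
    uint64_t mant = bits & 0x000fffffffffffffULL; /* 52 stored mantissa bits */
    mant |= 0x0010000000000000ULL;                /* add the implicit bit    */
    return mant << 10;                            /* rescale to Q2.62        */
}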

The initial approximation of the reciprocal is: [formula image missing from the original post]

b/2 is the significand of b with its implicit bit added, shifted right by one. For what follows, whenever a or b is written, it means the significand of a or b with the implicit bit added.

[The expression in the missing image] is approximated by 0x17504f333f9de6 (0x5D413CCCFE779800 in Q2.62).

After that, the reciprocal r is refined with Newton-Raphson iterations: [iteration formula image missing from the original post]. There are 6 such iterations. The quotient q is computed by multiplying r by (the significand of) a. There is an additional adjustment step for the quotient: [formula image missing from the original post]
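For reference, the missing iteration image presumably shows a quadratically converging Newton-Raphson step of the usual form r_{i+1} = r_i * (2 - b * r_i). A minimal Q2.62 sketch of one such step (assuming a compiler that provides the unsigned __int128 extension; not necessarily the asker's exact formulation):

#include <stdint.h>

/* One Newton-Raphson step for the reciprocal in Q2.62. r and b are Q2.62
   values; the product of two Q2.62 numbers is Q4.124, so we keep the high
   part and shift by 62 to return to Q2.62. Requires unsigned __int128
   (gcc/clang extension). */
static uint64_t nr_step_q2_62 (uint64_t r, uint64_t b)
{
    const uint64_t two_q2_62 = 2ULL << 62;                       /* 2.0 in Q2.62 */
    uint64_t br = (uint64_t)(((unsigned __int128)b * r) >> 62);  /* b * r        */
    uint64_t e  = two_q2_62 - br;                                /* 2 - b*r      */
    return (uint64_t)(((unsigned __int128)r * e) >> 62);         /* r * (2-b*r)  */
}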

Finally, the result is rounded as follows:

if a <= (a - q * b/2):
    result = final_biased_exponent | q
else:
    result = final_biased_exponent | adjusted_q

This works correctly except in two cases: a) the result is subnormal, or b) a and b are both subnormal. In those cases it does not round correctly; the result is off by 1 ulp (compared with the x86 result). (a and b are normalized, and when either a or b is normalized, the exponent is adjusted accordingly.) What can I do to round correctly in these cases as well? It seems to me that precision is lost at step x5: at step x4 the reciprocal is about 54 bits wide, so it fits in a 64-bit variable, but at step x5 the reciprocal approximation is ~108 bits wide, so its full width no longer fits in 64 bits. I have the feeling that I am not accounting for this properly when I truncate the 128-bit product of the multiplication down to 64 bits.
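As a side note on that last point: one way to avoid discarding the low-order bits is to keep both halves of the 128-bit product explicitly. A minimal sketch, assuming a compiler with the unsigned __int128 extension (the function name umul64wide is just an illustration, not part of the asker's code; the answer below uses a portable 32-bit split instead):

#include <stdint.h>

/* Full 64x64 -> 128-bit product, returned as separate high and low halves.
   Keeping the low half available is what lets later steps (or the final
   rounding) see bits that a plain 64-bit truncation would throw away. */
static void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;  /* exact 128-bit product */
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}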
floating-point ieee-754 numerical-analysis numerical-computing
1 Answer (0 votes)
The rounding step basically subtracts the product of the preliminary quotient and the divisor from the dividend to form a remainder rem_raw. It also forms the remainder rem_inc that results from incrementing the quotient by 1 ulp. By construction we know that the preliminary quotient is accurate enough that either the preliminary quotient or its incremented value is the correctly rounded result. The remainders can both be positive, both be negative, or be mixed negative/positive. The remainder of smaller magnitude corresponds to the correctly rounded quotient.
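In symbols (x is the dividend, y the divisor, q the preliminary quotient, and ulp one unit in the last place of q), this amounts to:

rem_raw = x - q * y
rem_inc = x - (q + ulp) * y = rem_raw - ulp * y
result  = (|rem_inc| < |rem_raw|) ? (q + ulp) : q    (ties, possible only for subnormal results, broken to even)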

The only difference between rounding normal and subnormal results (apart from the denormalization step inherent in the latter) is that a tie case cannot occur for normal results, whereas a tie case can occur for subnormal results. See, for example:

Miloš D. Ercegovac and Tomás Lang, "Digital Arithmetic", Morgan Kaufmann, 2004, p. 452

When computing with fixed-point arithmetic, the product of the preliminary quotient and the divisor is a double-length product. However, because we know that the preliminary quotient is very close to the true result, we know that all the high-order bits will cancel in the subtraction from the dividend. Therefore, to compute the two remainders we only need to compute and subtract the low-order product bits.
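As a minimal sketch of that observation (variable naming follows the code below, but this is just the rounding tail in isolation, not the answer's exact code): once the dividend significand x has been aligned with the product q*y, and q's ulp is 1 at that scale, both remainders fit in 64-bit signed integers even though the full product would not.

#include <stdint.h>
#include <stdlib.h>   /* llabs */

/* Sketch of the rounding tail in isolation. x is the dividend significand
   already aligned to the same fixed-point scale as q*y, and q is the
   preliminary quotient whose ulp is 1 at that scale. Because q is within
   one ulp of the true quotient, the high-order bits of x and q*y cancel,
   so only the low 64 bits of each product are needed. */
static uint64_t round_quotient (uint64_t x, uint64_t y, uint64_t q)
{
    int64_t rem_raw = (int64_t)(x - y * q);   /* remainder for quotient q     */
    int64_t rem_inc = (int64_t)(rem_raw - y); /* remainder for quotient q + 1 */
    int inc = llabs (rem_inc) < llabs (rem_raw);  /* smaller magnitude wins   */
    return q + inc;   /* correctly rounded; no tie in the normal-result case  */
}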

Because dividing two values in [1,2) yields a quotient in (0.5, 2), the computation of the quotient may involve a normalization step to get back into the interval [1,2), together with a corresponding exponent adjustment. We need to take this into account when aligning the dividend with the product of quotient and divisor for the subtraction; see normalization_shift in the code below.

Since the code below is exploratory in nature, it was not written with extreme optimization in mind. Various tweaks are possible, for example in the exponent handling, as is replacing portable code with platform-specific intrinsics or inline assembly.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>   /* CHAR_BIT, used by the test framework */

#define TEST_FP32_DIV (0)
#define PURELY_RANDOM (1)
#define PATTERN_BASED (2)
#define TEST_MODE     (PATTERN_BASED)

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t umul32hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a*b) >> 32);
}

int clz32 (uint32_t a)
{
    uint32_t r = 32;
    if (a >= 0x00010000) { a >>= 16; r -= 16; }
    if (a >= 0x00000100) { a >>= 8; r -= 8; }
    if (a >= 0x00000010) { a >>= 4; r -= 4; }
    if (a >= 0x00000004) { a >>= 2; r -= 2; }
    r -= a - (a & (a >> 1));
    return r;
}

#define LOG2_TAB_ENTRIES (7)
#define TAB_ENTRIES (1 << LOG2_TAB_ENTRIES)
#define TAB_ENTRY_BITS (8)

/* approximation error ~= 5.585e-3 */
const uint8_t rcp_tab [TAB_ENTRIES] =
{
    0xff, 0xfd, 0xfb, 0xf9, 0xf7, 0xf5, 0xf4, 0xf2,
    0xf0, 0xee, 0xed, 0xeb, 0xe9, 0xe8, 0xe6, 0xe4,
    0xe3, 0xe1, 0xe0, 0xde, 0xdd, 0xdb, 0xda, 0xd8,
    0xd7, 0xd5, 0xd4, 0xd3, 0xd1, 0xd0, 0xcf, 0xcd,
    0xcc, 0xcb, 0xca, 0xc8, 0xc7, 0xc6, 0xc5, 0xc4,
    0xc2, 0xc1, 0xc0, 0xbf, 0xbe, 0xbd, 0xbc, 0xbb,
    0xba, 0xb9, 0xb8, 0xb7, 0xb6, 0xb5, 0xb4, 0xb3,
    0xb2, 0xb1, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab,
    0xaa, 0xa9, 0xa8, 0xa8, 0xa7, 0xa6, 0xa5, 0xa4,
    0xa3, 0xa3, 0xa2, 0xa1, 0xa0, 0x9f, 0x9f, 0x9e,
    0x9d, 0x9c, 0x9c, 0x9b, 0x9a, 0x99, 0x99, 0x98,
    0x97, 0x97, 0x96, 0x95, 0x95, 0x94, 0x93, 0x93,
    0x92, 0x91, 0x91, 0x90, 0x8f, 0x8f, 0x8e, 0x8e,
    0x8d, 0x8c, 0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x89,
    0x88, 0x88, 0x87, 0x87, 0x86, 0x85, 0x85, 0x84,
    0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80
};

#define FP32_MANT_BITS (23)
#define FP32_EXPO_BITS (8)
#define FP32_SIGN_MASK (0x80000000u)
#define FP32_MANT_MASK (0x007fffffu)
#define FP32_EXPO_MASK (0x7f800000u)
#define FP32_MAX_EXPO (0xff)
#define FP32_EXPO_BIAS (127)
#define FP32_INFTY (0x7f800000u)
#define FP32_QNAN_BIT (0x00400000u)
#define FP32_QNAN_INDEFINITE (0xffc00000u)
#define FP32_MANT_INT_BIT (0x00800000u)

uint32_t fp32_div_core (uint32_t x, uint32_t y)
{
    uint32_t expo_x, expo_y, expo_r, sign_r;
    uint32_t abs_x, abs_y, f, l, p, r, z, s;
    int x_is_zero, y_is_zero, normalization_shift;

    expo_x = (x & FP32_EXPO_MASK) >> FP32_MANT_BITS;
    expo_y = (y & FP32_EXPO_MASK) >> FP32_MANT_BITS;
    sign_r = (x ^ y) & FP32_SIGN_MASK;
    abs_x = x & ~FP32_SIGN_MASK;
    abs_y = y & ~FP32_SIGN_MASK;
    x_is_zero = (abs_x == 0);
    y_is_zero = (abs_y == 0);

    if ((expo_x == FP32_MAX_EXPO) || (expo_y == FP32_MAX_EXPO) ||
        x_is_zero || y_is_zero) {
        int x_is_nan = (abs_x > FP32_INFTY);
        int x_is_inf = (abs_x == FP32_INFTY);
        int y_is_nan = (abs_y > FP32_INFTY);
        int y_is_inf = (abs_y == FP32_INFTY);
        if (x_is_nan) {
            r = x | FP32_QNAN_BIT;
        } else if (y_is_nan) {
            r = y | FP32_QNAN_BIT;
        } else if ((x_is_zero && y_is_zero) || (x_is_inf && y_is_inf)) {
            r = FP32_QNAN_INDEFINITE;
        } else if (x_is_zero || y_is_inf) {
            r = sign_r;
        } else if (y_is_zero || x_is_inf) {
            r = sign_r | FP32_INFTY;
        }
    } else {
        /* normalize any subnormals */
        if (expo_x == 0) {
            s = clz32 (abs_x) - FP32_EXPO_BITS;
            x = x << s;
            expo_x = expo_x - (s - 1);
        }
        if (expo_y == 0) {
            s = clz32 (abs_y) - FP32_EXPO_BITS;
            y = y << s;
            expo_y = expo_y - (s - 1);
        }
        expo_r = expo_x - expo_y + FP32_EXPO_BIAS;
        /* extract mantissas */
        x = x & FP32_MANT_MASK;
        y = y & FP32_MANT_MASK;
        /* initial approx based on 7 most significant stored mantissa bits */
        r = rcp_tab [y >> (FP32_MANT_BITS - LOG2_TAB_ENTRIES)];
        /* make implicit integer bit of mantissa explicit */
        x = x | FP32_MANT_INT_BIT;
        y = y | FP32_MANT_INT_BIT;
        /* pre-scale y for more efficient fixed-point computation */
        z = y << FP32_EXPO_BITS;
        /* first NR iteration r1 = 2*r0-y*r0*r0 */
        f = r * r;
        p = umul32hi (z, f << (32 - 2 * TAB_ENTRY_BITS));
        r = (r << (32 - TAB_ENTRY_BITS)) - p;
        /* second NR iteration: r2 = r1*(2-y*r1) */
        p = umul32hi (z, r << 1);
        f = 0u - p;
        r = umul32hi (f, r << 1);
        /* compute quotient as wide product x*(1/y) = x*r */
        l = x * (r << 1);
        r = umul32hi (x, r << 1);
        /* normalize mantissa to be in [1,2) */
        normalization_shift = (r & FP32_MANT_INT_BIT) == 0;
        if (normalization_shift) {
            expo_r -= 1;
            r = (r << 1) | (l >> (32 - 1));
        }
        if ((expo_r > 0) && (expo_r < FP32_MAX_EXPO)) { /* result is normal */
            int32_t rem_raw, rem_inc, inc;
            /* align x with product y*quotient */
            x = x << (FP32_MANT_BITS + normalization_shift + 1);
            /* compute product y*quotient */
            y = y << 1;
            p = y * r;
            /* compute x - y*quotient, for both raw and incremented quotient */
            rem_raw = x - p;
            rem_inc = rem_raw - y;
            /* round to nearest: tie case _cannot_ occur */
            inc = abs (rem_inc) < abs (rem_raw);
            /* build final result from sign, exponent, mantissa */
            r = sign_r | (((expo_r - 1) << FP32_MANT_BITS) + r + inc);
        } else if ((int)expo_r >= FP32_MAX_EXPO) { /* result overflowed */
            r = sign_r | FP32_INFTY;
        } else { /* result underflowed */
            int denorm_shift = 1 - expo_r;
            if (denorm_shift > (FP32_MANT_BITS + 1)) { /* massive underflow */
                r = sign_r;
            } else {
                int32_t rem_raw, rem_inc, inc;
                /* denormalize quotient */
                r = r >> denorm_shift;
                /* align x with product y*quotient */
                x = x << (FP32_MANT_BITS + normalization_shift + 1 - denorm_shift);
                /* compute product y*quotient */
                y = y << 1;
                p = y * r;
                /* compute x - y*quotient, for both raw & incremented quotient */
                rem_raw = x - p;
                rem_inc = rem_raw - y;
                /* round to nearest or even: tie case _can_ occur */
                inc = ((abs (rem_inc) < abs (rem_raw)) ||
                       (abs (rem_inc) == abs (rem_raw) && (r & 1)));
                /* build final result from sign and mantissa */
                r = sign_r | (r + inc);
            }
        }
    }
    return r;
}

float fp32_div (float a, float b)
{
    uint32_t x, y, r;
    x = float_as_uint32 (a);
    y = float_as_uint32 (b);
    r = fp32_div_core (x, y);
    return uint32_as_float (r);
}

uint64_t double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double uint64_as_double (uint64_t a)
{
    double r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint64_t)(uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint64_t)(uint32_t)b;
    uint64_t b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    uint32_t cy = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
    return p3 + (p1 >> 32) + (p2 >> 32) + cy;
}

int clz64 (uint64_t a)
{
    uint64_t r = 64;
    if (a >= 0x100000000ULL) { a >>= 32; r -= 32; }
    if (a >= 0x000010000ULL) { a >>= 16; r -= 16; }
    if (a >= 0x000000100ULL) { a >>= 8; r -= 8; }
    if (a >= 0x000000010ULL) { a >>= 4; r -= 4; }
    if (a >= 0x000000004ULL) { a >>= 2; r -= 2; }
    r -= a - (a & (a >> 1));
    return r;
}

#define FP64_MANT_BITS (52)
#define FP64_EXPO_BITS (11)
#define FP64_EXPO_MASK (0x7ff0000000000000ULL)
#define FP64_SIGN_MASK (0x8000000000000000ULL)
#define FP64_MANT_MASK (0x000fffffffffffffULL)
#define FP64_MAX_EXPO (0x7ff)
#define FP64_EXPO_BIAS (1023)
#define FP64_INFTY (0x7ff0000000000000ULL)
#define FP64_QNAN_BIT (0x0008000000000000ULL)
#define FP64_QNAN_INDEFINITE (0xfff8000000000000ULL)
#define FP64_MANT_INT_BIT (0x0010000000000000ULL)

uint64_t fp64_div_core (uint64_t x, uint64_t y)
{
    uint64_t expo_x, expo_y, expo_r, sign_r;
    uint64_t abs_x, abs_y, f, l, p, r, z, s;
    int x_is_zero, y_is_zero, normalization_shift;

    expo_x = (x & FP64_EXPO_MASK) >> FP64_MANT_BITS;
    expo_y = (y & FP64_EXPO_MASK) >> FP64_MANT_BITS;
    sign_r = (x ^ y) & FP64_SIGN_MASK;
    abs_x = x & ~FP64_SIGN_MASK;
    abs_y = y & ~FP64_SIGN_MASK;
    x_is_zero = (abs_x == 0);
    y_is_zero = (abs_y == 0);

    if ((expo_x == FP64_MAX_EXPO) || (expo_y == FP64_MAX_EXPO) ||
        x_is_zero || y_is_zero) {
        int x_is_nan = (abs_x > FP64_INFTY);
        int x_is_inf = (abs_x == FP64_INFTY);
        int y_is_nan = (abs_y > FP64_INFTY);
        int y_is_inf = (abs_y == FP64_INFTY);
        if (x_is_nan) {
            r = x | FP64_QNAN_BIT;
        } else if (y_is_nan) {
            r = y | FP64_QNAN_BIT;
        } else if ((x_is_zero && y_is_zero) || (x_is_inf && y_is_inf)) {
            r = FP64_QNAN_INDEFINITE;
        } else if (x_is_zero || y_is_inf) {
            r = sign_r;
        } else if (y_is_zero || x_is_inf) {
            r = sign_r | FP64_INFTY;
        }
    } else {
        /* normalize any subnormals */
        if (expo_x == 0) {
            s = clz64 (abs_x) - FP64_EXPO_BITS;
            x = x << s;
            expo_x = expo_x - (s - 1);
        }
        if (expo_y == 0) {
            s = clz64 (abs_y) - FP64_EXPO_BITS;
            y = y << s;
            expo_y = expo_y - (s - 1);
        }
        expo_r = expo_x - expo_y + FP64_EXPO_BIAS;
        /* extract mantissas */
        x = x & FP64_MANT_MASK;
        y = y & FP64_MANT_MASK;
        /* initial approx based on 7 most significant stored mantissa bits */
        r = rcp_tab [y >> (FP64_MANT_BITS - LOG2_TAB_ENTRIES)];
        /* make implicit integer bit of mantissa explicit */
        x = x | FP64_MANT_INT_BIT;
        y = y | FP64_MANT_INT_BIT;
        /* pre-scale y for more efficient fixed-point computation */
        z = y << FP64_EXPO_BITS;
        /* first NR iteration r1 = 2*r0-y*r0*r0 */
        f = r * r;
        p = umul64hi (z, f << (64 - 2 * TAB_ENTRY_BITS));
        r = (r << (64 - TAB_ENTRY_BITS)) - p;
        /* second NR iteration: r2 = r1*(2-y*r1) */
        p = umul64hi (z, r << 1);
        f = 0u - p;
        r = umul64hi (f, r << 1);
        /* third NR iteration: r3 = r2*(2-y*r2) */
        p = umul64hi (z, r << 1);
        f = 0u - p;
        r = umul64hi (f, r << 1);
        /* compute quotient as wide product x*(1/y) = x*r */
        l = x * (r << 1);
        r = umul64hi (x, r << 1);
        /* normalize mantissa to be in [1,2) */
        normalization_shift = (r & FP64_MANT_INT_BIT) == 0;
        if (normalization_shift) {
            expo_r -= 1;
            r = (r << 1) | (l >> (64 - 1));
        }
        if ((expo_r > 0) && (expo_r < FP64_MAX_EXPO)) { /* result is normal */
            int64_t rem_raw, rem_inc;
            int inc;
            /* align x with product y*quotient */
            x = x << (FP64_MANT_BITS + 1 + normalization_shift);
            /* compute product y*quotient */
            y = y << 1;
            p = y * r;
            /* compute x - y*quotient, for both raw and incremented quotient */
            rem_raw = x - p;
            rem_inc = rem_raw - y;
            /* round to nearest: tie case _cannot_ occur */
            inc = llabs (rem_inc) < llabs (rem_raw);
            /* build final result from sign, exponent, mantissa */
            r = sign_r | (((expo_r - 1) << FP64_MANT_BITS) + r + inc);
        } else if ((int)expo_r >= FP64_MAX_EXPO) { /* result overflowed */
            r = sign_r | FP64_INFTY;
        } else { /* result underflowed */
            int denorm_shift = 1 - expo_r;
            if (denorm_shift > (FP64_MANT_BITS + 1)) { /* massive underflow */
                r = sign_r;
            } else {
                int64_t rem_raw, rem_inc;
                int inc;
                /* denormalize quotient */
                r = r >> denorm_shift;
                /* align x with product y*quotient */
                x = x << (FP64_MANT_BITS + 1 + normalization_shift - denorm_shift);
                /* compute product y*quotient */
                y = y << 1;
                p = y * r;
                /* compute x - y*quotient, for both raw & incremented quotient */
                rem_raw = x - p;
                rem_inc = rem_raw - y;
                /* round to nearest or even: tie case _can_ occur */
                inc = ((llabs (rem_inc) < llabs (rem_raw)) ||
                       (llabs (rem_inc) == llabs (rem_raw) && (r & 1)));
                /* build final result from sign and mantissa */
                r = sign_r | (r + inc);
            }
        }
    }
    return r;
}

double fp64_div (double a, double b)
{
    uint64_t x, y, r;
    x = double_as_uint64 (a);
    y = double_as_uint64 (b);
    r = fp64_div_core (x, y);
    return uint64_as_double (r);
}

#if TEST_FP32_DIV

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629,
                kiss_jsr=362436069, kiss_jcong=123456789;
#define znew (kiss_z=36969*(kiss_z&0xffff)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&0xffff)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17),kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+13579)
#define KISS ((MWC^CONG)+SHR3)

uint32_t v[8192];

int main (void)
{
    uint64_t count = 0;
    float a, b, res, ref;
    uint32_t ires, iref, diff;
    uint32_t i, j, patterns, idx = 0, nbrBits = sizeof (v[0]) * CHAR_BIT;

    printf ("testing fp32 division\n");

    /* pattern class 1: 2**i */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = ((uint32_t)1 << i);
        idx++;
    }
    /* pattern class 2: 2**i-1 */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = (((uint32_t)1 << i) - 1);
        idx++;
    }
    /* pattern class 3: 2**i+1 */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = (((uint32_t)1 << i) + 1);
        idx++;
    }
    /* pattern class 4: 2**i + 2**j */
    for (i = 0; i < nbrBits; i++) {
        for (j = 0; j < nbrBits; j++) {
            v [idx] = (((uint32_t)1 << i) + ((uint32_t)1 << j));
            idx++;
        }
    }
    /* pattern class 5: 2**i - 2**j */
    for (i = 0; i < nbrBits; i++) {
        for (j = 0; j < nbrBits; j++) {
            v [idx] = (((uint32_t)1 << i) - ((uint32_t)1 << j));
            idx++;
        }
    }
    /* pattern class 6: MAX_UINT/(2**i+1) rep. blocks of i zeros and i ones */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = ((~(uint32_t)0) / (((uint32_t)1 << i) + 1));
        idx++;
    }
    patterns = idx;
    /* pattern class 7: one's complement of pattern classes 1 through 6 */
    for (i = 0; i < patterns; i++) {
        v [idx] = ~v [i];
        idx++;
    }
    /* pattern class 8: two's complement of pattern classes 1 through 6 */
    for (i = 0; i < patterns; i++) {
        v [idx] = ~v [i] + 1;
        idx++;
    }
    patterns = idx;

#if TEST_MODE == PURELY_RANDOM
    printf ("using purely random test vectors\n");
#elif TEST_MODE == PATTERN_BASED
    printf ("using pattern-based test vectors\n");
    printf ("#patterns = %u\n", patterns);
#endif // TEST_MODE

    do {
#if TEST_MODE == PURELY_RANDOM
        a = uint32_as_float (KISS);
        b = uint32_as_float (KISS);
#elif TEST_MODE == PATTERN_BASED
        a = uint32_as_float ((v [KISS % patterns] & FP32_MANT_MASK) |
                             (KISS & ~FP32_MANT_MASK));
        b = uint32_as_float ((v [KISS % patterns] & FP32_MANT_MASK) |
                             (KISS & ~FP32_MANT_MASK));
#endif // TEST_MODE
        ref = a / b;
        res = fp32_div (a, b);
        ires = float_as_uint32 (res);
        iref = float_as_uint32 (ref);
        diff = (ires > iref) ? (ires - iref) : (iref - ires);
        if (diff) {
            printf ("a=% 15.6a b=% 15.6a res=% 15.6a ref=% 15.6a\n",
                    a, b, res, ref);
            return EXIT_FAILURE;
        }
        count++;
        if ((count & 0xffffff) == 0) {
            printf ("\r%llu", count);
        }
    } while (1);
    return EXIT_SUCCESS;
}

#else /* TEST_FP32_DIV */

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly good enough to serve
  alone. The components are:
      Multiply-With-Carry (MWC), period (2^121+2^63-1)
      Xorshift (XSH), period 2^64-1
      Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
               kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
               kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
               kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

uint64_t v[32768];

int main (void)
{
    uint64_t ires, iref, diff, count = 0;
    double a, b, res, ref;
    uint32_t i, j, patterns, idx = 0, nbrBits = sizeof (v[0]) * CHAR_BIT;

    printf ("testing fp64 division\n");

    /* pattern class 1: 2**i */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = ((uint64_t)1 << i);
        idx++;
    }
    /* pattern class 2: 2**i-1 */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = (((uint64_t)1 << i) - 1);
        idx++;
    }
    /* pattern class 3: 2**i+1 */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = (((uint64_t)1 << i) + 1);
        idx++;
    }
    /* pattern class 4: 2**i + 2**j */
    for (i = 0; i < nbrBits; i++) {
        for (j = 0; j < nbrBits; j++) {
            v [idx] = (((uint64_t)1 << i) + ((uint64_t)1 << j));
            idx++;
        }
    }
    /* pattern class 5: 2**i - 2**j */
    for (i = 0; i < nbrBits; i++) {
        for (j = 0; j < nbrBits; j++) {
            v [idx] = (((uint64_t)1 << i) - ((uint64_t)1 << j));
            idx++;
        }
    }
    /* pattern class 6: MAX_UINT/(2**i+1) rep. blocks of i zeros and i ones */
    for (i = 0; i < nbrBits; i++) {
        v [idx] = ((~(uint64_t)0) / (((uint64_t)1 << i) + 1));
        idx++;
    }
    patterns = idx;
    /* pattern class 7: one's complement of pattern classes 1 through 6 */
    for (i = 0; i < patterns; i++) {
        v [idx] = ~v [i];
        idx++;
    }
    /* pattern class 8: two's complement of pattern classes 1 through 6 */
    for (i = 0; i < patterns; i++) {
        v [idx] = ~v [i] + 1;
        idx++;
    }
    patterns = idx;

#if TEST_MODE == PURELY_RANDOM
    printf ("using purely random test vectors\n");
#elif TEST_MODE == PATTERN_BASED
    printf ("using pattern-based test vectors\n");
    printf ("#patterns = %u\n", patterns);
#endif // TEST_MODE

    do {
#if TEST_MODE == PURELY_RANDOM
        a = uint64_as_double (KISS64);
        b = uint64_as_double (KISS64);
#elif TEST_MODE == PATTERN_BASED
        a = uint64_as_double ((v [KISS64 % patterns] & FP64_MANT_MASK) |
                              (KISS64 & ~FP64_MANT_MASK));
        b = uint64_as_double ((v [KISS64 % patterns] & FP64_MANT_MASK) |
                              (KISS64 & ~FP64_MANT_MASK));
#endif // TEST_MODE
        ref = a / b;
        res = fp64_div (a, b);
        ires = double_as_uint64 (res);
        iref = double_as_uint64 (ref);
        diff = (ires > iref) ? (ires - iref) : (iref - ires);
        if (diff) {
            printf ("a=% 23.13a b=% 23.13a res=% 23.13a ref=% 23.13a\n",
                    a, b, res, ref);
            return EXIT_FAILURE;
        }
        count++;
        if ((count & 0xffffff) == 0) {
            printf ("\r%llu", count);
        }
    } while (1);
    return EXIT_SUCCESS;
}
#endif /* TEST_FP32_DIV */
