How do I reduce the precision of a double in C?

Question

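I am trying to reduce the precision of a double variable in C to test the effect on my results. I tried applying a bitwise & to it, but that gives an error.

How can I do this on float and double variables?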

Answer 1

If you want to apply a bitwise AND (&), you need to apply it to the integer representation of the float value:

#include <stdio.h>
#include <string.h>

int main(void) {
  float f = 0.1f;
  printf("Before: %a %.16e\n", f, f);
  unsigned int i;
  _Static_assert(sizeof f == sizeof i, "pick integer type of the correct size");
  memcpy(&i, &f, sizeof i);  // copy the float's bit pattern into the integer
  i &= ~0x3U; // or any other mask.
              // This one assumes the endianness of floats is identical to integers'
  memcpy(&f, &i, sizeof f);  // copy the masked bits back into the float
  printf("After: %a %.16e\n", f, f);
  return 0;
}

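Note that this does not give you the values of a narrower IEEE 754-like format. The value in f is first rounded to a 32-bit single-precision number and only then brutally truncated.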
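Since the question also asks about double, the same masking idea can be sketched for a 64-bit value. The helper name truncate_low_bits and its drop parameter are hypothetical, not part of the answer, and the sketch assumes that double is a 64-bit IEEE 754 binary64 value stored with the same byte order as 64-bit integers.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: clears the `drop` lowest significand bits of x,
// which truncates the value toward zero.
// Assumptions: 64-bit IEEE 754 double, same endianness as uint64_t,
// and 0 <= drop <= 52.
double truncate_low_bits(double x, unsigned drop) {
  uint64_t bits;
  _Static_assert(sizeof x == sizeof bits, "double must be 64 bits wide");
  memcpy(&bits, &x, sizeof bits);        // read the bit pattern
  bits &= ~((UINT64_C(1) << drop) - 1);  // zero the lowest `drop` bits
  memcpy(&x, &bits, sizeof x);           // write the masked pattern back
  return x;
}
```

For example, truncate_low_bits(0.1, 48) keeps only the top four stored significand bits of 0.1 and returns 0x1.9p-4. Like the float version, this truncates a value that was already rounded to double.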

A more elegant method relies on a floating-point constant with two bits set:

float f = 0.1f;
float factor = 5.0f; // or 3, or 9, or 17
float c = factor * f;
f = c - (c - f);
printf("After: %a %.16e\n", f, f);

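The advantage of this method is that it rounds f to the nearest value with an N-bit significand, instead of truncating it toward zero as the first method does. However, the program is still computing with 32-bit IEEE 754 floating-point and then rounding to fewer bits, so the results are still not equivalent to what a narrower floating-point type would produce.

The second method relies on an idea of Dekker's, described online in this article.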
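As an illustration of how the constant relates to the number of bits kept, here is a sketch of the same trick for double, with the factor written as 2^s + 1. The name round_to_fewer_bits and the parameter s are hypothetical. The sketch assumes IEEE 754 binary64 arithmetic in round-to-nearest mode, 2 <= s <= 51, and that factor * x does not overflow. The volatile qualifiers are an addition meant to keep the compiler from fusing or simplifying the operations.

```c
#include <stdint.h>

// Hypothetical helper: rounds x to the nearest double whose significand
// fits in (53 - s) bits.
// Assumptions: IEEE 754 double, round-to-nearest mode, 2 <= s <= 51,
// and |x| small enough that factor * x stays finite.
double round_to_fewer_bits(double x, int s) {
  // factor = 2^s + 1 has exactly two bits set and is exactly representable.
  volatile double factor = (double)(UINT64_C(1) << s) + 1.0;
  volatile double c = factor * x;  // rounding here discards the low bits of x
  volatile double d = c - x;       // approximately 2^s * x
  return c - d;                    // the high (53 - s) bits of x
}
```

With s = 48, round_to_fewer_bits(0.1, 48) yields 0x1.ap-4, the nearest value with a 5-bit significand, whereas truncation would give 0x1.9p-4.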

Answer 2

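How do I reduce the precision of a double in C?

To reduce the relative precision of a floating-point number, so that some of the least significant bits of the significand (mantissa) are zero, the code needs access to the significand.

Use frexp() to extract the significand and exponent of the FP number.

Use ldexp() to scale the significand, then round, truncate, or floor, depending on coding goals, to remove precision. Truncation is shown, but I recommend rounding via rint().

Scale back down and restore the exponent.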

#include <math.h>
#include <stdio.h>

double reduce(double x, int precision_power_2) {
  if (isfinite(x)) {
    int power_2;

    // The frexp functions break a floating-point number into a 
    // normalized fraction and an integral power of 2.
    double normalized_fraction = frexp(x, &power_2);  // 0.5 <= result < 1.0 or 0

    // The ldexp functions multiply a floating-point number by an integral power of 2
    double less_precise = trunc(ldexp(normalized_fraction, precision_power_2));
    x = ldexp(less_precise, power_2 - precision_power_2);
  }
  return x;
}

void testr(double x, int pow2) {
  printf("reduce(%a, %d --> %a\n", x, pow2, reduce(x, pow2));
}

int main(void) {
  testr(0.1, 5);
  return 0;
}

Output

//       v-53 bin.digs-v             v-v 5 significant binary digits  
reduce(0x1.999999999999ap-4, 5 --> 0x1.9p-4

Use frexpf(), ldexpf(), rintf(), truncf(), floorf(), etc. for float.
