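Is it possible to implement IEEE 754 conformant floating-point arithmetic based on ARM pseudocode using finite-precision floating-point arithmetic?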

Question

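Background: floating-point arithmetic is usually implemented using integer arithmetic (for example, Berkeley SoftFloat). In the ARM pseudocode [1], floating-point arithmetic is implemented using infinite-precision floating-point arithmetic (type real).

My 32-bit floating-point arithmetic model is written in C and is based on the ARM pseudocode. The type real is implemented using finite-precision floating-point arithmetic: 64-bit double, 80-bit long double (on x86_64), or 128-bit long double (on AArch64):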

typedef double Real;
//typedef long double Real;

While testing it, I noticed some failures. Most of them are related to missing Inexact and/or Underflow exceptions. In some cases the result is off by +/-1 bit.

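Background: in contrast to implementations based on integer arithmetic (which check whether certain bits are non-zero), the ARM pseudocode function FPRoundBase computes error: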

// Get the unrounded mantissa as an integer, and the "units in last place" rounding error.
int_mant = RoundDown(mantissa * 2.0^F);  // < 2.0^F if biased_exp == 0, >= 2.0^F if not
error = mantissa * 2.0^F - Real(int_mant);
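
For illustration, the two pseudocode lines above might be transcribed to C as follows when Real is double. This is a minimal sketch, not the asker's actual code: the struct MantSplit and the function round_base_error are hypothetical names, and the sketch assumes that mantissa is non-negative (so truncation equals RoundDown) and that F is at most 52.

```c
#include <stdint.h>

typedef double Real;

/* Hypothetical result pair; the field names mirror the pseudocode variables. */
typedef struct {
    int64_t int_mant;  /* RoundDown(mantissa * 2.0^F) */
    Real    error;     /* part dropped by RoundDown, in [0, 1) */
} MantSplit;

/* Assumes 0 <= mantissa < 2 and 0 <= F <= 52. */
MantSplit round_base_error(Real mantissa, int F)
{
    MantSplit r;
    /* Multiplying by a power of two is exact as long as nothing
       overflows or underflows, so no information is lost here. */
    Real scaled = mantissa * (Real)((uint64_t)1 << F);
    /* For non-negative values, conversion to integer truncates,
       which is the same as rounding down. */
    r.int_mant = (int64_t)scaled;
    r.error    = scaled - (Real)r.int_mant;
    return r;
}
```

Under these assumptions the steps inside the function are themselves exact, so any information missing from error must already have been lost before mantissa was passed in.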

Whether the Inexact and/or Underflow exceptions are raised depends on this error:

if !altfp && biased_exp == 0 && (error != 0.0 || trapped_UF) then
    if fpexc then FPProcessException(FPExc_Underflow, fpcr);
...
if error != 0.0 then
    if fpexc then FPProcessException(FPExc_Inexact, fpcr);
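
As a rough C rendering of the fragment above, the flag decision could look like the sketch below. It rests on several assumptions: the flag constants and the function name flags_from_error are made up for illustration, the fpexc guard and the call to FPProcessException are replaced by returning a bit mask, and the part elided by "..." in the pseudocode is left out.

```c
/* Hypothetical flag bits, chosen only for this sketch. */
enum {
    FLAG_INEXACT   = 1u << 0,
    FLAG_UNDERFLOW = 2u << 0
};

/* Returns the set of flags that the quoted pseudocode fragment would signal.
   error      : rounding error computed by FPRoundBase
   biased_exp : biased exponent of the result (0 means subnormal range)
   altfp      : non-zero if the alternative FP behaviour is selected
   trapped_uf : non-zero if Underflow trapping is enabled */
unsigned flags_from_error(double error, int biased_exp, int altfp, int trapped_uf)
{
    unsigned flags = 0;

    if (!altfp && biased_exp == 0 && (error != 0.0 || trapped_uf))
        flags |= FLAG_UNDERFLOW;

    if (error != 0.0)
        flags |= FLAG_INEXACT;

    return flags;
}
```

Both flags hinge on the single comparison error != 0.0, which is why a rounding error that has been lost earlier silently suppresses them.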

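My problem: in some cases error is zero where it is expected to be non-zero, which leads to missing Inexact and/or Underflow exceptions. Note, however, that in these cases the numerical result is correct. Here is an example for x + y: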

x                        -4.96411207e-35         0x8683f7ff
y                        -3.98828101             0xc07f3fff
x after FPUnpack         -4.9641120695506692e-35 0xb8d07effe0000000
y after FPUnpack         -3.9882810115814209     0xc00fe7ffe0000000
x+y                      -3.9882810115814209     0xc00fe7ffe0000000
=== FPRoundBase ===
op                       -3.9882810115814209     0xc00fe7ffe0000000 
exponent                 1
min_exp                  -126
biased_exp               128
int_mant                 16728063
mantissa                 1.9941405057907104      0x3fffe7ffe0000000
frac_size                23
error                    0                       0x0
===

Here we see that error is zero, whereas it is expected to be non-zero.
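
One way to see where the information goes: in the trace, x+y is bit-for-bit equal to y after FPUnpack, so the double addition has already rounded x away before FPRoundBase runs. The sketch below makes that rounding error visible with Knuth's TwoSum error-free transformation. This technique is not part of the ARM pseudocode or of the original post, and the function names are hypothetical. It assumes round-to-nearest, no overflow, and a compiler that neither reassociates floating-point expressions (no -ffast-math) nor evaluates double with excess precision.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a binary32 value. */
float f32_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Knuth's TwoSum: returns s = fl(a + b) and stores in *err the exact
   rounding error of that addition, so that a + b == s + *err holds
   exactly in real arithmetic. */
double two_sum(double a, double b, double *err)
{
    double s  = a + b;
    double bb = s - a;              /* the part of b that made it into s */
    *err = (a - (s - bb)) + (b - bb);
    return s;
}
```

For the operands of the example, calling two_sum with (double)f32_from_bits(0x8683f7ff) and (double)f32_from_bits(0xc07f3fff) returns a sum equal to y and leaves err non-zero (it equals x). That non-zero err is the inexactness that error in FPRoundBase no longer sees.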

If we multiply 1.9941405057907104 by 2^23, we get 16728062.9999999995871232, which rounds to 16728063, and 16728063 - 16728063 is 0.

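I tried locally increasing the precision when computing error: some failures were fixed, but new ones appeared. I also tried some other "quirks and tweaks" with the same outcome: some failures fixed, new failures introduced.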

Note that all operations on Real (i.e. double) are performed with FE_TONEAREST.

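Finally, I started to wonder: is it possible at all to implement IEEE 754 conformant 32-bit (for example) floating-point arithmetic based on ARM pseudocode using finite-precision floating-point arithmetic?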

[1] Exploration Tools (section "Arm A64 Instruction Set Architecture", button "Download XML"), file ISA_A64_xml_A_profile-2023-03/ISA_A64_xml_A_profile-2023-03/xhtml/shared_pseudocode.html.

UPD0. I noticed that 128-bit long double produces 50% fewer failures than 64-bit double.

UPD1. "Error-free" meant "IEEE 754 conformant". Changed to "IEEE 754 conformant".

Answer 1

I started using GNU MPFR:

typedef mpfr_t          Real;

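Tests show that:

it is possible to implement IEEE 754 conformant 32-bit (for example) floating-point arithmetic based on ARM pseudocode using finite-precision floating-point arithmetic;
the minimum MPFR precision needed to reach the "IEEE 754 conformant" property differs from one FP operation to another. Examples are to be added.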
