Fastest implementation of the natural exponential function using SSE

Question

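I am looking for an approximation of the natural exponential function that operates on SSE elements, namely __m128 exp( __m128 x ).

I have an implementation that is fast, but its accuracy seems to be very low: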

static inline __m128 FastExpSse(__m128 x)
{
    __m128 a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
    __m128  m87 = _mm_set1_ps(-87);
    // fast exponential function, x should be in [-87, 87]
    __m128 mask = _mm_cmpge_ps(x, m87);

    __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
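To see where the error comes from, the per-lane arithmetic of the function above can be sketched in scalar C. The scaled input is written straight into the exponent and mantissa bits of a float, so the mantissa ends up as a linear interpolation between neighbouring powers of two. The names fast_exp_scalar, bits_to_float, and rel_err are hypothetical helpers, not part of the original code, and the sketch truncates where the SSE version rounds to nearest.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 32-bit integer as an IEEE-754 float. */
static float bits_to_float(int32_t i)
{
    float f;
    assert(sizeof f == sizeof i);
    memcpy(&f, &i, sizeof f);
    return f;
}

/* Scalar sketch of one lane of FastExpSse (hypothetical helper).
   As in the original, x is assumed to stay at or below 87. */
static float fast_exp_scalar(float x)
{
    const float a = 12102203.2f;                /* (1 << 23) / ln(2) */
    const int32_t b = 127 * (1 << 23) - 486411; /* exponent bias minus shift */

    if (!(x >= -87.0f))
        return 0.0f;                            /* same effect as the mask */

    /* The cast truncates, whereas _mm_cvtps_epi32 rounds to nearest, so
       the integer can differ by at most 1 from the SSE version. */
    return bits_to_float((int32_t)(a * x) + b);
}

/* Absolute relative error, written without libm. */
static float rel_err(float approx, float exact)
{
    assert(exact != 0.0f);
    float d = (approx - exact) / exact;
    return d < 0.0f ? -d : d;
}
```

Checked by hand at x = 0, 1, and -1, this sketch is off by a few percent, which is consistent with the low accuracy described in the question.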

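Can anybody offer an implementation with better accuracy that is just as fast (or faster)?

I would be happy if it is written in C style.

Thank you.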

Answer 1

__m128 BetterFastExpSse (__m128 x)
{
  const __m128 a = _mm_set1_ps ((1 << 22) / float(M_LN2));  // to get exp(x/2)
  const __m128i b = _mm_set1_epi32 (127 * (1 << 23));       // NB: zero shift!
  __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
  __m128i s = _mm_add_epi32 (b, r);
  __m128i t = _mm_sub_epi32 (b, r);
  return _mm_div_ps (_mm_castsi128_ps (s), _mm_castsi128_ps (t));
}

(I'm not a hardware expert: how bad a performance killer is the division here?)
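Since the question asks for C style and the function above uses the C++ cast float(M_LN2), a C-compatible variant might look like the sketch below. It assumes SSE2 is available, replaces M_LN2 with a literal for ln(2), and adds two hypothetical helpers (exp1 and rel_err) for spot checks. The name BetterFastExpSseC is also made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* C-compatible variant of the answer's function (hypothetical name). */
static inline __m128 BetterFastExpSseC(__m128 x)
{
    /* (1 << 22) / ln(2): half the usual scale, so r encodes x/2 */
    const __m128 a = _mm_set1_ps((float)(1 << 22) / 0.69314718f);
    const __m128i b = _mm_set1_epi32(127 * (1 << 23)); /* exponent bias, no shift */

    __m128i r = _mm_cvtps_epi32(_mm_mul_ps(a, x));
    __m128i s = _mm_add_epi32(b, r);  /* bits approximating exp( x/2) */
    __m128i t = _mm_sub_epi32(b, r);  /* bits approximating exp(-x/2) */

    /* exp(x) = exp(x/2) / exp(-x/2) */
    return _mm_div_ps(_mm_castsi128_ps(s), _mm_castsi128_ps(t));
}

/* Evaluate a single value by broadcasting it and reading lane 0. */
static float exp1(float x)
{
    return _mm_cvtss_f32(BetterFastExpSseC(_mm_set1_ps(x)));
}

/* Absolute relative error, written without libm. */
static float rel_err(float approx, float exact)
{
    assert(exact != 0.0f);
    float d = (approx - exact) / exact;
    return d < 0.0f ? -d : d;
}
```

Compared with the question's version, this variant drops the compare and mask and adds one integer subtraction and one _mm_div_ps. Whether that trade is worthwhile depends on the target CPU and is best measured.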

Answer 2

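The cubic and quartic versions give you 4 and 5 significant digits of accuracy, respectively. There is no point in increasing the order any further, because the noise of the low-precision arithmetic then starts to drown out the error of the polynomial approximation. Here are the plain C versions: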

#include <stdint.h>

float fastExp3(register float x)  // cubic spline approximation
{
    union { float f; int32_t i; } reinterpreter;

    reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
    int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
    // empirical values for small maximum relative error (8.34e-5):
    reinterpreter.i +=
         ((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
    return reinterpreter.f;
}

float fastExp4(register float x)  // quartic spline approximation
{
    union { float f; int32_t i; } reinterpreter;

    reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
    int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
    // empirical values for small maximum relative error (1.21e-5):
    reinterpreter.i += (((((((((((3537*m) >> 16)
        + 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
    return reinterpreter.f;
}

Answer 3

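http://ijeais.org/wp-content/uploads/2018/07/IJAER180702.pdf "Creating a Compiler Optimized Inlineable Implementation of Intel Svml Simd Intrinsics"

Its tanh formula, equation 6 on page 9, is very similar to @NicSchraudolph's answer.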
