使用 SSE 和 NEON 进行外环矢量化

问题描述 投票:0回答:2

我想在 ARM NEON 和 SSE 上对以下循环进行矢量化:

for (int i = 0; i < n; ++i) {
   b[i][0] = 0.0;
   for (int j = 1; j < n; ++j) {
      b[i][j] = b[i][j - 1] + a[i][j]; 
   }
}

该循环具有循环携带依赖性,因此不能简单地向量化。相反,可以使用外循环矢量化进行矢量化。

假设

a
b
是 float32,通过并行运行内循环的 4 个实例来实现外循环矢量化函数。以下是矢量化循环的迭代:

First iteration:  { { i = 0, j = 1 }, { i = 1, j = 1 }, { i = 2, j = 1 }, { i = 3 , j = 1 } }
Second iteration: { { i = 0, j = 2 }, { i = 1, j = 2 }, { i = 2, j = 2 }, { i = 3 , j = 2 } }

Etc.

我想使用 NEON 和 SSE 向量化这个循环。它们都不支持聚集和分散,而这是简单矢量化此循环所必需的。您知道如何有效地矢量化这个循环吗?

vectorization simd sse neon
2个回答
1
投票

代码中的算法称为“独占前缀和”。

我不确定SIMD会带来多少利润,但是,关于SSE,请尝试以下版本。微内核:

// Given a vector of [ a, b, c, d ], compute a vector of [ a, a+b, a+b+c, a+b+c+d ]
// The function requires SSE 4.1 support
__m128 inclusivePrefixSum( __m128 v )
{
    // a*2, a+b, c*2, c+d
    __m128 tmp = _mm_add_ps( v, _mm_moveldup_ps( v ) );
    // a, a+b, c, c+d
    tmp = _mm_blend_ps( tmp, v, 0b0101 );

    // Broadcasted a+b
    __m128 res = _mm_shuffle_ps( tmp, tmp, _MM_SHUFFLE( 1, 1, 1, 1 ) );
    // a*2+b, (a+b)*2, a+b+c, a+b+c+d
    res = _mm_add_ps( res, tmp );
    // a, a+b, a+b+c, a+b+c+d
    res = _mm_blend_ps( res, tmp, 0b0011 );
    return res;
}

用途:

    const size_t innerEndAligned = 1 + ( ( n - 1 ) / 4 ) * 4;
    for( size_t i = 0; i < n; i++ )
    {
        float* const pa = a[ i ];
        float* const pb = b[ i ];

        __m128 acc = _mm_setzero_ps();
        _mm_store_ss( pb, acc );
        size_t j;
        for( j = 1; j < innerEndAligned; j += 4 )
        {
            // Load 4 elements
            __m128 vec = _mm_loadu_ps( pa + j );
            // Compute prefix sum
            vec = inclusivePrefixSum( vec );
            // Increment by the accumulator
            vec = _mm_add_ps( vec, acc );
            // Store 4 output elements
            _mm_storeu_ps( pb + j, vec );
            // Broadcast the last lane from the vector,
            // which contains cumulative sum of the row so far, into the accumulator
            acc = _mm_shuffle_ps( vec, vec, _MM_SHUFFLE( 3, 3, 3, 3 ) );
        }

        // Handle the last 0-3 elements of the row with scalar code
        for( ; j < n; j++ )
        {
            __m128 v = _mm_load_ss( pa + j );
            acc = _mm_add_ss( acc, v );
            _mm_store_ss( pb + j, acc );
        }
    }

0
投票

我目前对这个问题很感兴趣,想问一下这个最终实现了吗?

© www.soinside.com 2019 - 2024. All rights reserved.