Using `static` on an AVX2 counting function improves performance about 10x in a multithreaded environment, with no change in compiler optimizations

Question

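On GCC or Clang, when the function is declared `static inline` and built with at least -O1, it takes about 20-35 ms to process an array of 1 billion bytes/chars (at -O0 it takes 400-600 ms, the same as without the `static` keyword). When `static` is removed, it takes 400+ ms. Single-threaded, the time does not change whether or not `static` is used. On MSVC, even with O2i and the AVX2 arch option, it always takes 400 ms or more.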

If I simply replace the AVX2 code with a plain call to std::count(begin, end, target), it runs as fast as the AVX2 version whether or not `static` is specified (this time even on MSVC).

Code:

static inline uint64_t opt_count(const char* begin, const char* end, const char target) noexcept {

    const __m256i avx2_Target = _mm256_set1_epi8(target);
    uint64_t result = 0;

    static __m256i cnk1, cnk2;
    static __m256i cmp1, cmp2;
    static uint32_t msk1, msk2;
    uint64_t cst;

    for (; begin < end; begin += 64) {
        cnk1 = _mm256_load_si256((const __m256i*)(begin));
        cnk2 = _mm256_load_si256((const __m256i*)(begin+32));

        cmp1 = _mm256_cmpeq_epi8(cnk1, avx2_Target);
        cmp2 = _mm256_cmpeq_epi8(cnk2, avx2_Target);

        msk1 = _mm256_movemask_epi8(cmp1);
        msk2 = _mm256_movemask_epi8(cmp2);
        // Casting and shifting is faster than 2 popcnt calls
        cst = static_cast<uint64_t>(msk2) << 32;
        result += _mm_popcnt_u64(msk1 | cst);
    }

    return result;
}

Caller:


uint64_t opt_count_parallel(const char* begin, const char* end, const char target) noexcept {
    const size_t num_threads = std::thread::hardware_concurrency()*2;
    const size_t total_length = end - begin;
    if (total_length < num_threads * 2) {
        return opt_count(begin, end, target);
    }

    const size_t chunk_size = (total_length + num_threads - 1) / num_threads;

    std::vector<std::future<uint64_t>> futures;
    futures.reserve(num_threads);

    for (size_t i = 0; i < num_threads; ++i) {
        const char* chunk_begin = begin + (i * chunk_size);
        const char* chunk_end = std::min(end, chunk_begin + chunk_size);

        futures.emplace_back(std::async(std::launch::async, opt_count, chunk_begin, chunk_end, target));
    }

    uint64_t total_count = 0;
    for (auto& future : futures) {
        total_count += future.get();
    }

    return total_count;
}

In another file, I allocate a buffer with new, align it, memset it to '\n', and set every other character to 'x'. I then time each iteration of the opt_count_parallel call and print its output.
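The benchmark setup described above might look roughly like the following. This is only a sketch of one way to do it, not the asker's actual harness: `AlignedBuffer`, `make_test_buffer`, `time_ms`, and the 64-byte alignment are hypothetical names and choices introduced for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper: owns the raw new[] allocation and exposes an aligned view into it.
struct AlignedBuffer {
    std::unique_ptr<char[]> storage;  // raw allocation obtained with new[]
    char* data = nullptr;             // first 64-byte-aligned address inside storage
    std::size_t size = 0;             // usable bytes starting at data
};

// Allocates size bytes (plus slack), aligns the start to 64 bytes,
// fills with '\n', then sets every other character to 'x'.
AlignedBuffer make_test_buffer(std::size_t size) {
    constexpr std::size_t kAlign = 64;  // assumption: enough for two 32-byte AVX2 loads
    AlignedBuffer buf;
    buf.storage.reset(new char[size + kAlign]);

    void* p = buf.storage.get();
    std::size_t space = size + kAlign;
    buf.data = static_cast<char*>(std::align(kAlign, size, p, space));
    assert(buf.data != nullptr);  // cannot fail: the slack covers the worst-case adjustment
    buf.size = size;

    std::memset(buf.data, '\n', size);
    for (std::size_t i = 0; i < size; i += 2) {
        buf.data[i] = 'x';
    }
    return buf;
}

// Times a single call of f and returns the elapsed wall-clock time in milliseconds.
template <class F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

With a buffer of N bytes built this way, counting 'x' should return (N + 1) / 2, which gives a quick correctness check to run alongside the timing of each `opt_count_parallel` call.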

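I have tried both thread and future, and the results are more or less the same.

Here is the godbolt diff view: https://godbolt.org/z/9P87bndsb . I don't see much of a difference in the assembly, but I don't know enough to understand the subtle differences.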

I also tried creating avx2_Target in opt_count_parallel, outside of opt_count, which made no difference.

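I looked at GCC's -fopt-info output, and it was identical in both cases. I also tried forcing inlining and noalign, but again there was no noticeable difference.

I also tried debugging and profiling, but that is a bit awkward because the speedup is lost at -O0, and profiling just shows that everything takes longer to execute.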

Answer 1

Changing these:

    static __m256i cnk1, cnk2;
    static __m256i cmp1, cmp2;

to:

    __m256i cnk1, cnk2;
    __m256i cmp1, cmp2;

solved the problem. Thanks, harold.
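For context on why this helps: a function-local `static` variable has a single instance shared by every thread that calls the function, so in the threaded version all workers write the same objects on every loop iteration. That is a data race in C++, and the shared writes plausibly cause the cache-line contention behind the slowdown (this explanation is an inference, not something measured in the question). Below is a minimal sketch of the kernel with every temporary as an automatic local. The name `opt_count_fixed`, the `target` attribute (assuming GCC or Clang on x86-64, standing in for `-mavx2 -mpopcnt`), the unaligned loads, and the scalar tail loop are assumptions added for illustration and are not part of the original code.

```cpp
#include <cstdint>
#include <immintrin.h>

// Sketch only: assumes GCC/Clang on x86-64 and a CPU with AVX2 and POPCNT.
__attribute__((target("avx2,popcnt")))
std::uint64_t opt_count_fixed(const char* begin, const char* end, char target) noexcept {
    const __m256i needle = _mm256_set1_epi8(target);
    std::uint64_t result = 0;

    // Main loop: 64 bytes per iteration. Every temporary is an automatic local,
    // so each thread works on its own registers/stack and nothing is shared.
    for (; end - begin >= 64; begin += 64) {
        const __m256i cnk1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(begin));
        const __m256i cnk2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(begin + 32));

        const __m256i cmp1 = _mm256_cmpeq_epi8(cnk1, needle);
        const __m256i cmp2 = _mm256_cmpeq_epi8(cnk2, needle);

        // movemask returns int; go through uint32_t to avoid sign extension.
        const std::uint64_t msk1 = static_cast<std::uint32_t>(_mm256_movemask_epi8(cmp1));
        const std::uint64_t msk2 = static_cast<std::uint32_t>(_mm256_movemask_epi8(cmp2));

        // Combine both 32-bit masks and count the set bits with one popcnt.
        result += static_cast<std::uint64_t>(_mm_popcnt_u64(msk1 | (msk2 << 32)));
    }

    // Scalar tail: handles the last (end - begin) % 64 bytes without reading past end.
    for (; begin < end; ++begin) {
        result += (*begin == target) ? 1u : 0u;
    }
    return result;
}
```

The unaligned loads and the tail loop are a defensive choice: with them, the function gives the right answer for any `[begin, end)` range, including thread chunks whose start is not 32-byte aligned or whose length is not a multiple of 64.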
