具有不同调用顺序的 boost::dynamic_bitset 的 [] 运算符的计算时间存在差异

问题描述 投票:0回答:1

我发现在 AWS EC2 实例(Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz)上调用 boost::dynamic_bitset 的 [] 运算符的计算时间存在变化。在下面的代码中,当定义 DO_COUNT_IN_FUNC 时,调用大约需要 370 毫秒。然而,当未定义 DO_COUNT_IN_FUNC 时,调用大约需要 1070 毫秒。

尝试了以下编译器组合。所有情况下的结果都是相似的。

g++ -o perfTestGetDB -O2 -std=c++17 -DDO_COUNT_IN_FUNC perfTestGetDB.cpp
g++ -o perfTestGetDB -O2 -std=c++17 perfTestGetDB.cpp

这是代码。

#include <iostream>
#include <chrono>
#include <random>
#include <boost/dynamic_bitset.hpp>

const int numBits = 1536;
const int setSize = 500;
const int searchSize = 1000000;

void perfTestCountAndGet()
{
#ifdef DO_COUNT_IN_FUNC
    boost::dynamic_bitset<> db1(numBits);
#endif
    boost::dynamic_bitset<> db2(numBits);
    int searchList[setSize];

    // Use the current time as a seed for the random number generator
    unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
    std::mt19937 generator(seed); // Mersenne Twister 19937 generator

    // Define the range of random numbers (0 to numBits - 1)
    std::uniform_int_distribution<int> distribution(0, numBits - 1);
    for (int i = 0; i < setSize; i++) {
        int idx = distribution(generator);
#ifdef DO_COUNT_IN_FUNC
        db1.set(idx);
#endif
        db2.set(idx);
    }

    // Define another random search index
    for (int i = 0; i < setSize; i++) {
        searchList[i] = distribution(generator);
    }

#ifdef DO_COUNT_IN_FUNC
    // Perform count
    int count = db1.count();
    std::cout << "Dummy print to prvent optimizing out of the code:" << count << std::endl << std::endl;
#endif

    // Perform the [] operation on dynamic_bitset randonly for setSize
    // for loopSize iterations
    auto start2 = std::chrono::high_resolution_clock::now();
    long int result2 = 0;
    for (int l = 0; l < searchSize; l++) {
        for (int idx = 0; idx < setSize; idx++) {
            bool bit = db2[searchList[idx]];
            if (bit) {
               result2 += idx; // Use the bit value in a computation
            }
        }
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(end2 - start2);
    std::cout << "dynamic_bitset,[]," << duration2.count() / 1000.0L << std::endl;
    std::cout << "Dummy print to prevent optimizing out of the code:" << result2 << std::endl << std::endl;
}

int main()
{
#ifndef DO_COUNT_IN_FUNC
    std::cout << "Check count in main" << std::endl;
    boost::dynamic_bitset<> db1(numBits);
    int count = db1.count();
    std::cout << "Dummy print to prevent optimizing out of the code:" << count << std::endl << std::endl;
#endif
    perfTestCountAndGet();
}
c++ boost
1个回答
0
投票

出于好奇,我想将其简化为适当的微基准:

#define NONIUS_RUNNER
#include <boost/dynamic_bitset.hpp>
#include <chrono>
#include <iostream>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <random>
using namespace std::chrono_literals;

static constexpr int  numBits    = 1'536;
static constexpr int  setSize    = 500;
static constexpr int  searchSize = 1'000'000;
static constexpr auto now        = std::chrono::high_resolution_clock::now;

template <bool DoHack = false>
size_t perfTestCountAndGet() {
    boost::dynamic_bitset<> bits(numBits);
    int                     searchList[setSize];

    auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
                     std::mt19937(std::random_device{}()));

    for (size_t i = 0; i < setSize; i++) {
        int idx = dist();
        bits.set(idx);
    }

    // Define another random search index
    std::generate_n(searchList, setSize, dist);

    size_t proof_of_work = 0;
    if constexpr (DoHack) {
        auto db1 = bits; // why?!
        proof_of_work += db1.count();
    }

    // Perform the [] operation on dynamic_bitset randonly for setSize
    // for loopSize iterations
    auto start = now();
    for (int l = 0; l < searchSize; l++) {
        for (int idx = 0; idx < setSize; idx++) {
            bool bit = bits[searchList[idx]];
            if (bit) {
                proof_of_work += idx; // Use the bit value in a computation
            }
        }
    }
    auto d = now() - start;
    std::cout << "dynamic_bitset,[]," << d / 1.ms << std::endl;
    return proof_of_work ;
}

NONIUS_BENCHMARK("normal", [] { return perfTestCountAndGet<false>(); })
NONIUS_BENCHMARK("with_hack", [] { return perfTestCountAndGet<true>(); })

这确实为

DoHack = true
提供了显着的优势,这有点令人惊讶,因为您甚至没有“预热”位集的缓存,因为它计算 copy 中的位(
db1
):

enter image description here

现在,我不得不承认我不知道测试实际上代表什么。所以看看测试的核心:

for (int l = 0; l < searchSize; l++) {
    for (int idx = 0; idx < setSize; idx++) {
        bool bit = bits[searchList[idx]];
        if (bit)
            proof_of_work += idx; // Use the bit value in a computation
    }
}

我发现了一些奇怪的事情:

  • l
    没有使用只是迭代
  • idx
    正在迭代
    searchList
  • 中的所有索引
  • 但是,计算中使用的不是那个idx,而是......
    searchList
    的索引。这看起来像是一个错误,因为它可能使事情变得非常可预测。事实上,我们可以做一次内循环,然后做
    proof_of_work += searchSize * (sum_of_inner_loop)
    而不是迭代。

所以让我们稍微改变一下:

  1. 将迭代留给基准统计分析
  2. 计时的东西同上
  3. 实际上在计算中使用
    searchList[idx]
    而不是
    idx
  4. 最后,剪切
    db1
    副本(这很可爱,因为它说明了它本身不是缓存/预取,但它仍然是一个无用的副本)

这显着简化了代码

#define NONIUS_RUNNER
#include <boost/dynamic_bitset.hpp>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <random>

static constexpr int numBits = 1'536;
static constexpr int setSize = 500;

template <bool DoHack> //
size_t perfTestCountAndGet(nonius::chronometer& cm) {
    boost::dynamic_bitset<> bits(numBits);
    int                     searchList[setSize];

    auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
                     std::mt19937(std::random_device{}()));

    for (size_t i = 0; i < setSize; i++) {
        int idx = dist();
        bits.set(idx);
    }

    // Define another random search index
    std::generate_n(searchList, setSize, dist);

    size_t proof_of_work = DoHack ? bits.count() : 0;

    cm.measure([&] {
        for (auto probe : searchList)
            if (bool bit = bits[probe])
                proof_of_work += probe;
    });
    return proof_of_work;
}

NONIUS_BENCHMARK("normal", [](nonius::chronometer cm) { return perfTestCountAndGet<false>(cm); })
NONIUS_BENCHMARK("with_hack", [](nonius::chronometer cm) { return perfTestCountAndGet<true>(cm); })

这仍然显示出优势:

enter image description here

但是,更改基准的顺序会扭转这种情况!

enter image description here

在极端情况下,我想排除创建和重新播种 Mersenne Twister 的影响:

static auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
                        std::mt19937(std::random_device{}()));

template <bool DoHack> //
size_t perfTestCountAndGet(nonius::chronometer& cm) {
    boost::dynamic_bitset<> bits(numBits);
    for (size_t i = 0; i < setSize; i++)
        bits.set(dist());

    // Define another random search index
    int searchList[setSize];
    std::generate_n(searchList, setSize, dist);

    size_t proof_of_work = DoHack ? bits.count() : 0;

    cm.measure([&] {
        for (auto probe : searchList)
            if (bool bit = bits[probe])
                proof_of_work += probe;
    });
    return proof_of_work;
}

现在差异已经消失了。稍微增加样本量以降低异常值的影响:

static constexpr int numBits = 15'360;
static constexpr int setSize =  5'000;

最终演示

#define NONIUS_RUNNER
#include <boost/dynamic_bitset.hpp>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <random>

static constexpr int numBits = 15'360;
static constexpr int setSize =  5'000;

static auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
                        std::mt19937(std::random_device{}()));

template <bool DoHack> //
size_t perfTestCountAndGet(nonius::chronometer& cm) {
    boost::dynamic_bitset<> bits(numBits);
    for (size_t i = 0; i < setSize; i++)
        bits.set(dist());

    // Define another random search index
    int searchList[setSize];
    std::generate_n(searchList, setSize, dist);

    size_t proof_of_work = DoHack ? bits.count() : 0;

    cm.measure([&] {
        for (auto probe : searchList)
            if (bool bit = bits[probe])
                proof_of_work += probe;
    });
    return proof_of_work;
}

NONIUS_BENCHMARK("normal", [](nonius::chronometer cm) { return perfTestCountAndGet<false>(cm); })
NONIUS_BENCHMARK("with_hack", [](nonius::chronometer cm) { return perfTestCountAndGet<true>(cm); })

enter image description here

enter image description here

神智恢复

© www.soinside.com 2019 - 2024. All rights reserved.