我发现在 AWS EC2 实例(Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz)上调用 boost::dynamic_bitset 的 [] 运算符的计算时间存在变化。在下面的代码中,当定义 DO_COUNT_IN_FUNC 时,调用大约需要 370 毫秒。然而,当未定义 DO_COUNT_IN_FUNC 时,调用大约需要 1070 毫秒。
尝试了以下编译器组合。所有情况下的结果都是相似的。
g++ -o perfTestGetDB -O2 -std=c++17 -DDO_COUNT_IN_FUNC perfTestGetDB.cpp
g++ -o perfTestGetDB -O2 -std=c++17 perfTestGetDB.cpp
这是代码。
#include <iostream>
#include <chrono>
#include <random>
#include <boost/dynamic_bitset.hpp>
const int numBits = 1536;
const int setSize = 500;
const int searchSize = 1000000;
void perfTestCountAndGet()
{
#ifdef DO_COUNT_IN_FUNC
boost::dynamic_bitset<> db1(numBits);
#endif
boost::dynamic_bitset<> db2(numBits);
int searchList[setSize];
// Use the current time as a seed for the random number generator
unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
std::mt19937 generator(seed); // Mersenne Twister 19937 generator
// Define the range of random numbers (0 to numBits - 1)
std::uniform_int_distribution<int> distribution(0, numBits - 1);
for (int i = 0; i < setSize; i++) {
int idx = distribution(generator);
#ifdef DO_COUNT_IN_FUNC
db1.set(idx);
#endif
db2.set(idx);
}
// Define another random search index
for (int i = 0; i < setSize; i++) {
searchList[i] = distribution(generator);
}
#ifdef DO_COUNT_IN_FUNC
// Perform count
int count = db1.count();
std::cout << "Dummy print to prvent optimizing out of the code:" << count << std::endl << std::endl;
#endif
// Perform the [] operation on dynamic_bitset randonly for setSize
// for loopSize iterations
auto start2 = std::chrono::high_resolution_clock::now();
long int result2 = 0;
for (int l = 0; l < searchSize; l++) {
for (int idx = 0; idx < setSize; idx++) {
bool bit = db2[searchList[idx]];
if (bit) {
result2 += idx; // Use the bit value in a computation
}
}
}
auto end2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(end2 - start2);
std::cout << "dynamic_bitset,[]," << duration2.count() / 1000.0L << std::endl;
std::cout << "Dummy print to prevent optimizing out of the code:" << result2 << std::endl << std::endl;
}
int main()
{
#ifndef DO_COUNT_IN_FUNC
std::cout << "Check count in main" << std::endl;
boost::dynamic_bitset<> db1(numBits);
int count = db1.count();
std::cout << "Dummy print to prevent optimizing out of the code:" << count << std::endl << std::endl;
#endif
perfTestCountAndGet();
}
出于好奇,我想将其简化为适当的微基准:
#define NONIUS_RUNNER
#include <boost/dynamic_bitset.hpp>
#include <chrono>
#include <iostream>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <random>
using namespace std::chrono_literals;
static constexpr int numBits = 1'536;
static constexpr int setSize = 500;
static constexpr int searchSize = 1'000'000;
static constexpr auto now = std::chrono::high_resolution_clock::now;
template <bool DoHack = false>
size_t perfTestCountAndGet() {
boost::dynamic_bitset<> bits(numBits);
int searchList[setSize];
auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
std::mt19937(std::random_device{}()));
for (size_t i = 0; i < setSize; i++) {
int idx = dist();
bits.set(idx);
}
// Define another random search index
std::generate_n(searchList, setSize, dist);
size_t proof_of_work = 0;
if constexpr (DoHack) {
auto db1 = bits; // why?!
proof_of_work += db1.count();
}
// Perform the [] operation on dynamic_bitset randonly for setSize
// for loopSize iterations
auto start = now();
for (int l = 0; l < searchSize; l++) {
for (int idx = 0; idx < setSize; idx++) {
bool bit = bits[searchList[idx]];
if (bit) {
proof_of_work += idx; // Use the bit value in a computation
}
}
}
auto d = now() - start;
std::cout << "dynamic_bitset,[]," << d / 1.ms << std::endl;
return proof_of_work ;
}
NONIUS_BENCHMARK("normal", [] { return perfTestCountAndGet<false>(); })
NONIUS_BENCHMARK("with_hack", [] { return perfTestCountAndGet<true>(); })
这确实为
DoHack = true
提供了显着的优势,这有点令人惊讶,因为您甚至没有“预热”位集的缓存,因为它计算 copy 中的位(db1
):
现在,我不得不承认我不知道测试实际上代表什么。所以看看测试的核心:
for (int l = 0; l < searchSize; l++) {
for (int idx = 0; idx < setSize; idx++) {
bool bit = bits[searchList[idx]];
if (bit)
proof_of_work += idx; // Use the bit value in a computation
}
}
我发现了一些奇怪的事情:
l
没有使用只是迭代idx
正在迭代 searchList
searchList
的索引。这看起来像是一个错误,因为它可能使事情变得非常可预测。事实上,我们可以做一次内循环,然后做proof_of_work += searchSize * (sum_of_inner_loop)
而不是迭代。所以让我们稍微改变一下:
searchList[idx]
而不是 idx
db1
副本(这很可爱,因为它说明了它本身不是缓存/预取,但它仍然是一个无用的副本)这显着简化了代码
#define NONIUS_RUNNER
#include <boost/dynamic_bitset.hpp>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <random>
static constexpr int numBits = 1'536;
static constexpr int setSize = 500;
template <bool DoHack> //
size_t perfTestCountAndGet(nonius::chronometer& cm) {
boost::dynamic_bitset<> bits(numBits);
int searchList[setSize];
auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
std::mt19937(std::random_device{}()));
for (size_t i = 0; i < setSize; i++) {
int idx = dist();
bits.set(idx);
}
// Define another random search index
std::generate_n(searchList, setSize, dist);
size_t proof_of_work = DoHack ? bits.count() : 0;
cm.measure([&] {
for (auto probe : searchList)
if (bool bit = bits[probe])
proof_of_work += probe;
});
return proof_of_work;
}
NONIUS_BENCHMARK("normal", [](nonius::chronometer cm) { return perfTestCountAndGet<false>(cm); })
NONIUS_BENCHMARK("with_hack", [](nonius::chronometer cm) { return perfTestCountAndGet<true>(cm); })
这仍然显示出优势:
但是,更改基准的顺序会扭转这种情况!
在极端情况下,我想排除创建和重新播种 Mersenne Twister 的影响:
static auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
std::mt19937(std::random_device{}()));
template <bool DoHack> //
size_t perfTestCountAndGet(nonius::chronometer& cm) {
boost::dynamic_bitset<> bits(numBits);
for (size_t i = 0; i < setSize; i++)
bits.set(dist());
// Define another random search index
int searchList[setSize];
std::generate_n(searchList, setSize, dist);
size_t proof_of_work = DoHack ? bits.count() : 0;
cm.measure([&] {
for (auto probe : searchList)
if (bool bit = bits[probe])
proof_of_work += probe;
});
return proof_of_work;
}
现在差异已经消失了。稍微增加样本量以降低异常值的影响:
static constexpr int numBits = 15'360;
static constexpr int setSize = 5'000;
#define NONIUS_RUNNER
#include <boost/dynamic_bitset.hpp>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>
#include <random>
static constexpr int numBits = 15'360;
static constexpr int setSize = 5'000;
static auto dist = bind(std::uniform_int_distribution<size_t>(0, numBits - 1), //
std::mt19937(std::random_device{}()));
template <bool DoHack> //
size_t perfTestCountAndGet(nonius::chronometer& cm) {
boost::dynamic_bitset<> bits(numBits);
for (size_t i = 0; i < setSize; i++)
bits.set(dist());
// Define another random search index
int searchList[setSize];
std::generate_n(searchList, setSize, dist);
size_t proof_of_work = DoHack ? bits.count() : 0;
cm.measure([&] {
for (auto probe : searchList)
if (bool bit = bits[probe])
proof_of_work += probe;
});
return proof_of_work;
}
NONIUS_BENCHMARK("normal", [](nonius::chronometer cm) { return perfTestCountAndGet<false>(cm); })
NONIUS_BENCHMARK("with_hack", [](nonius::chronometer cm) { return perfTestCountAndGet<true>(cm); })
神智恢复