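I am trying to use OpenMP to benchmark the speed of a data structure I implemented. However, I seem to be making a fundamental mistake: whatever I try to benchmark, the throughput decreases with the number of threads instead of increasing. Below is code that benchmarks the speed of a plain for loop, so I would expect it to scale (somewhat) linearly with the number of threads, but it does not (compiled on a dual-core laptop with g++ using C++11, both with and without the -O3 flag).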
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>

thread_local const int OPS = 10000;
thread_local const int TIMES = 200;

double get_tp(int THREADS)
{
    double threadtime[THREADS] = {0};
    //Repeat the test many times
    for(int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma omp parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS/float(THREADS);
            int t = omp_get_thread_num();
            //Force all threads to start at the same time
            #pragma omp barrier
            start = omp_get_wtime();
            //Do a certain kind of operations loc_ops times
            for(int i = 0; i < loc_ops; i++)
            {
                //Here I would put the operations to benchmark
                //in this case a boring for loop
                int x = 0;
                for(int j = 0; j < 1000; j++)
                    x++;
            }
            stop = omp_get_wtime();
            threadtime[t] += stop-start;
        }
    }
    double total_time = 0;
    std::cout << "\nThread times: ";
    for(int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";
    double mopss = float(OPS)*TIMES/total_time;
    return mopss;
}

int main()
{
    std::cout << "\n1 " << get_tp(1) << "ops/s\n";
    std::cout << "\n2 " << get_tp(2) << "ops/s\n";
    std::cout << "\n4 " << get_tp(4) << "ops/s\n";
    std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
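Here is the output with -O3 on the dual-core machine. We would not expect the throughput to increase beyond 2 threads, but it does not even increase when going from 1 thread to 2 threads; it drops by 50%: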
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
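To make sure the compiler does not remove the loop, I also tried printing x after measuring the time, and as far as I can tell the problem persists. I also tried the code on a machine with more cores, and it behaved very similarly. Without -O3, the throughput does not scale either. So there is clearly something wrong with the way I benchmark. I hope you can help me.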
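Even if you print the value of x after the outer loop, the loop is almost certainly still being optimized away. The compiler can trivially replace the entire loop with a single instruction, because the loop bounds are constant at compile time. Indeed, in this example: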
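You are measuring the total number of operations per total CPU time, and are then surprised that it is a decreasing function of the number of threads. This is almost always the case, except when cache effects come into play. The true performance metric is the number of operations per wall-clock time.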
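As a rough sketch of the difference between the two metrics, the helpers below compute throughput both ways from per-thread times such as those printed in the question. The names busy_work, ops_per_cpu_time and ops_per_wall_time are hypothetical and do not come from the question's code. The wall-clock figure is approximated here by the slowest thread's accumulated time, which is an assumption; timing the whole parallel region with a single pair of omp_get_wtime() calls would be more direct. busy_work shows one common way to keep a loop from being folded into a constant, by writing through a volatile variable.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical work loop: the volatile counter forces the compiler to
// perform every load and store, so the loop cannot be replaced by a
// single constant assignment.
int busy_work(int iterations)
{
    volatile int x = 0;
    for (int j = 0; j < iterations; j++)
        x = x + 1;
    return x;
}

// What the question's code computes: operations divided by the SUM of
// all per-thread times, i.e. operations per total CPU time.
double ops_per_cpu_time(double total_ops, const std::vector<double>& thread_times)
{
    double cpu_time = std::accumulate(thread_times.begin(), thread_times.end(), 0.0);
    return total_ops / cpu_time;
}

// Operations per wall-clock time. Assumption: the threads run
// concurrently, so elapsed time is approximated by the slowest thread.
double ops_per_wall_time(double total_ops, const std::vector<double>& thread_times)
{
    double wall_time = *std::max_element(thread_times.begin(), thread_times.end());
    return total_ops / wall_time;
}
```

With the two-thread times from the question (7.36701e-06 and 7.38301e-06) and 10000 * 200 operations, ops_per_cpu_time gives about 1.356e11, matching the printed value, while ops_per_wall_time gives about 2.709e11, roughly the single-thread figure. The measured times barely change with the thread count even though each thread is given fewer iterations, which is consistent with the loop having been optimized away.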