Throughput of trivially concurrent code does not increase with the number of threads

Question

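I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to be making a fundamental mistake: no matter which operation I benchmark, the throughput decreases instead of increases with the number of threads. Below is code that benchmarks the speed of a plain for loop. I would expect it to scale (somewhat) linearly with the number of threads, but it does not (compiled on a dual-core laptop with g++ and C++11, both with and without the -O3 flag).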

#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>

thread_local const int OPS = 10000;
thread_local const int TIMES = 200;

double get_tp(int THREADS)
{
    double threadtime[THREADS] = {0};

    //Repeat the test many times
    for(int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma  omp  parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS/float(THREADS);
            int t = omp_get_thread_num();

            //Force all threads to start at the same time
            #pragma  omp  barrier
            start = omp_get_wtime();


            //Do a certain kind of operations loc_ops times
            for(int i = 0; i < loc_ops; i++)
            {
                //Here I would put the operations to benchmark
                //in this case a boring for loop
                int x = 0;
                for(int j = 0; j < 1000; j++)
                    x++;
            }

            stop = omp_get_wtime();
            threadtime[t] += stop-start;
        }
    }

    double total_time = 0;
    std::cout << "\nThread times: ";
    for(int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";
    double mopss = float(OPS)*TIMES/total_time;
    return mopss;
}

int main()
{
    std::cout << "\n1  " << get_tp(1) << "ops/s\n";
    std::cout << "\n2  " << get_tp(2) << "ops/s\n";
    std::cout << "\n4  " << get_tp(4) << "ops/s\n";
    std::cout << "\n8  " << get_tp(8) << "ops/s\n";
}

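Output with -O3 on the dual-core machine. We would not expect throughput to increase beyond 2 threads, but it does not even increase when going from 1 thread to 2 threads; it drops by 50%: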

1 Thread 
Thread times: 7.411e-06, 
Total time: 7.411e-06
2.69869e+11 ops/s

2 Threads 
Thread times: 7.36701e-06, 7.38301e-06, 
Total time: 1.475e-05
1.35593e+11ops/s

4 Threads 
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06, 
Total time: 3.16e-05
6.32911e+10ops/s

8 Threads 
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06, 
Total time: 6.5658e-05
3.04609e+10ops/s

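To make sure the compiler does not remove the loop, I also tried printing "x" after measuring the time, and as far as I can tell the problem persists. I also ran the code on a machine with more cores and it behaved very similarly. Without -O3 the throughput does not scale either. So there is clearly something wrong with the way I benchmark. I hope you can help me.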

Answer 1

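The loop is almost certainly still being optimized away, even if you print the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction, because the loop bounds are constant at compile time. Indeed, in this example: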

Answer 2

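I am not sure why you define performance as the total number of operations per total CPU time and are then surprised that it is a decreasing function of the number of threads. That will almost always be the case, except when cache effects kick in. The true performance metric is the number of operations per wall-clock time.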
