Standard C++ parallelism with nvc++ is slow

Question

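I'm not sure what I'm doing wrong, but std::sort compiled with nvc++ -stdpar seems to be much slower than the same code compiled with g++. Other std functions do better, but they are never faster than the multithreaded CPU version.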

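Below is the code, where TIMEIT is just a macro that reads a timer before and after the enclosed statement.

    #include <algorithm>
    #include <chrono>
    #include <execution>
    #include <iostream>
    #include <random>
    #include <vector>

    using Duration = std::chrono::duration<double, std::milli>;
    std::random_device rd;
    std::uniform_real_distribution<> dist(1,1000);

    int main(){
        int n=1<<21;
        std::vector<float> v(n);

        std::generate(v.begin(),v.end(),[&](){ return dist(rd);});
        Duration d;
        // TIMEIT is my timing macro (definition not shown here); d receives the elapsed time
        TIMEIT(d,
            std::sort(std::execution::par,v.begin(),v.end());
        )
        std::cout<<d.count()<<"\n";
    }

Results (in milliseconds):

g++ -O2 std.cpp -ltbb : 69.9

nvc++ -stdpar std.cu : 217.164

Note that nvc++ issues a warning about running sequentially for cc (compute capability) less than 70, but when I tried without -stdpar the time was 756, so I guess it is parallelizing. If it is not, I don't know how to force it to parallelize.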

Answer 1

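Standard C++ parallelism with nvc++ is not slow at all.

I see two problems with your test:

1. Your hardware seems unsuitable (your GPU is not very good).

2. The test is not performed correctly. You should exclude the time needed to transfer the data from the host to the GPU.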

For reference, the setup from the question appears to be: nvc++ 22.3 on WSL2 on Windows 11; GPU: GeForce GTX 1660 Ti with Max-Q Design, 6 GB RAM; CPU: Intel i7-1065G7, 32 GB RAM.

My hardware:

CPU: two Intel Xeon(R) CPU E5-2699 v4 @ 2.20GHz (44 physical cores and 64 GB RAM)

GPU: just one NVIDIA GeForce RTX 3070 Ti (8 GB)

The compilers are:

nvc++ 23.1-0 64-bit target on x86-64 Linux

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Both are great compilers (in my humble opinion).

For clarity, I post the complete code below so that everyone can reproduce it exactly as I intended.

    #include <sys/time.h>
    #include <chrono>
    #include <iostream>
    #include <vector>
    #include <algorithm>
    #include <numeric>
    #include <execution>
    #include <random>
    #include <iterator>

    std::random_device rd;
    std::uniform_real_distribution<> dist(1,1000);

    int main(){
        int n=1<<21;
        std::vector<float> v(n);
        std::generate(v.begin(),v.end(),[&](){ return dist(rd);});
        {
            // Only the parallel sort is timed
            auto start = std::chrono::high_resolution_clock::now();
            std::sort(std::execution::par,v.begin(),v.end());
            auto stop = std::chrono::high_resolution_clock::now();
            auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
            std::cout << "Time elapsed: " << duration.count() << " microseconds" << std::endl;
        }
        return 0;
    }

My results are:

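nvc++ -stdpar=multicore std.cpp : 84962 microseconds

g++ -O2 std.cpp -ltbb : 17130 microseconds

nvc++ -stdpar=gpu std.cpp : 3824 microseconds

So nvc++ wins easily when the code runs on the GPU (roughly 4.5 times faster than g++ here), but on multicore it loses to g++. This is a trend I have observed in many other projects: for multicore, the GNU compiler does better than the NVIDIA compiler (which is basically PGI). It is also possible that the two simply use resources differently. Long story short, the GPU wins here.

GPU wins even at a disadvantage

Just out of curiosity, let's add one line after std::generate:

    std::for_each(std::execution::par,v.begin(),v.end(),[](auto& it) { it *= 2.;});

We do not include this for_each in the elapsed time. We only use it to move the memory from the host to the GPU, so that when std::sort is called on the GPU the data is already there and does not need to be copied (the magic of unified memory management). This way we measure only the time the GPU needs to run the sorting algorithm.

The result: nvc++ standard parallelism on the GPU is about 28 times faster than g++ parallelism on multicore (and the gap to nvc++ multicore is larger still).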
