CUDA force instruction execution order

Question

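I'm trying to move some data manipulation from the CPU to the GPU (CUDA), but there is a small part that requires instructions to run in a specific order. In principle I could run the first few parallel parts, transfer the results to the host for the serial part, and then transfer them back again for the remaining parallel parts, but I'm trying to avoid the memory transfer overhead.

The serial part of the calculation has the form: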

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + result[i-1];
}

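Apart from launching a kernel on a single thread to do this calculation, is there some way to force the threads or calculations to run in a specific order?

EDIT:

The problem is slightly more complicated than what I first showed, and as far as I can tell it cannot be treated as a prefix sum problem.

The loop actually takes the form: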

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + k * result[i-1];
}

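I've been looking through the documentation for the Thrust library, but there doesn't seem to be a solution there. However, I may just not understand what I'm looking at. Is there a parallel solution to this sort of problem?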

Answer 1

One possible way to describe problems of this kind is that they fall into the category of recurrence relations.

The original problem:

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + result[i-1];
}

can be readily solved with a prefix sum on oldArray, if desired, following the description given in this question/answer.

After the edit:
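The prefix-sum approach for the original loop can be sketched with Thrust as follows. This is a minimal sketch, not code from the linked answer: the helper name prefix_sum_on_gpu is hypothetical, and it assumes the recurrence starts at index 1 with result[0] = oldArray[0] (the loop in the question starts at i = 2, so the seed handling would need adjusting there).

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Hypothetical helper (not from the original post).
// Assumption: result[0] = oldArray[0], result[i] = oldArray[i] + result[i-1] for i >= 1,
// which is exactly an inclusive prefix sum of oldArray.
std::vector<float> prefix_sum_on_gpu(const std::vector<float> &oldArray) {
    thrust::device_vector<float> d_in(oldArray.begin(), oldArray.end());
    thrust::device_vector<float> d_out(oldArray.size());
    // The default binary operator of inclusive_scan is addition.
    thrust::inclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
    std::vector<float> result(oldArray.size());
    thrust::copy(d_out.begin(), d_out.end(), result.begin());
    return result;
}

int main() {
    std::vector<float> r = prefix_sum_on_gpu({1.0f, 2.0f, 3.0f, 4.0f});
    assert(r.size() == 4);
    for (float v : r) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}
```

In a real pipeline the data would already live in device memory, so the two copies in this helper would disappear; they are here only to keep the sketch self-contained.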

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + k * result[i-1];
}

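we have to do additional work. Referring to the previously linked answer, at the bottom of that answer there is a reference to "this paper" by Blelloch. If we study section 1.4 of that paper, we can observe that this new problem formulation fits the "first-order recurrence" pattern described in section 1.4.1, specifically formula 1.5. The paper proves that the solution to that formulation can be realized with a scan operation, provided we carefully specify the input/output data as well as the scan operator. Thrust supports this kind of generalization of its basic scans. The sets of pairs referred to in the paper as s and c can be realized as thrust::tuple, and a specific operator can be passed to the Thrust scan operation to generalize its behavior.

I won't try to cover everything in that paper; we mostly only need to focus on the material presented on pages 48 and 49.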
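As a reading aid, the first-order recurrence and its pair operator can be written out for this specific problem. The following is a restatement under assumptions, not a quotation from the paper: it specializes the paper's two generic operators to ordinary multiplication and addition, takes a_i = k and b_i = oldArray[i], and seeds the recurrence with x_0 = b_0.

```latex
% Recurrence to solve (b_i = oldArray[i], a_i = k, x_i = result[i]):
x_0 = b_0, \qquad x_i = a_i \, x_{i-1} + b_i \quad (0 < i < n)

% Pairs and the combining operator:
c_i = (a_i,\, b_i), \qquad
(a_l,\, b_l) \bullet (a_r,\, b_r) = (a_l a_r,\; b_l a_r + b_r)

% An inclusive scan with \bullet yields s_i = (y_i,\, x_i):
s_0 = c_0, \qquad
s_i = s_{i-1} \bullet c_i = (y_{i-1} a_i,\; x_{i-1} a_i + b_i)

% \bullet is associative, which is what makes a parallel scan valid:
(c_l \bullet c_m) \bullet c_r = c_l \bullet (c_m \bullet c_r)
  = (a_l a_m a_r,\; b_l a_m a_r + b_m a_r + b_r)
```

The second component of each s_i is the wanted result[i]; the first component is only the running product of the a values and can be ignored afterwards.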

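What follows is an example using Thrust, demonstrating that we can use a Thrust scan operation exactly as described in the paper to solve this problem. The code below is annotated with comments that reference the specific formulas in Blelloch's paper: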

$ cat t1929.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
#include <thrust/iterator/zip_iterator.h>
#include <cstdlib>

template <typename T>
void cpufunction(T *result, T *oldArray, size_t size, T k){
  for (int i = 1; i<size; i++)
  {
    result[i] = oldArray[i] + k * result[i-1];
  }
}

struct scan_op // as per blelloch (1.7)
{
  template <typename T1, typename T2>
  __host__ __device__
  T1 operator()(const T1 &t1, const T2 &t2){
    T1 ret;
    thrust::get<0>(ret) = thrust::get<0>(t1)*thrust::get<0>(t2);
    thrust::get<1>(ret) = thrust::get<1>(t1)*thrust::get<0>(t2)+thrust::get<1>(t2);
    return ret;
    }
};

typedef float mt;
const size_t ds = 1048576;
const mt k = 1.01;
int main(){

  mt *b  = new mt[ds]; // b as in blelloch (1.5)
  mt *a  = new mt[ds]; // a as in blelloch (1.5)
  mt *cr = new mt[ds]; // cpu result
  for (int i = 0; i < ds; i++) { a[i] = k; b[i] = rand()/(float)RAND_MAX;}
  cr[0] = b[0];
  cpufunction(cr, b, ds, k);
  for (int i = 0; i < 10; i++) std::cout << cr[i] << ",";
  std::cout << std::endl;
  thrust::device_vector<mt> db(b, b+ds);
  thrust::device_vector<mt> da(a, a+ds);
  thrust::device_vector<mt> dy(ds);
  thrust::device_vector<mt> dx(ds);
  thrust::inclusive_scan(thrust::make_zip_iterator(thrust::make_tuple(da.begin(), db.begin())), thrust::make_zip_iterator(thrust::make_tuple(da.end(), db.end())), thrust::make_zip_iterator(thrust::make_tuple(dy.begin(), dx.begin())), scan_op());
  thrust::host_vector<mt> hx = dx;
  thrust::copy_n(hx.begin(), 10, std::ostream_iterator<mt>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -std=c++14 t1929.cu -o t1929
$ ./t1929
0.840188,1.24297,2.0385,2.85733,3.79755,4.03307,4.40863,5.22094,5.55093,6.16041,
0.840188,1.24297,2.0385,2.85733,3.79755,4.03307,4.40863,5.22094,5.55093,6.16041,

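The first-order recurrence described by Blelloch allows for a more or less arbitrary a array. In this problem, the a array is simply given by k, k, k, ... We could probably simplify this further by eliminating the a array and replacing it with a thrust::constant_iterator. That exercise is fairly mechanical and is left to the reader.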
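One way the constant_iterator variant might look is sketched below. It is an illustration under assumptions rather than the answer's own code: the names pair_t, recurrence_op and recurrence_on_gpu are hypothetical, the operator is written for a fixed float pair instead of the templated form used above, and the recurrence is seeded with x[0] = b[0].

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/scan.h>
#include <thrust/tuple.h>

typedef thrust::tuple<float, float> pair_t;  // (a, b) pair, hypothetical name

// Associative pair operator: (a_l, b_l) . (a_r, b_r) = (a_l*a_r, b_l*a_r + b_r)
struct recurrence_op {
    __host__ __device__
    pair_t operator()(const pair_t &l, const pair_t &r) const {
        return thrust::make_tuple(
            thrust::get<0>(l) * thrust::get<0>(r),
            thrust::get<1>(l) * thrust::get<0>(r) + thrust::get<1>(r));
    }
};

// Hypothetical helper: x[0] = b[0], x[i] = b[i] + k * x[i-1] for i >= 1.
std::vector<float> recurrence_on_gpu(const std::vector<float> &b, float k) {
    const size_t n = b.size();
    thrust::device_vector<float> db(b.begin(), b.end());
    thrust::device_vector<float> dy(n);  // running products of k, unused afterwards
    thrust::device_vector<float> dx(n);  // the recurrence result
    thrust::constant_iterator<float> a_begin(k);  // stands in for the array k, k, k, ...
    thrust::inclusive_scan(
        thrust::make_zip_iterator(thrust::make_tuple(a_begin, db.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(a_begin + n, db.end())),
        thrust::make_zip_iterator(thrust::make_tuple(dy.begin(), dx.begin())),
        recurrence_op());
    std::vector<float> x(n);
    thrust::copy(dx.begin(), dx.end(), x.begin());
    return x;
}

int main() {
    const std::vector<float> b = {1.0f, 2.0f, 3.0f, 4.0f};
    const float k = 0.5f;
    std::vector<float> x = recurrence_on_gpu(b, k);
    // Serial reference, compared with a tolerance because a parallel scan
    // may reassociate the floating-point operations.
    float ref = b[0];
    for (size_t i = 0; i < b.size(); i++) {
        if (i > 0) ref = b[i] + k * ref;
        assert(std::fabs(x[i] - ref) < 1e-5f);
        std::printf("%g ", x[i]);
    }
    std::printf("\n");
    return 0;
}
```

The dy output is kept as a real vector here to stay close to the example above; whether it can be dropped entirely depends on the Thrust version in use, so that is left as is.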
