PyTorch and CUDA extension with OpenMP


I am trying to develop a PyTorch extension with libtorch and OpenMP. When I test my code, it runs well in CPU mode and takes about 1 second to finish all operations:

s = time.time()
adj_matrices = batched_natural_neighbor_edges(x)  # x is a torch.Tensor
print(time.time() - s)

Output:

1.2259256839752197

Everything looks fine and the duration is normal. But after moving the tensor to the GPU with
Tensor.to
and running it again:

x = x.to('cuda')
s = time.time()
adj_matrices = batched_natural_neighbor_edges(x)  # x is a torch.Tensor
print(time.time() - s)

Output:

16.806606769561768

Parallel execution on the GPU does not seem to work.
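(Aside, before trusting the numbers above: CUDA kernels launch asynchronously, so `time.time()` around a GPU call can mis-measure unless the device is synchronized first and after. A minimal sketch with a hypothetical `timed` helper, not from the original code:)

```python
import time
import torch

def timed(fn, *args):
    """Wall-clock a callable. Synchronize before and after when CUDA is
    available, since kernels launch asynchronously and time.time() alone
    may report only the launch time (hypothetical helper for illustration)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.time() - start

# usage sketch:
# adj_matrices, secs = timed(batched_natural_neighbor_edges, x)
```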

My C++ extension using OpenMP is as follows, natural_neighbor.cpp:

// natural_neighbor.cpp
#include <torch/extension.h>
#include <algorithm>
#include <set>
#include <vector>

torch::Tensor cal_adj_matrix(const torch::Tensor &point_cloud, const torch::Tensor &knn_indices)
{
    int point_num = point_cloud.size(0);
    std::vector<std::set<int>> knn_idx(point_num, std::set<int>());
    std::vector<std::set<int>> reversed_knn_idx(point_num, std::set<int>());
    init_knn_and_reverse_knn(knn_indices, knn_idx, reversed_knn_idx, point_num);
    torch::Tensor adj_matrix = torch::zeros({point_num, point_num});
    std::vector<torch::Tensor> targets(point_num);
    std::vector<torch::Tensor> neighbors(point_num);
    /** 
     * here is the code for OpenMP
     **/
#pragma omp target teams distribute parallel for default(none) shared(point_num, knn_idx, reversed_knn_idx, adj_matrix, targets, neighbors)
    for (int i = 0; i < point_num; i++)
    {
        std::vector<int> temp_neighbor_idx;
        std::set_intersection(knn_idx[i].begin(),
                              knn_idx[i].end(),
                              reversed_knn_idx[i].begin(),
                              reversed_knn_idx[i].end(),
                              std::back_inserter(temp_neighbor_idx));
        neighbors[i] = torch::tensor(temp_neighbor_idx);
        targets[i] = torch::full({neighbors[i].size(0)}, i);
        adj_matrix[i][i] = 1;
    }
    torch::Tensor target_idx = torch::cat(targets, -1);
    torch::Tensor neighbor_idx = torch::cat(neighbors, -1);
    adj_matrix = adj_matrix.index_put({target_idx, neighbor_idx}, torch::ones(1));
    return adj_matrix;
}

The extension's setup.py:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    # package name
    name='torch_cpp_extension',
    # module version
    version="0.1",
    # cpp file path & module names
    ext_modules=[
        CppExtension(name='natural_neighbor', sources=['src/natural_neighbor.cpp'],
                     extra_compile_args=['-fopenmp'])  # compilation flag
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)
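(For reference: OpenMP target offloading generally needs more than `-fopenmp`. A hedged sketch, assuming a GCC toolchain built with the nvptx offload backend; the flag names below are real GCC options, but the surrounding values are illustrative:)

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

# Sketch only: '-foffload=nvptx-none' requires a GCC built with the nvptx
# offload backend (e.g. the gcc-offload-nvptx package on Ubuntu). A stock
# gcc without that backend compiles '#pragma omp target' regions to host
# fallback code, so the loop still runs on the CPU. The link step also
# needs '-fopenmp' so libgomp is linked in.
ext = CppExtension(
    name='natural_neighbor',
    sources=['src/natural_neighbor.cpp'],
    extra_compile_args=['-fopenmp', '-foffload=nvptx-none'],
    extra_link_args=['-fopenmp'],
)
```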

In summary:

  1. How can I use OpenMP on the GPU? Is there a necessary compile flag I forgot?
  2. Is the OpenMP pragma in my code correct?
  3. Is there a better way to speed up the
    for
    loop when iterating over tensors?

My environment:

  • OS: Ubuntu 18.04.2 LTS (GNU/Linux 5.4.0-90-generic x86_64)
  • Compiler: I am not sure which compiler is used. The build is done with
    torch.utils.cpp_extension
    via the command
    python setup.py install
    . The compiler on the system is gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Tags: c++, pytorch, openmp, libtorch
1 Answer

My guess is that your OpenMP region essentially runs on the CPU and the data is then sent back to the GPU, causing the slowdown. Double-check that the offloading is actually working.

A second possibility is that std operations such as std::set_intersection are slow on the GPU; I am not sure how they get compiled for the GPU. Again, you may be moving data to the CPU, which slows you down.
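Regarding question 3: instead of offloading the per-point loop, the mutual-kNN adjacency can be computed with tensor operations that stay on whatever device the input lives on. A sketch, assuming `knn_indices` has shape `[N, k]` (the function name is illustrative, not from the original code):

```python
import torch

def mutual_knn_adjacency(knn_indices: torch.Tensor) -> torch.Tensor:
    """Tensorized sketch of what the C++ loop builds: adj[i][j] = 1 when
    j is among i's k nearest neighbors AND i is among j's (the intersection
    of the kNN and reverse-kNN sets), plus self-loops on the diagonal."""
    n = knn_indices.size(0)
    device = knn_indices.device
    # one-hot kNN membership: knn[i][j] = 1 iff j is in knn_indices[i]
    knn = torch.zeros(n, n, device=device)
    rows = torch.arange(n, device=device).unsqueeze(1).expand_as(knn_indices)
    knn[rows.reshape(-1), knn_indices.reshape(-1)] = 1
    # mutual neighbors = kNN membership AND its transpose; then self-loops
    adj = knn * knn.t()
    adj.fill_diagonal_(1)
    return adj
```

Because every step is a native tensor op, moving `knn_indices` to CUDA keeps the whole computation on the GPU with no host round-trips, at the cost of materializing an N-by-N matrix (which `cal_adj_matrix` allocates anyway).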
