I am trying to develop a PyTorch extension with libtorch and OpenMP. When I test my code, it runs fine in CPU mode and takes about 1 second to finish everything:
s = time.time()
adj_matrices = batched_natural_neighbor_edges(x) # x is a tensor from torch.Tensor
print(time.time() - s)
Output:
1.2259256839752197
Everything looks fine and the runtime is reasonable. But after I use torch.to to move the tensor to the GPU and run the same call again:
x = x.to('cuda')
s = time.time()
adj_matrices = batched_natural_neighbor_edges(x) # x is a tensor from torch.Tensor
print(time.time() - s)
Output:
16.806606769561768
It seems the parallel computation does not work on the GPU.
My C++ extension code using OpenMP is below, natural_neighbor.cpp:
torch::Tensor cal_adj_matrix(const torch::Tensor &point_cloud, const torch::Tensor &knn_indices)
{
int point_num = point_cloud.size(0);
std::vector<std::set<int>> knn_idx(point_num, std::set<int>());
std::vector<std::set<int>> reversed_knn_idx(point_num, std::set<int>());
init_knn_and_reverse_knn(knn_indices, knn_idx, reversed_knn_idx, point_num);
torch::Tensor adj_matrix = torch::zeros({point_num, point_num});
std::vector<torch::Tensor> targets(point_num);
std::vector<torch::Tensor> neighbors(point_num);
/**
* here is the code for OpenMP
**/
#pragma omp target teams distribute parallel for default(none) shared(point_num, knn_idx, reversed_knn_idx, adj_matrix, targets, neighbors)
for (int i = 0; i < point_num; i++)
{
std::vector<int> temp_neighbor_idx;
std::set_intersection(knn_idx[i].begin(),
knn_idx[i].end(),
reversed_knn_idx[i].begin(),
reversed_knn_idx[i].end(),
std::back_inserter(temp_neighbor_idx));
neighbors[i] = torch::tensor(temp_neighbor_idx);
targets[i] = torch::full(neighbors[i].size(0), i);
adj_matrix[i][i] = 1;
}
torch::Tensor target_idx = torch::cat(targets, -1);
torch::Tensor neighbor_idx = torch::cat(neighbors, -1);
adj_matrix = adj_matrix.index_put({target_idx, neighbor_idx}, torch::ones(1));
return adj_matrix;
}
setup.py for the cpp extension:
setup(
# package name
name='torch_cpp_extension',
# module version
version="0.1",
# cpp file path & module names
ext_modules=[
CppExtension(name='natural_neighbor', sources=['src/natural_neighbor.cpp'],
extra_compile_args=['-fopenmp'])  # here is the compilation flag
],
cmdclass={
'build_ext': BuildExtension
}
)
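Summary: how can I get the for iteration to run in parallel on the GPU?
My environment: the build is done with torch.utils.cpp_extension; I only run the command python setup.py install. The other compiler on the system is gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0.

My guess is that your OpenMP code basically sends the data to the CPU and then back to the GPU, which causes the slowdown. Double-check that the integration is actually working.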
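One way to double-check the integration is to ask OpenMP itself whether a target region leaves the host. The sketch below is only a diagnostic under assumptions: the helper names target_region_is_offloaded and offload_device_count are hypothetical and not part of the original post, and it relies on the standard OpenMP runtime calls omp_is_initial_device and omp_get_num_devices. As far as I know, a compiler that was built without offloading support simply runs target regions on the host, which would mean the loop in the question never reaches the GPU.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical diagnostic helper (not from the original post).
// Returns true only if an "omp target" region really executed on an
// offload device rather than falling back to the host.
bool target_region_is_offloaded()
{
    int on_host = 1; // assume host until the target region reports otherwise
#ifdef _OPENMP
    #pragma omp target map(tofrom: on_host)
    {
        // Evaluates to non-zero when the region is running on the host.
        on_host = omp_is_initial_device();
    }
#endif
    return on_host == 0;
}

// Hypothetical helper: number of offload devices the OpenMP runtime can see.
// Without OpenMP support compiled in, report zero.
int offload_device_count()
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}
```

If offload_device_count() reports 0, or target_region_is_offloaded() returns false, the target pragma is probably not doing any GPU work in this build, and the slowdown more likely comes from the per-element tensor operations on a CUDA tensor.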
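The second option is that the std operations, such as std::set_intersection, are slow on the GPU; I am not sure how this gets compiled for the GPU. Again, you may be moving things to the CPU, which slows you down.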
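If the std containers have to stay on the CPU anyway, one possible restructuring is to keep the loop purely host-side and touch no tensors inside it. The following is a minimal sketch, assuming the kNN and reverse-kNN sets are already built as in the question; the function name mutual_neighbor_pairs is hypothetical, and a plain "omp parallel for" replaces the target pragma from the question.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): for every point i,
// intersect its kNN set with its reverse-kNN set and record (i, j) pairs.
// Only plain C++ containers are used inside the loop, and each iteration
// writes to its own slot, so a host-side parallel for is safe here.
std::pair<std::vector<int64_t>, std::vector<int64_t>>
mutual_neighbor_pairs(const std::vector<std::set<int>> &knn_idx,
                      const std::vector<std::set<int>> &reversed_knn_idx)
{
    const int point_num = static_cast<int>(knn_idx.size());
    std::vector<std::vector<int>> per_point(point_num);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < point_num; i++)
    {
        std::set_intersection(knn_idx[i].begin(), knn_idx[i].end(),
                              reversed_knn_idx[i].begin(), reversed_knn_idx[i].end(),
                              std::back_inserter(per_point[i]));
    }

    // Flatten sequentially so the output order is deterministic.
    std::vector<int64_t> targets, neighbors;
    for (int i = 0; i < point_num; i++)
    {
        for (int j : per_point[i])
        {
            targets.push_back(i);
            neighbors.push_back(j);
        }
    }
    return {targets, neighbors};
}
```

In the extension, the two flat vectors could then be converted into index tensors once, after the loop, and moved to the input's device before the single index_put call. That avoids creating many small tensors and writing adj_matrix[i][i] element by element, which is likely expensive when the data lives on the GPU.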