MPI memory corruption only on certain core counts

Problem description

For some background: I'm parallelizing a basic PDE solver with MPI. The program takes a domain and assigns each processor a grid covering a portion of it. If I run on one core or four cores, the program runs fine. However, if I run on two or three cores, I get a core dump like the one below.

*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000018bd540 ***
======= Backtrace: =========
*** Error in `MeshTest': corrupted size vs. prev_size: 0x00000000022126e0 ***
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc1a63f77e5]
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x80dfb)[0x7fc1a6400dfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca753f77e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fc1a640453c]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7fca753fe9dc]
/usr/lib/libmpi.so.12(+0x25919)[0x7fc1a6d25919]
/lib/x86_64-linux-gnu/libc.so.6(+0x80678)[0x7fca75400678]
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x52a9)[0x7fc198fe52a9]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca7540453c]
/usr/lib/libmpi.so.12(ompi_mpi_finalize+0x412)[0x7fc1a6d41a22]
/usr/lib/libmpi.so.12(+0x25919)[0x7fca75d25919]
MeshTest(_ZN15MPICommunicator7cleanupEv+0x26)[0x422e70]
/usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x4381)[0x7fca68844381]
MeshTest(main+0x364)[0x41af2a]
/usr/lib/libopen-pal.so.13(mca_base_component_close+0x19)[0x7fca74c88fe9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc1a63a0830]
/usr/lib/libopen-pal.so.13(mca_base_components_close+0x42)[0x7fca74c89062]
MeshTest(_start+0x29)[0x41aaf9]
/usr/lib/libmpi.so.12(+0x7d3b4)[0x7fca75d7d3b4]
======= Memory map: ========
<insert core dump>

I've traced the source of the error to the point where I create a new grid.

Result Domain::buildGrid(unsigned int shp[2], pair2<double> &bounds){
  // ... Unrelated code ...

  // grid is already allocated and needs to be cleared.
  delete grid;
  grid = new Grid(bounds, shp, nghosts);
  return SUCCESS;
}

Grid::Grid(const pair2<double>& bounds, unsigned int sz[2], unsigned int nghosts){
  // ... Code unrelated to memory allocation ...

  // Construct the grid. Start by adding ghost points.
  shp[0] = sz[0] + 2*nghosts;
  shp[1] = sz[1] + 2*nghosts;
  try{
    points[0] = new double[shp[0]];
    points[1] = new double[shp[1]];
    for(int i = 0; i < shp[0]; i++){
      points[0][i] = grid_bounds[0][0] + (i - (int)nghosts)*dx;
    }
    for(int j = 0; j < shp[1]; j++){
      points[1][j] = grid_bounds[1][0] + (j - (int)nghosts)*dx;
    }
  }
  catch(std::bad_alloc& ba){
    std::cout << "Failed to allocate memory for grid.\n";
    shp[0] = 0;
    shp[1] = 0;
    dx = 0;
    points[0] = NULL;
    points[1] = NULL;
  }
}

Grid::~Grid(){
  delete[] points[0];
  delete[] points[1];
}

As far as I can tell, my MPI implementation is fine, and all of the MPI-dependent functionality runs correctly. The Domain class also seems to work as intended. I assume something is illegally accessing memory outside its own scope, but I have no idea where; at this point the code really does nothing more than initialize MPI, load a few parameters, set up the grid (the only memory access happens during its construction), then call MPI_Finalize() and return.
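Schematically, the driver amounts to the following (a simplified, self-contained sketch rather than the literal code; the real program goes through an MPICommunicator wrapper, which is what appears in the backtrace):

#include <mpi.h>

// Simplified sketch of the call order (illustrative only): initialize MPI,
// read parameters, build the per-rank grid, then shut MPI down. The crash
// fires inside MPI_Finalize(), as the backtrace above shows.
int main(int argc, char* argv[]){
  MPI_Init(&argc, &argv);

  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // ... load parameters and decide this rank's share of the domain ...
  // ... Domain::buildGrid() -- the only heap allocation before shutdown ...

  MPI_Finalize();   // glibc aborts with "corrupted size vs. prev_size" in here
  return 0;
}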

c++ mpi
1 Answer

It turns out there was a typo in my Grid constructor while allocating the points: the second loop read points[0][j] = ... instead of points[1][j]. When I copied the code into my post I somehow spotted and corrected the error, but not in my actual code. The bug only showed up in the 2- and 3-core runs because the grids in the 1- and 4-core runs were perfectly square, so shp[0] equals shp[1]. Thanks everyone for the suggestions. I feel a bit embarrassed now that I see it was something so simple.
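For anyone who lands here with the same message, the failure mode boils down to the following standalone sketch (not the project code; the names mirror the constructor above). With a square global domain, a 2- or 3-way split leaves shp[0] != shp[1], and that is exactly when the stray points[0][j] write runs past the end of its allocation:

// Standalone illustration of the bug: the second loop indexes points[0]
// (length shp[0]) with j running up to shp[1]. When shp[1] > shp[0] the
// writes trample the heap metadata of the neighbouring block, and glibc
// reports "corrupted size vs. prev_size" on a later free().
int main(){
  unsigned int shp[2] = {8, 12};   // non-square local grid, as in the 2/3-core runs
  double* points[2];
  points[0] = new double[shp[0]];
  points[1] = new double[shp[1]];

  for(unsigned int i = 0; i < shp[0]; i++){
    points[0][i] = 0.0;
  }
  for(unsigned int j = 0; j < shp[1]; j++){
    points[0][j] = 0.0;            // BUG: should be points[1][j]
  }

  delete[] points[0];              // may abort here or later, once the corruption is noticed
  delete[] points[1];
  return 0;
}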
