MPI_Bcast on a large char array returns exit code 139


I would appreciate your help with the following situation.

I broadcast a char array (on localhost) in two consecutive steps:

  1. MPI_Bcast the size of the array

  2. MPI_Bcast the array itself

This is done via dynamic process spawning. The data communication works fine until the size of the array exceeds (roughly) 8,375,000 elements. That is about 8.375 MB of data, which seems rather small according to the available documentation. From what I have read in other posts, MPI supports up to 2^31 elements. Beyond 8,375,000 elements I get an MPI error with EXIT CODE: 139.
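(For context, a condensed sketch of how the two sides of this pattern fit together; the full snippets follow further below. It reuses worker_path, dynamic_procs and intercomm from those snippets, uses MPI_INFO_NULL in place of the info object, and assumes the worker obtains its side of the inter-communicator with MPI_Comm_get_parent.)

    // Master side (sketch): spawn the workers and keep the resulting
    // inter-communicator. The master passes MPI_ROOT as the root argument
    // of collectives on intercomm, since it is the sending side.
    MPI_Comm intercomm;
    MPI_Comm_spawn(worker_path.string().c_str(), MPI_ARGV_NULL, dynamic_procs,
                   MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    // Worker side (sketch): the matching inter-communicator is the parent
    // communicator; workers pass root = 0 (the master's rank in the remote group).
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);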

To investigate, I ran the code under valgrind. The summary does not indicate anything alarming, but I do get various MPI-related errors starting with Syscall param writev(vector[...]) points to uninitialised byte(s). Here is the tail of the log.

...
==15125== Syscall param writev(vector[...]) points to uninitialised byte(s)
==15125==    at 0x5B83327: writev (writev.c:26)
==15125==    by 0x8978FF1: MPL_large_writev (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x8961D4B: MPID_nem_tcp_iStartContigMsg (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x8939E15: MPIDI_CH3_RndvSend (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x895DD69: MPID_nem_lmt_RndvSend (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x8945FE9: MPID_Send (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x88B1D84: MPIC_Send (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x886EC08: MPIR_Bcast_inter_remote_send_local_bcast (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x87C28F2: MPIR_Bcast_impl (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x87C3183: PMPI_Bcast (in /usr/lib/libmpi.so.12.1.6)
==15125==    by 0x50B9CF5: QuanticBoost::Calculators::Exposures::Mpi::dynamic_mpi_master(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, QuanticBoost::WorkflowContext&) (in /home/ubuntu/Documents/code/quanticboostnew-build/release/lib/libCppLib.so)
==15125==    by 0x5140CBA: QuanticBoost::MpiExposureSpawnTask::execute(QuanticBoost::WorkflowContext&) (in /home/ubuntu/Documents/code/quanticboostnew-build/release/lib/libCppLib.so)
==15125==  Address 0x1ffefff524 is on thread 1's stack
==15125==  in frame #3, created by MPIDI_CH3_RndvSend (???:)
==15125==  Uninitialised value was created by a stack allocation
==15125==    at 0x8939D70: MPIDI_CH3_RndvSend (in /usr/lib/libmpi.so.12.1.6)
==15125== 
==15125== 
==15125== HEAP SUMMARY:
==15125==     in use at exit: 184 bytes in 6 blocks
==15125==   total heap usage: 364,503 allocs, 364,497 frees, 204,665,377 bytes allocated
==15125== 
==15125== LEAK SUMMARY:
==15125==    definitely lost: 0 bytes in 0 blocks
==15125==    indirectly lost: 0 bytes in 0 blocks
==15125==      possibly lost: 0 bytes in 0 blocks
==15125==    still reachable: 184 bytes in 6 blocks
==15125==         suppressed: 0 bytes in 0 blocks
==15125== Reachable blocks (those to which a pointer was found) are not shown.
==15125== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==15125== 
==15125== For counts of detected and suppressed errors, rerun with: -v
==15125== ERROR SUMMARY: 15 errors from 10 contexts (suppressed: 0 from 0)

Can you help me interpret the valgrind errors and resolve the MPI failure with exit code 139?

Below I share minimal snippets of the master and worker code, as well as the error output.

Code snippet (master):

    std::cout << "Spawning " << dynamic_procs << " " << worker_path.string() << std::endl;

    MPI_Comm_spawn( worker_path.string().c_str(),
                    MPI_ARGV_NULL,
                    dynamic_procs,
                    info,
                    0,
                    MPI_COMM_SELF,        //intra-communicator
                    &intercomm,           //inter-communicator
                    MPI_ERRCODES_IGNORE);

    std::cout << "\n________________ MASTER: MPI spawning starts _________________ \n" << std::endl;

    // I normally send the size of the char array in the 1st Bcast
    // and the array itself in a 2nd Bcast
    //
    // but MPI starts failing somewhere beyond 8.375e6 elements,
    // though I expect that to happen only after 2^31 array elements, or not???

    // I test the limits of the array size by overriding it manually
    int in_str_len = 8.375e6;    // Up to this size it all works
    //int in_str_len = 8.376e6;  // This does NOT work
    //int in_str_len = 8.3765e6; // This does NOT work, and so on

    MPI_Bcast( &in_str_len,  // void* data
               1,            // int count
               MPI_INT,      // MPI_Datatype datatype
               MPI_ROOT,     // int root: use MPI_ROOT, not a self-chosen root!
               intercomm );  // MPI_Comm communicator

    // Initialize a test buffer
    std::string s(in_str_len, 'x');  // It works
    //char d[in_str_len+1];          // It works

    /*
     * The 2nd MPI_Bcast sends the data to all nodes
     */
    MPI_Bcast( s.data(),     // void* data
               in_str_len,   // int count
               MPI_BYTE,     // MPI_Datatype datatype (MPI_BYTE and MPI_CHAR both work)
               MPI_ROOT,     // int root: use MPI_ROOT, not a self-chosen root!
               intercomm );  // MPI_Comm communicator

Code snippet (worker):

    std::cout << "I am in a spawned process " << rank << "/" << dynamic_procs
              << " from host " << name << std::endl;

    int in_str_len;

    // Receive stream size
    MPI_Bcast( &in_str_len,  // void* data
               1,            // int count
               MPI_INT,      // MPI_Datatype datatype
               0,            // int root
               parent );     // MPI_Comm: communicator with the parent (not MPI_COMM_WORLD)

    std::cout << "1st MPI_Bcast received len: " << in_str_len * 1e-6 << "Mb" << std::endl;

    MPI_Barrier(MPI_COMM_WORLD);  // Tested with and without the barrier

    char data[in_str_len+1];
    std::cout << "Create char array for 2nd MPI_Bcast with length: " << in_str_len << std::endl;

    MPI_Bcast( data,         // void* data
               in_str_len,   // int count
               MPI_BYTE,     // MPI_Datatype datatype
               0,            // int root
               parent );     // MPI_Comm: communicator with the parent (not MPI_COMM_WORLD)

    std::cout << "2nd MPI_Bcast received data: " << sizeof(data) << std::endl;

Error received with a large array:

    Spawning 3 /home/ubuntu/Documents/code/build/release/bin/mpi_worker

    ________________ MASTER: MPI spawning starts _________________

    I am in a spawned process 1/3 from host ip-172-31-30-254
    I am in a spawned process 0/3 from host ip-172-31-30-254
    I am in a spawned process 2/3 from host ip-172-31-30-254
    1st MPI_Bcast received len: 8.3765Mb
    1st MPI_Bcast received len: 8.3765Mb
    1st MPI_Bcast received len: 8.3765Mb
    Create char array for 2nd MPI_Bcast with length: 8376500
    Create char array for 2nd MPI_Bcast with length: 8376500
    Create char array for 2nd MPI_Bcast with length: 8376500

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   PID 9690 RUNNING AT localhost
    =   EXIT CODE: 139
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================

PS: Please let me know if you need any extra information or further edits to my post.

arrays c++11 char mpi valgrind
1 Answer

First, it says something about uninitialized data. That is real: you are sending an array that has not been filled in. I will spare you the technical details, but you can google "page instantiation". Basically, if you don't initialize it, the memory does not exist yet.
But the real problem is this: "Uninitialised value was created by a stack allocation". You create the array with char data[somethingbig]. That is:

  1. not in the language standard (actually, it was introduced in C99 and then made optional again in C11; basically: don't do it), and
  2. data on the stack, and the size of your stack is limited.

So: for any array of macroscopic size, use malloc.

Oh, and you tagged this "c++". In that case, just use std::vector for your large arrays. There is no reason to use anything else, and as you can see, plenty of people still don't.
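For illustration, a minimal sketch of what the worker's receive path could look like with a heap-allocated buffer instead of the variable-length array; receive_payload is a hypothetical helper, and parent is assumed to be the inter-communicator the worker gets from MPI_Comm_get_parent, as in the question's snippets:

    #include <mpi.h>
    #include <vector>

    // Hypothetical helper: receive the broadcast payload into a std::vector,
    // so the buffer lives on the heap instead of the limited stack.
    std::vector<char> receive_payload(MPI_Comm parent)
    {
        int in_str_len = 0;
        MPI_Bcast(&in_str_len, 1, MPI_INT, 0, parent);            // 1st Bcast: the size
        std::vector<char> data(in_str_len);                       // heap allocation
        MPI_Bcast(data.data(), in_str_len, MPI_BYTE, 0, parent);  // 2nd Bcast: the payload
        return data;
    }

Because the vector lives on the heap, its size is limited by available memory rather than by the stack limit.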

    
