mpi4py irecv causes a segmentation fault

Question

I am running the following code, which sends an array from rank 0 to rank 1, using the command mpirun -n 2 python -u test_irecv.py > output 2>&1

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
asyncr = 1
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    if asyncr: comm.isend(arrs, dest=1).wait()
    else: comm.send(arrs, dest=1)
else:
    if asyncr: arrv = comm.irecv(source=0).wait()
    else: arrv = comm.recv(source=0)

print('Done!', comm.Get_rank())

Running in synchronous mode with asyncr = 0 gives the expected output:

Done! 0
Done! 1

However, running in asynchronous mode with asyncr = 1 produces the error below. I would like to know why it works in synchronous mode but not in asynchronous mode.

Output with asyncr = 1:

Done! 0
[nia1477:420871:0:420871] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000643d1 ompi_errhandler_request_invoke()  ???:0
 2 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 3 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 4 0x000000000008a8b5 __pyx_pf_6mpi4py_3MPI_7Request_34wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83838
 5 0x000000000008a8b5 __pyx_pw_6mpi4py_3MPI_7Request_35wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83813
 6 0x00000000000966a3 _PyMethodDef_RawFastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/call.c:690
 7 0x000000000009eeb9 _PyMethodDescr_FastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/descrobject.c:288
 8 0x000000000006e611 call_function()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:4563
 9 0x000000000006e611 _PyEval_EvalFrameDefault()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3103
10 0x0000000000177644 _PyEval_EvalCodeWithName()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3923
11 0x000000000017774e PyEval_EvalCodeEx()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3952
12 0x000000000017777b PyEval_EvalCode()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:524
13 0x00000000001aab72 run_mod()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:1035
14 0x00000000001aab72 PyRun_FileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:988
15 0x00000000001aace6 PyRun_SimpleFileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:430
16 0x00000000001cad47 pymain_run_file()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:425
17 0x00000000001cad47 pymain_run_filename()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:1520
18 0x00000000001cad47 pymain_run_python()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2520
19 0x00000000001cad47 pymain_main()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2662
20 0x00000000001cb1ca _Py_UnixMain()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2697
21 0x00000000000202e0 __libc_start_main()  ???:0
22 0x00000000004006ba _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 420871 on node nia1477 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The versions are as follows:

  • Python: 3.7.0
  • mpi4py: 3.0.0
  • mpiexec --version gives mpiexec (OpenRTE) 3.1.2
  • mpicc -v gives icc version 18.0.3 (gcc version 7.3.0 compatibility)

Running with MPICH on another system, again with asyncr = 1, gives the following output:

Done! 0
Traceback (most recent call last):
  File "test_irecv.py", line 14, in <module>
    if asyncr: arrv = comm.irecv(source=0).wait()
  File "mpi4py/MPI/Request.pyx", line 235, in mpi4py.MPI.Request.wait
  File "mpi4py/MPI/msgpickle.pxi", line 411, in mpi4py.MPI.PyMPI_wait
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23830,1],1]
  Exit code:    1
--------------------------------------------------------------------------
[master:01977] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[master:01977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
openmpi mpich mpi4py
1 Answer

Apparently this is a known issue in mpi4py, as described in https://bitbucket.org/mpi4py/mpi4py/issues/65/mpi_err_truncate-message-truncated-when. Lisandro Dalcin says:

The implementation of irecv() for large messages requires users to pass a buffer-like object large enough to receive the pickled stream. This is not documented (as most of mpi4py), and it is even non-obvious and unpythonic...

The workaround is to pass a sufficiently large pre-allocated bytearray to irecv. A working example is shown below.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    comm.isend(arrs, dest=1).wait()
else:
    # Pre-allocated 1 MiB receive buffer, large enough for the pickled array.
    arrv = comm.irecv(bytearray(1<<20), source=0).wait()

print('Done!', comm.Get_rank())
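
If a safe upper bound for the buffer is not obvious, a variation on the same idea is to communicate the size of the pickled payload first and then post irecv with a buffer of exactly that size. The sketch below is only an illustration, not part of the accepted workaround: it assumes that MPI.pickle.dumps (mpi4py's own serializer) yields the same byte stream that isend transmits, so the advertised length is sufficient for the receive buffer.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank() == 0:
    arrs = np.zeros(size_arr)
    # Tell the receiver how many bytes the pickled message occupies.
    nbytes = len(MPI.pickle.dumps(arrs))
    comm.send(nbytes, dest=1, tag=0)
    comm.isend(arrs, dest=1, tag=1).wait()
else:
    # Allocate a receive buffer of exactly the advertised size.
    nbytes = comm.recv(source=0, tag=0)
    arrv = comm.irecv(bytearray(nbytes), source=0, tag=1).wait()

print('Done!', comm.Get_rank())

The extra blocking send of the size is cheap compared with the payload itself, and it avoids guessing a fixed buffer size such as 1<<20.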