我正在运行以下代码,该代码使用命令rank
将数组从0
1
发送到mpirun -n 2 python -u test_irecv.py > output 2>&1
。
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
asyncr = 1
size_arr = 10000
if comm.Get_rank()==0:
arrs = np.zeros(size_arr)
if asyncr: comm.isend(arrs, dest=1).wait()
else: comm.send(arrs, dest=1)
else:
if asyncr: arrv = comm.irecv(source=0).wait()
else: arrv = comm.recv(source=0)
print('Done!', comm.Get_rank())
与asyncr = 0
在同步模式下运行可提供预期的输出
Done! 0
Done! 1
但是在异步模式下使用asyncr = 1
运行会产生以下错误。我需要知道为什么它在同步模式下可以正常运行,而在异步模式下却不可以。
带有asyncr = 1
的输出:
Done! 0
[nia1477:420871:0:420871] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138)
==== backtrace ====
0 0x0000000000010e90 __funlockfile() ???:0
1 0x00000000000643d1 ompi_errhandler_request_invoke() ???:0
2 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
3 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
4 0x000000000008a8b5 __pyx_pf_6mpi4py_3MPI_7Request_34wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83838
5 0x000000000008a8b5 __pyx_pw_6mpi4py_3MPI_7Request_35wait() /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83813
6 0x00000000000966a3 _PyMethodDef_RawFastCallKeywords() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/call.c:690
7 0x000000000009eeb9 _PyMethodDescr_FastCallKeywords() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/descrobject.c:288
8 0x000000000006e611 call_function() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:4563
9 0x000000000006e611 _PyEval_EvalFrameDefault() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3103
10 0x0000000000177644 _PyEval_EvalCodeWithName() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3923
11 0x000000000017774e PyEval_EvalCodeEx() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3952
12 0x000000000017777b PyEval_EvalCode() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:524
13 0x00000000001aab72 run_mod() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:1035
14 0x00000000001aab72 PyRun_FileExFlags() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:988
15 0x00000000001aace6 PyRun_SimpleFileExFlags() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:430
16 0x00000000001cad47 pymain_run_file() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:425
17 0x00000000001cad47 pymain_run_filename() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:1520
18 0x00000000001cad47 pymain_run_python() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2520
19 0x00000000001cad47 pymain_main() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2662
20 0x00000000001cb1ca _Py_UnixMain() /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2697
21 0x00000000000202e0 __libc_start_main() ???:0
22 0x00000000004006ba _start() /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 420871 on node nia1477 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
版本如下:
3.7.0
3.0.0
mpiexec --version
给出mpiexec (OpenRTE) 3.1.2
mpicc -v
给出icc version 18.0.3 (gcc version 7.3.0 compatibility)
在具有asyncr = 1
的另一个系统中以MPICH
运行时给出以下输出。
Done! 0
Traceback (most recent call last):
File "test_irecv.py", line 14, in <module>
if asyncr: arrv = comm.irecv(source=0).wait()
File "mpi4py/MPI/Request.pyx", line 235, in mpi4py.MPI.Request.wait
File "mpi4py/MPI/msgpickle.pxi", line 411, in mpi4py.MPI.PyMPI_wait
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[23830,1],1]
Exit code: 1
--------------------------------------------------------------------------
[master:01977] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[master:01977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
显然,这是mpi4py
中的已知问题,如https://bitbucket.org/mpi4py/mpi4py/issues/65/mpi_err_truncate-message-truncated-when中所述。利桑德罗·达辛(Lisandro Dalcin)说
irecv()对大消息的实现要求用户传递一个类似缓冲区的对象,该对象足够大以接收腌制的流。这没有记录(与大多数mpi4py一样),甚至是非显而易见的和非Python的...
解决方法是将足够大的预分配bytearray
传递给irecv
。一个有效的示例如下。
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
size_arr = 10000
if comm.Get_rank()==0:
arrs = np.zeros(size_arr)
comm.isend(arrs, dest=1).wait()
else:
arrv = comm.irecv(bytearray(1<<20), source=0).wait()
print('Done!', comm.Get_rank())