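I am trying to run a program that uses MPI_Comm_spawn to spawn worker processes. If I set the number of processes to spawn to, say, 4, the master process spawns 3 and then crashes with the following error: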
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: sr530-01
PID: 154333
Message: connect() to myipadd:1028 failed
Error: Operation now in progress (115)
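I can always spawn n - 1 workers before the crash. I split the code into two files, one for the master and one for the worker. In the master code I set a variable, worker_count, that determines the number of workers. Whatever value I set, I always get the same error.
Master code: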
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    int worker_count = 3; // Number of worker processes to spawn
    MPI_Comm worker_comm;
    int array_of_errcodes[3]; // Array to store error codes

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) { // Master process
        printf("Master process is running.\n");

        // Define the command and arguments for the worker program
        const char *worker_program = "./worker"; // Path to the worker program executable
        char *worker_argv[] = {"./worker", NULL}; // Arguments for the worker program
        int maxprocs = worker_count; // Number of worker processes to spawn
        MPI_Info info = MPI_INFO_NULL; // No additional info

        // Spawn worker processes
        MPI_Comm_spawn(worker_program, worker_argv, maxprocs, info, 0, MPI_COMM_SELF, &worker_comm, array_of_errcodes);

        // Optionally, you can perform work with the worker processes here

        // Wait for all worker processes to complete
        MPI_Barrier(worker_comm);

        // Disconnect the intercommunicator only once
        if (worker_comm != MPI_COMM_NULL) {
            MPI_Comm_disconnect(&worker_comm);
        }

        printf("Master process is done.\n");
    }

    MPI_Finalize();
    return 0;
}
Worker code:
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) { // Worker processes (rank > 0)
        printf("Worker process %d is running.\n", rank);
        // Perform the work needed by worker processes
        printf("Worker process %d is done.\n", rank);
    }

    MPI_Finalize();
    return 0;
}
This is the full output (stdout and stderr) when I run the master process. In this case I set worker_count to 2:
Master process is running.
Worker process 1 is running.
Worker process 1 is done.
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: sr530-01
PID: 154333
Message: connect() to 0.0.0.0:1028 failed (fake IP address)
Error: Operation now in progress (115)
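Open MPI is not failing in your case. The MPI_Barrier in the master process is waiting on the processes in the communicator, while your child processes have already called MPI_Finalize and exited. If you remove the MPI_Barrier and MPI_Comm_disconnect calls, the program works as expected!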
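If you would rather keep the synchronization than remove it, the workers need to make the matching collective calls on their side of the intercommunicator. Below is a minimal sketch of a worker that does this, assuming the master stays as posted. It is not tested on your cluster, and the variable name parent_comm is my own. It also drops the `rank != 0` guard. Spawned workers get their own MPI_COMM_WORLD with ranks 0 to n - 1, so the worker with rank 0 was most likely running all along and just never printed anything, which would explain why you only ever saw n - 1 workers.

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    MPI_Comm parent_comm; /* hypothetical name; not in the original code */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Intercommunicator back to the spawning process. It is MPI_COMM_NULL
       if this program was started directly instead of via MPI_Comm_spawn. */
    MPI_Comm_get_parent(&parent_comm);

    /* Every spawned rank, including rank 0, is a worker. */
    printf("Worker process %d is running.\n", rank);
    /* Perform the work needed by worker processes */
    printf("Worker process %d is done.\n", rank);

    if (parent_comm != MPI_COMM_NULL) {
        /* Pairs with MPI_Barrier(worker_comm) in the master. */
        MPI_Barrier(parent_comm);
        /* Pairs with MPI_Comm_disconnect(&worker_comm) in the master. */
        MPI_Comm_disconnect(&parent_comm);
    }

    MPI_Finalize();
    return 0;
}
```

Both MPI_Barrier and MPI_Comm_disconnect are collective over the intercommunicator, so they only complete when the master and every worker call them. Separately, array_of_errcodes in the master has a fixed length of 3, while MPI_Comm_spawn writes one entry per requested process. With worker_count above 3 you would need to size the array from worker_count or pass MPI_ERRCODES_IGNORE instead.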