Open MPI fails to spawn processes


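I am trying to run a program that spawns worker processes with MPI_Comm_spawn. If I set the number of processes to spawn to 4, for example, the master spawns 3 and then crashes with the following error: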

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: sr530-01
  PID:        154333
  Message:    connect() to myipadd:1028 failed
  Error:      Operation now in progress (115)

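I can always spawn n - 1 workers before the crash. I split the code into two files, one for the master and one for the worker. In the master code I set a variable, worker_count, that determines the number of workers. Whatever value I give it, I always get the same error.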

Master code

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    int worker_count = 3;  // Number of worker processes to spawn
    MPI_Comm worker_comm;
    int array_of_errcodes[3];  // Array to store error codes

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {  // Master process
        printf("Master process is running.\n");

        // Define the command and arguments for the worker program
        const char *worker_program = "./worker";  // Path to the worker program executable
        char *worker_argv[] = {"./worker", NULL};  // Arguments for the worker program
        int maxprocs = worker_count;  // Number of worker processes to spawn
        MPI_Info info = MPI_INFO_NULL;  // No additional info

        // Spawn worker processes
        MPI_Comm_spawn(worker_program, worker_argv, maxprocs, info, 0, MPI_COMM_SELF, &worker_comm, array_of_errcodes);

        // Optionally, you can perform work with the worker processes here

        // Wait for all worker processes to complete
        MPI_Barrier(worker_comm);

        // Disconnect the intercommunicator only once
        if (worker_comm != MPI_COMM_NULL) {
            MPI_Comm_disconnect(&worker_comm);
        }

        printf("Master process is done.\n");
    }

    MPI_Finalize();
    return 0;
}

Worker code

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {  // Worker processes (rank > 0)
        printf("Worker process %d is running.\n", rank);

        // Perform the work needed by worker processes

        printf("Worker process %d is done.\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Here is the full output (standard output plus the error) from running the master, in this case with worker_count set to 2:

Master process is running.
Worker process 1 is running.
Worker process 1 is done.
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: sr530-01
  PID:        154333
  Message:    connect() to 0.0.0.0:1028 failed **fake ip address
  Error:      Operation now in progress (115)
c mpi multiprocess
1 Answer

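Open MPI is not failing in your case. The MPI_Barrier in the master is waiting for the processes in the communicator, while your child processes have already called MPI_Finalize and exited. MPI_Barrier on an intercommunicator is a collective call over both groups, so it can only complete if the workers call it too, and the posted worker code never does. If you remove MPI_Barrier and MPI_Comm_disconnect from the master, the program works as expected!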
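
If you would rather keep the barrier so that the master really waits for its workers, an alternative is to have every worker make the matching collective calls on the parent intercommunicator. The following is only a minimal sketch of that idea, not a tested fix for the TCP warning above. It assumes a single executable that spawns copies of itself through argv[0] (the original uses a separate ./worker binary) and decides its role with MPI_Comm_get_parent. It also sizes the error-code array from worker_count, since MPI_Comm_spawn writes maxprocs entries and a fixed array of 3 is too small when 4 workers are requested.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Comm parent;

    MPI_Init(&argc, &argv);

    /* MPI_COMM_NULL here means this process was not spawned: it is the master. */
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        int worker_count = 4;  /* same name as in the question */
        MPI_Comm worker_comm;
        int *errcodes = malloc(worker_count * sizeof *errcodes);
        if (errcodes == NULL) {
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        printf("Master process is running.\n");

        /* Assumption: argv[0] is a path the runtime can launch on the target host. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, worker_count, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &worker_comm, errcodes);

        /* With the default fatal error handler a failed spawn aborts before
           this loop runs; it matters only if errors are set to return. */
        for (int i = 0; i < worker_count; i++) {
            if (errcodes[i] != MPI_SUCCESS) {
                fprintf(stderr, "Spawn of worker %d failed.\n", i);
            }
        }

        /* Collective over master and workers: matched in the else branch. */
        MPI_Barrier(worker_comm);
        MPI_Comm_disconnect(&worker_comm);

        free(errcodes);
        printf("Master process is done.\n");
    } else {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Worker process %d is running.\n", rank);

        /* Perform the work needed by worker processes here. */

        /* Match the master's barrier and disconnect before finalizing. */
        MPI_Barrier(parent);
        MPI_Comm_disconnect(&parent);
        printf("Worker process %d is done.\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

Note that this sketch lets every worker print, including worker rank 0, whereas the posted worker code skips rank 0 because of its `rank != 0` check. Build it with mpicc and start a single master process, for example with `mpirun -np 1`, assuming the allocation has enough slots for the spawned workers.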
