Running an MPI program in a Singularity container on a cluster


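I am trying to run an MPI application through a Singularity container on a cluster. I started by testing a simple program, but I am running into trouble.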

Here is the test program:

program hello
include 'mpif.h'
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
   
   call MPI_INIT(ierror)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   print*, 'node', rank, ': Hello world'
   call MPI_FINALIZE(ierror)

end

I took the following Singularity recipe from https://sylabs.io/guides/3.5/user-guide/mpi.html. I build the .sif container locally and then move it to the cluster:

Bootstrap: docker
From: ubuntu:latest

%environment
    export OMPI_DIR=/opt/ompi
    export SINGULARITY_OMPI_DIR=$OMPI_DIR
    export SINGULARITYENV_APPEND_PATH=$OMPI_DIR/bin
    export SINGULAIRTYENV_APPEND_LD_LIBRARY_PATH=$OMPI_DIR/lib

%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file

    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.0.1
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make install

In the job file, I load the matching Open MPI environment on the cluster:

module load OpenMPI/4.0.1-GCC-8.3.0
singularity exec mpicont.sif bash script
mpirun -np 4 singularity exec mpicont.sif ./here/hello

The first singularity call runs the file named script, which sets up the environment and compiles the program:

export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
export MANPATH=$OMPI_DIR/share/man:$MANPATH

mpif90 -o hello hello.f90

The first call works fine and produces the executable hello, but the mpirun command fails with the following error:

[anode239:04778] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[anode239:04778] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /scratch
  Error:     Read-only file system

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 346
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 141
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file ess_pmi_module.c at line 416
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------

What is the root cause of this error, and how can I fix it?

1 Answer

Assume the project looks like this:

├── build.sh
├── hello.f90
└── mpicont.def

First, create the .sif image file.

sudo singularity build mpicont.sif mpicont.def
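Before going further, it may be worth confirming that the Open MPI inside the image lines up with the one on the host, since the question loads OpenMPI/4.0.1-GCC-8.3.0 and the recipe builds 4.0.1 under /opt/ompi. The following is only a sketch: the helper names ompi_version and same_series are made up for illustration, and treating a matching major.minor as "close enough" is an assumption, not a documented rule.

```shell
#!/bin/bash
# Hypothetical helper: read `mpirun --version` output on stdin and print the
# version number. For Open MPI the first line looks like "mpirun (Open MPI) 4.0.1".
ompi_version() {
    awk 'NR == 1 { print $NF }'
}

# Hypothetical helper: succeed only if the major.minor parts of two version
# strings agree (assumption: this is a reasonable proxy for compatibility).
same_series() {
    [ "${1%.*}" = "${2%.*}" ]
}

# Intended usage on the cluster (not executed here):
#   host_ver=$(mpirun --version | ompi_version)
#   cont_ver=$(singularity exec mpicont.sif /opt/ompi/bin/mpirun --version | ompi_version)
#   same_series "$host_ver" "$cont_ver" || echo "Open MPI mismatch: $host_ver vs $cont_ver"
```

The container call uses the full path /opt/ompi/bin/mpirun because the recipe installs Open MPI there and the container's PATH may not include it.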

Here is the script, build.sh, that builds the executable with Singularity:

#!/bin/bash

export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
export MANPATH=$OMPI_DIR/share/man:$MANPATH

mpif90 -o hello hello.f90

Create the hello executable.

singularity exec mpicont.sif bash build.sh

The project should now look like this:

├── build.sh
├── hello
├── hello.f90
├── mpicont.def
└── mpicont.sif

Finally, run it.

mpirun -np 4 singularity exec mpicont.sif ./hello
 node           2 : Hello world
 node           3 : Hello world
 node           0 : Hello world
 node           1 : Hello world

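So that works fine. But nothing in the sample code tries to create a directory, and the error message specifically says that a mkdir call was refused on the directory /scratch. Could you share the code that attempts this? Based on the information currently in the question, this should work. 🤔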
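One possible explanation, offered as a guess rather than a diagnosis: Open MPI creates a per-job session directory at startup (the failing calls are in util/session_dir.c), and it normally takes the base location from the environment, typically TMPDIR. If the cluster sets TMPDIR to a path under /scratch and that path is not bound into the container, the mkdir would fail inside the container even though hello.f90 never creates a directory itself. The sketch below checks whether a directory can be created and written to; the helper name can_write_dir and the file name check.sh are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report whether a directory can be created and written to.
# Meant to be run inside the container against the session-directory base.
can_write_dir() {
    local dir="$1"
    if mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# Intended usage (not executed here), assuming this file is saved as check.sh:
#   singularity exec mpicont.sif bash -c '. ./check.sh; can_write_dir "${TMPDIR:-/tmp}"'
```

If the directory turns out not to be writable, two things to try are binding the path into the container, for example mpirun -np 4 singularity exec --bind /scratch mpicont.sif ./hello, or setting TMPDIR to a location that exists inside the container, such as /tmp, before calling mpirun. Whether either applies depends on how the cluster is configured.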
