I am trying to run an MPI application through a Singularity container on a cluster. I started by testing a simple program, but I am running into trouble.
Here is the test program:
program hello
  include 'mpif.h'
  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  print*, 'node', rank, ': Hello world'
  call MPI_FINALIZE(ierror)
end
I took the following Singularity recipe from https://sylabs.io/guides/3.5/user-guide/mpi.html. I build the sif container locally and then move it to the cluster:
Bootstrap: docker
From: ubuntu:latest

%environment
    export OMPI_DIR=/opt/ompi
    export SINGULARITY_OMPI_DIR=$OMPI_DIR
    export SINGULARITYENV_APPEND_PATH=$OMPI_DIR/bin
    export SINGULARITYENV_APPEND_LD_LIBRARY_PATH=$OMPI_DIR/lib

%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file
    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.0.1
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make install
In my job file I load the matching OpenMPI environment on the cluster:
module load OpenMPI/4.0.1-GCC-8.3.0
singularity exec mpicont.sif bash script
mpirun -np 4 singularity exec mpicont.sif ./here/hello
The first singularity call runs a script that compiles the source file:
export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
export MANPATH=$OMPI_DIR/share/man:$MANPATH
mpif90 -o hello hello.f90
I find that the first call works fine and produces the hello executable, but the mpirun command fails with the following error:
[anode239:04778] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[anode239:04778] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:
Directory: /scratch
Error: Read-only file system
Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 346
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 141
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[anode239:04778] [[11585,1],0] ORTE_ERROR_LOG: Error in file ess_pmi_module.c at line 416
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
What is the source of this error, and how can I fix it?
Assume the project looks like this:
├── build.sh
├── hello.f90
└── mpicont.def
First, create the .sif image file:
sudo singularity build mpicont.sif mpicont.def
This is the script, build.sh, that builds the executable with Singularity:
#!/bin/bash
export PATH=$OMPI_DIR/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
export MANPATH=$OMPI_DIR/share/man:$MANPATH
mpif90 -o hello hello.f90
Create the hello executable:
singularity exec mpicont.sif bash build.sh
The project should now look like this:
├── build.sh
├── hello
├── hello.f90
├── mpicont.def
└── mpicont.sif
Finally, run it:
mpirun -np 4 singularity exec mpicont.sif ./hello
node 2 : Hello world
node 3 : Hello world
node 0 : Hello world
node 1 : Hello world
So this works fine. However, nothing in the example code attempts to create a directory, while the error message specifically says that a mkdir call was denied on a directory under /scratch. Could you share the code that attempts to do this? Based on the information currently in the question, this should work. 🤔