How can I run Open MPI under Slurm?

Question (votes: 2, answers: 2)

I am unable to run Open MPI under Slurm through a Slurm script.

Normally, I can obtain the hostname and run Open MPI on my machine:

$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works

However, if I do the same thing through a Slurm script, mpirun hostname returns an empty string, and consequently I cannot run mpirun -n 1 bin/ua.B.x inputua.data.

slurm-script.sh:

#!/bin/bash
#SBATCH -o slurm.out        # STDOUT
#SBATCH -e slurm.err        # STDERR
#SBATCH --mail-type=ALL

export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/ 
make ua CLASS=B 
mpirun --host myHost -n 1 bin/ua.B.x inputua.data

$ sbatch -N1 slurm-script.sh
Submitted batch job 1

The error I receive:

There are no allocated resources for the application
  bin/ua.B.x
that match the requested mapping:    
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.

A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------
openmpi slurm sbatch
2 Answers
1
vote

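If Slurm and Open MPI are recent versions, make sure that Open MPI was compiled with Slurm support (run ompi_info | grep slurm to find out) and run srun bin/ua.B.x inputua.data in your submission script.

Alternatively, mpirun bin/ua.B.x inputua.data should also work.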
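The check above can be scripted so that it gives a clear yes/no answer. This is a minimal sketch: has_slurm_support is a hypothetical helper name, and it only looks for the word "slurm" in the ompi_info output, which is the same heuristic as the grep suggested above, not a full capability test.

```shell
#!/bin/bash
# has_slurm_support TEXT: succeed if TEXT (the output of ompi_info) mentions slurm.
# Hypothetical helper; it mirrors "ompi_info | grep slurm" and nothing more.
has_slurm_support() {
    printf '%s\n' "$1" | grep -qi 'slurm'
}

if command -v ompi_info >/dev/null 2>&1; then
    if has_slurm_support "$(ompi_info)"; then
        echo "Slurm components found: try 'srun bin/ua.B.x inputua.data'"
    else
        echo "No Slurm components found: use mpirun with --hostfile"
    fi
else
    echo "ompi_info not found on PATH"
fi
```

Passing the ompi_info text as an argument keeps the helper usable on saved output from another machine.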

If Open MPI was compiled without Slurm support, the following should work:

srun hostname > output.txt
cd NPB3.3-SER/ 
make ua CLASS=B 
mpirun --hostfile output.txt -n 1 bin/ua.B.x inputua.data

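Also make sure that by running export LD_LIBRARY_PATH="/usr/lib/openmpi/lib" you do not overwrite other library paths that are still needed. It is probably better to use export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/openmpi/lib" (or a more complex version, if you want to avoid a leading : when the variable was initially empty).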
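The difference between the simple and the "more complex" form can be checked without Open MPI installed. A minimal sketch, where append_lib_path is a hypothetical helper name used only for illustration:

```shell
#!/bin/bash
# append_lib_path CURRENT DIR: print CURRENT with DIR appended.
# ${1:+$1:} expands to "CURRENT:" only when CURRENT is non-empty,
# so an empty starting value does not produce a leading colon.
append_lib_path() {
    printf '%s\n' "${1:+$1:}$2"
}

# Naive form: yields a leading ':' when the variable starts out empty.
empty=""
echo "naive:   ${empty}:/usr/lib/openmpi/lib"
# Guarded form: no leading ':'.
echo "guarded: $(append_lib_path "$empty" /usr/lib/openmpi/lib)"
```

In a job script the same expansion can be used inline: export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib/openmpi/lib".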


0
votes

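What you need is: 1) to run mpirun, 2) from slurm, 3) with --host. To determine what is responsible for this not working (Problem 1), you can test a few things. Whatever you test, you should run exactly the same test both from the command line (CLI) and via slurm (S). It is understood that some of these tests will give different results in the CLI and S cases.

A few notes: 1) You are not testing exactly the same things in CLI and S. 2) You say that you are "unable to run mpirun -n 1 bin/ua.B.x inputua.data", while the problem is actually with mpirun --host myHost -n 1 bin/ua.B.x inputua.data. 3) The fact that mpirun hostname > output.txt returns an empty file (Problem 2) does not necessarily have the same origin as your main problem, see the paragraph above. You can overcome this problem by using scontrol show hostnames or the environment variable SLURM_NODELIST (on which scontrol show hostnames is based), but this will not solve Problem 1.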


To work around Problem 2, which is not the most important one, try a few things via both CLI and S. The slurm script below may be helpful.

#!/bin/bash
#SBATCH -o slurm_hostname.out        # STDOUT
#SBATCH -e slurm_hostname.err        # STDERR
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/lib64/openmpi/lib"

# Each test writes to its own file so that no result overwrites another.
mpirun hostname > hostname_mpirun.txt                 # 1. Returns values ok for me
hostname > hostname.txt                               # 2. Returns values ok for me
hostname -s > hostname_short.txt                      # 3. Returns values ok for me
scontrol show hostnames > hostname_scontrol.txt       # 4. Returns values ok for me
echo "${SLURM_NODELIST}" > hostname_slurmcontrol.txt  # 5. Returns values ok for me

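(For an explanation of the export command, see this; in short, the ${LD_LIBRARY_PATH:+...} expansion emits the old value and a colon only when the variable is already set and non-empty.) From what you say, I understand that 2, 3, 4 and 5 work ok for you, and 1 does not. So you can now use mpirun with a suitable --host or --hostfile option.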
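As a sketch of how the scontrol output could feed either option: the helper name to_host_list and the file name hostfile.txt are assumptions for illustration. The script only prints the resulting mpirun command lines and does nothing outside a Slurm job.

```shell
#!/bin/bash
# to_host_list: read one host name per line on stdin,
# print them as a single comma-separated list (the form --host expects).
to_host_list() {
    paste -sd, -
}

# Only meaningful inside a Slurm job, where SLURM_NODELIST is set.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_NODELIST:-}" ]; then
    scontrol show hostnames > hostfile.txt       # one host per line
    hosts=$(to_host_list < hostfile.txt)
    # Print the two candidate command lines instead of running them.
    echo "mpirun --hostfile hostfile.txt -n 1 bin/ua.B.x inputua.data"
    echo "mpirun --host ${hosts} -n 1 bin/ua.B.x inputua.data"
fi
```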

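Note the different formats of the output of scontrol show hostnames (e.g., for me, cnode17<newline>cnode18) and of echo ${SLURM_NODELIST} (cnode[17-18]).

The host names could perhaps also be obtained from file names set dynamically with %h and %n in slurm.conf; look for e.g. SlurmdLogFile, SlurmdPidFile.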
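To make the relation between the two formats concrete, here is a bash-only sketch that expands the compact form into one name per line. expand_simple_nodelist is a hypothetical helper that handles only a single prefix[first-last] range; anything else (comma lists, several ranges) is printed unchanged. Inside a job, scontrol show hostnames remains the reliable way to do this.

```shell
#!/bin/bash
# expand_simple_nodelist LIST: expand "prefix[first-last]" into one host per line.
# Illustration only; it does not cover the full Slurm node list syntax.
expand_simple_nodelist() {
    local list=$1
    local re='^([^][,]+)\[([0-9]+)-([0-9]+)\]$'
    if [[ $list =~ $re ]]; then
        local prefix=${BASH_REMATCH[1]}
        local first=${BASH_REMATCH[2]}
        local last=${BASH_REMATCH[3]}
        local width=${#first}   # keep zero padding such as node[08-10]
        local i
        # 10# forces base 10 so that "08" is not read as an invalid octal number.
        for ((i = 10#$first; i <= 10#$last; i++)); do
            printf '%s%0*d\n' "$prefix" "$width" "$i"
        done
    else
        printf '%s\n' "$list"
    fi
}

expand_simple_nodelist 'cnode[17-18]'
```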


To diagnose/work around/solve Problem 1, try mpirun with/without --host, in CLI and S. From what you say, assuming you used the correct syntax in each case, this is the outcome:
  1. mpirun, CLI (original post). "Works".
  2. mpirun, S (comment?). Same error as item 4 below? Note that mpirun hostname in S should have produced similar output in your slurm.err.
  3. mpirun --host, CLI (comment). Error: There are no allocated resources for the application bin/ua.B.x that match the requested mapping: ... This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
  4. mpirun --host, S (original post). Error (same as item 3 above?): There are no allocated resources for the application bin/ua.B.x that match the requested mapping: ------------------------------------------------------------------ Verify that you have mapped the allocated resources properly using the --host or --hostfile specification. ... This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
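One way to keep the CLI and S runs identical is to put the test commands in a single script and run it both directly and via sbatch. A minimal sketch: run_case and the diag_* file names are hypothetical, the context is guessed from SLURM_JOB_ID (set by Slurm inside a job), and the mpirun cases are skipped when mpirun is not on PATH.

```shell
#!/bin/bash
# Label results by context: S inside a Slurm job, CLI otherwise.
if [ -n "${SLURM_JOB_ID:-}" ]; then context=S; else context=CLI; fi

# run_case LABEL CMD [ARGS...]: run CMD, keep stdout and stderr in
# per-label files, and print the exit status on one line.
run_case() {
    local label=$1 status
    shift
    if "$@" > "diag_${context}_${label}.out" 2> "diag_${context}_${label}.err"; then
        status=0
    else
        status=$?
    fi
    echo "${context} ${label}: exit status ${status}"
}

run_case hostname hostname
if command -v mpirun >/dev/null 2>&1; then
    run_case mpirun mpirun hostname
    run_case mpirun_host mpirun --host "$(hostname)" hostname
fi
```

Comparing the diag_CLI_* and diag_S_* files side by side then shows which of the four cases differ.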

According to the comments, you may have a wrong LD_LIBRARY_PATH set. You may also need to use mpirun --prefix ...

Related? https://github.com/easybuilders/easybuild-easyconfigs/issues/204
