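I have been struggling to install Slurm and am at a loss. My goal is to install Slurm on a single machine and submit jobs from that same machine (via sbatch or srun).
At first I tried installing it with apt install slurm-llnl, but the version packaged for Ubuntu 16.04.3 is out of date.
So the next step was to compile Slurm from source. After downloading and extracting the tarball I ran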
./configure --prefix=/etc/init.d/ --sysconfdir=/etc/slurm-llnl/
make
make install
Then I added the following as /etc/ld.so.conf.d/SlurmLib.conf
/etc/init.d/lib
/etc/init.d/lib/slurm
Then I created cgroup.conf, slurm.conf and slurmdb.conf.
[cgroup.conf]
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no
[slurm.conf]
ControlMachine=arroyavelab15
AuthType=auth/none
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/slurm_dir/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/slurm_dir/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/slurm_dir/spool/slurmd/
SlurmUser=danielsauceda
SlurmdUser=danielsauceda
StateSaveLocation=/var/slurm_dir/spool
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=5
SlurmctldLogFile=/var/slurm_dir/slurmctld.log
SlurmdDebug=3
NodeName=arroyavelab15 NodeAddr=xxx.xxx.xxx.xxx.xx CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN RealMemory=8000
PartitionName=debug Nodes=arroyavelab15 Default=YES MaxTime=INFINITE State=UP
[slurmdb.conf]
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
SlurmUser=danielsauceda
DebugLevel=4
PidFile=/var/run/slurmdbd.pid
#
# Database info
StorageType=accounting_storage/mysql
StoragePass=slurm
StorageUser=slurm
Finally, after waiting, I ran
./slurmctld -D
./slurmd -D
./slurmdbd -Dv
They all appear to be running (each in its own terminal).
But when I execute
srun -N3 --nodes=1 --ntasks-per-node=1 hostname
I get the following
srun: error: Couldn't find the specified plugin name for auth/munge looking at all files
srun: error: cannot find auth plugin for auth/munge
srun: error: cannot create auth context for auth/munge
srun: error: Couldn't find the specified plugin name for auth/munge looking at all files
srun: error: cannot find auth plugin for auth/munge
srun: error: cannot create auth context for auth/munge
srun: error: Couldn't find the specified plugin name for auth/munge looking at all files
srun: error: cannot find auth plugin for auth/munge
srun: error: cannot create auth context for auth/munge
srun: error: authentication: authentication initialization failure
srun: error: Srun communication socket apparently being written to by something other than Slurm
srun: error: Unable to allocate resources: Protocol authentication error
I don't know what the problem is, and searching online has not helped much.
Install munge from the package manager, then build Slurm with the --with-munge= option; auth_munge.so should then appear under $PREFIX/lib/slurm.
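As a rough sketch of those steps, assuming Ubuntu (where the libmunge-dev package supplies the munge headers under /usr) and the same prefix and sysconfdir as in the question: the helper names has_auth_munge and rebuild_slurm_with_munge are made up for illustration, and the use of sudo is my assumption, so adjust to your setup.

```shell
#!/bin/sh
# Sketch only: the helper names, the /usr munge location and the use of
# sudo are assumptions, not something taken from the Slurm documentation.

# Succeeds (exit status 0) if the munge auth plugin exists under the
# given install prefix, i.e. at $PREFIX/lib/slurm/auth_munge.so.
has_auth_munge() {
    prefix=$1
    [ -f "$prefix/lib/slurm/auth_munge.so" ]
}

# Installs munge plus its development files, then reconfigures and
# rebuilds Slurm against it. Run this from the extracted Slurm source tree.
rebuild_slurm_with_munge() {
    prefix=$1
    sysconfdir=$2
    sudo apt-get install -y munge libmunge-dev &&
    ./configure --prefix="$prefix" --sysconfdir="$sysconfdir" --with-munge=/usr &&
    make &&
    sudo make install &&
    sudo ldconfig   # refresh the linker cache for the ld.so.conf.d entry
}

# Example (not executed here):
#   rebuild_slurm_with_munge /etc/init.d /etc/slurm-llnl
#   has_auth_munge /etc/init.d && echo "auth/munge plugin is installed"
```

If has_auth_munge still fails after the rebuild, the configure output and config.log are the place to check whether munge was actually detected.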