Running multiple single-core jobs on one node

Question

I have a csh script that looks like this:

foreach n (`seq 1 1000000`)
  ./myprog${n}.x
end

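I want to parallelize it and run it on my Slurm cluster. Because each instance of the program needs only 1 core, I want to use one node (or a few nodes) to run many instances at once: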

#!/bin/csh 
#SBATCH --nodes=8
#SBATCH -n 1024
#SBATCH --ntasks-per-node=128
foreach n (`seq 1 1000000`)
  srun -N 1 -n 1 ./myprog${n}.x &
end
wait

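When I do this, it looks like only 1 instance runs at a time on a given node, although it is hard to tell. Is there an option I can add to srun, or something I can add to the #SBATCH header, that would let me run on all of the cores I requested?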

Answer 1

How to do this may vary with the version of Slurm you are running. However, an example is given here:

https://docs.archer2.ac.uk/user-guide/scheduler/#example-4-256-serial-tasks-running-across-two-nodes

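Note: this assumes you have exclusive node access. Essentially, you loop over the nodes assigned to the job, and then loop over the tasks to be placed on each node. The script below is based on your example job submission script (note: you will need to change the --mem option to a value suited to the total memory available on the compute nodes you are using).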

#!/bin/bash
#SBATCH --job-name=MultiSerialOnComputes
#SBATCH --time=0:10:0
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1


# Get a list of the nodes assigned to this job in a format we can use.
#   scontrol converts the condensed node IDs in the sbatch environment
#   variable into a list of full node IDs that we can use with srun to
#   ensure the subjobs are placed on the correct node. e.g. this converts
#   "nid[001234,002345]" to "nid001234 nid002345"
nodelist=$(scontrol show hostnames $SLURM_JOB_NODELIST)

# Counter selecting which program to run (myprog1.x, myprog2.x, ...)
n=0

# Loop over the nodes assigned to the job
for nodeid in $nodelist
do
    # Loop over 128 subjobs on each node pinning each to a different core
    for i in $(seq 1 128)
    do
        # Advance to the next program number
        n=$((n + 1))
        # Launch subjob overriding job settings as required and in the background
        # Make sure to change the amount specified by the `--mem=` flag to the amount
        # of memory required. The amount of memory is given in MiB by default but other
        # units can be specified.
        srun --nodelist=${nodeid} --nodes=1 --ntasks=1 --ntasks-per-node=1 \
        --exact --mem=1500M ./myprog${n}.x &
    done
done

# Wait for all subjobs to finish
wait

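This does not cover the 1,000,000 tasks you originally specified, but you should be able to work out some arithmetic to do so, dividing the total number of tasks across the nodes you are allocated (or you could set up a set of jobs so that each node ends up with exactly the same number of tasks).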
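One way to do that arithmetic is sketched below. This is only an illustration, not part of the ARCHER2 example: task_range is a hypothetical helper name, and the node count (8) and task total (1,000,000) are taken from the question. It splits the task numbers into contiguous ranges, one per node, giving the first few nodes one extra task when the total does not divide evenly.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): print "first last", the inclusive
# range of task numbers for node index $1 (0-based), when $3 tasks are split
# as evenly as possible across $2 nodes.
task_range() {
    local idx=$1 nnodes=$2 total=$3
    local base=$(( total / nnodes ))    # tasks every node gets
    local extra=$(( total % nnodes ))   # leftover tasks, one each for the first nodes
    local first last
    if (( idx < extra )); then
        first=$(( idx * (base + 1) + 1 ))
        last=$(( first + base ))
    else
        first=$(( extra * (base + 1) + (idx - extra) * base + 1 ))
        last=$(( first + base - 1 ))
    fi
    echo "$first $last"
}

# Dry run: show which myprog numbers each of 8 nodes would be given.
for idx in $(seq 0 7); do
    read -r first last <<< "$(task_range "$idx" 8 1000000)"
    echo "node index $idx: myprog${first}.x .. myprog${last}.x"
done
```

In a real job script, the node index would come from the position of each name in $nodelist, and the inner loop would run from first to last instead of from 1 to 128. Launching that many background srun commands at once is probably unwise, so some way of limiting the number of concurrent steps per node to 128 would still be needed.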
