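I'm building a new Slurm cluster and I'm not very familiar with how resources get allocated. I have 4 nodes with 32 cores each. When I submit jobs, only 1 job runs on each node and the rest sit in the pending state.
All of the jobs should be single-threaded and use only one core. How can I get the other jobs to run? Each node should be able to run 32 of them at once. Here is the output of squeue: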
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
  249  computes freesurf nidb PD 0:00     1 (Resources)
  250  computes freesurf nidb PD 0:00     1 (Priority)
  251  computes freesurf nidb PD 0:00     1 (Priority)
  252  computes freesurf nidb PD 0:00     1 (Priority)
  253  computes freesurf nidb PD 0:00     1 (Priority)
  254  computes freesurf nidb PD 0:00     1 (Priority)
  255  computes freesurf nidb PD 0:00     1 (Priority)
  256  computes freesurf nidb PD 0:00     1 (Priority)
  257  computes freesurf nidb PD 0:00     1 (Priority)
  258  computes freesurf nidb PD 0:00     1 (Priority)
  259  computes freesurf nidb PD 0:00     1 (Priority)
  260  computes freesurf nidb PD 0:00     1 (Priority)
  261  computes freesurf nidb PD 0:00     1 (Priority)
  262  computes freesurf nidb PD 0:00     1 (Priority)
  263  computes freesurf nidb PD 0:00     1 (Priority)
  245  computes freesurf nidb  R 8:00     1 compute60
  246  computes freesurf nidb  R 7:40     1 compute61
  247  computes freesurf nidb  R 7:19     1 compute62
  248  computes freesurf nidb  R 6:55     1 compute63
And here are the node and partition definitions from slurm.conf:
NodeName=compute60 NodeAddr=10.35.10.110 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=515827 State=UNKNOWN
NodeName=compute61 NodeAddr=10.35.10.111 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=515827 State=UNKNOWN
NodeName=compute62 NodeAddr=10.35.10.112 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=515827 State=UNKNOWN
NodeName=compute63 NodeAddr=10.35.10.113 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=515827 State=UNKNOWN
PartitionName=computes Nodes=compute60,compute61,compute62,compute63 Default=NO MaxTime=INFINITE State=UP
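
One thing that may be relevant, offered as a guess rather than a confirmed diagnosis: when slurm.conf does not set SelectType, Slurm defaults to select/linear, which allocates whole nodes to each job, and that would match the one-job-per-node pattern in the squeue output. The lines above do not show whether SelectType is set anywhere in this file, so the fragment below is only a sketch under that assumption. Older Slurm releases use select/cons_res in place of select/cons_tres.

```
# Assumption: SelectType is not already set elsewhere in this slurm.conf.
# Schedule individual cores as consumable resources instead of whole nodes.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
```

With CR_Core and ThreadsPerCore=2, a single-CPU job is allocated one full physical core, which would allow up to 32 such jobs per node. The setting currently in effect can be checked with `scontrol show config | grep -i SelectType`.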