Slurm compute node on a Hyper-V virtual machine: requesting RAM from Hyper-V


I am trying to run a Slurm compute node on a virtual machine managed by Hyper-V. The node runs Ubuntu 16.04.

slurmd -C shows:

NodeName=calc1 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=16013
UpTime=5-20:51:31

This is not entirely correct: the maximum amount of RAM available to this machine is 96 GB, but Hyper-V allocates RAM to the guest on demand. When there is no load, the node has only 16 GB.

I have tried running some Python scripts that process large datasets (without Slurm) and have seen the total RAM grow to 96 GB.
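For illustration, here is a minimal sketch of such a memory-pressure test (the 1 GiB chunk size, the 64 GiB target, and the sleep interval are arbitrary choices for this sketch, not taken from my real scripts):

import time

def mem_total_kb():
    # Parse MemTotal from /proc/meminfo (reported in kB).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])

chunk = 1024 * 1024 * 1024  # allocate 1 GiB at a time
chunks = []
print("MemTotal before: %d kB" % mem_total_kb())
for _ in range(64):  # try to touch up to 64 GiB; may OOM if the balloon lags behind
    chunks.append(bytearray(chunk))  # bytearray is zero-filled, so the pages really get written
    time.sleep(1)                    # give the Hyper-V balloon driver a moment to react
    print("MemTotal now: %d kB" % mem_total_kb())

While this runs, MemTotal in /proc/meminfo grows as Hyper-V hands the guest more RAM.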

My slurm.conf contains, among other lines, the following:

SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1

DefMemPerCPU=2048
NodeName=calc1 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=48 ThreadsPerCore=1 RealMemory=96000 CoreSpecCount=8 MemSpecLimit=6000

However, htop shows that only 8 cores are loaded and 40 are idle, and Mem is still only 16 GB.

Sometimes the node falls into the Drained state with the reason "Low RealMemory". It seems slurmd does not believe what I wrote in slurm.conf.

How can I make slurmd request the remaining gigabytes of RAM?

UPDATE

I have not yet applied the configuration change suggested by @Carles Fenoy, but I have noticed a strange detail.

The output of scontrol show node:

NodeName=calc1 Arch=x86_64 CoresPerSocket=48
   CPUAlloc=40 CPUErr=0 CPUTot=48 CPULoad=10.25
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=calc1 NodeHostName=calc1 Version=17.11
   OS=Linux 4.4.0-145-generic #171-Ubuntu SMP Tue Mar 26 12:43:40 UTC 2019
   RealMemory=96000 AllocMem=81920 FreeMem=179 Sockets=1 Boards=1
   CoreSpecCount=8 CPUSpecList=40-47 MemSpecLimit=6000
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=main
   BootTime=2019-04-12T12:50:39 SlurmdStartTime=2019-04-18T09:24:29
   CfgTRES=cpu=48,mem=96000M,billing=48
   AllocTRES=cpu=40,mem=80G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Then I ssh into calc1 and run free -h. Here is its output:

~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            15G         14G        172M        520K        1.1G         77M
Swap:           15G        644M         15G
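So the guest kernel itself only sees about 16 GB at this moment, even though Slurm has already allocated 80 GB to jobs. A quick sanity check from inside the guest (a sketch; it assumes a standard Linux Hyper-V guest where the hv_balloon driver is present, and it hard-codes the RealMemory value from my slurm.conf):

import os

# Hyper-V Dynamic Memory is implemented in the guest by the hv_balloon driver;
# if it is loaded (or built in), it shows up under /sys/module.
print("hv_balloon present:", os.path.isdir("/sys/module/hv_balloon"))

# Compare what the kernel currently sees with what slurm.conf promises.
with open("/proc/meminfo") as f:
    mem_total_kb = int(next(l for l in f if l.startswith("MemTotal:")).split()[1])
real_memory_mb = 96000  # RealMemory from my slurm.conf
print("kernel MemTotal: %d MB, slurm.conf RealMemory: %d MB"
      % (mem_total_kb // 1024, real_memory_mb))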

UPDATE 2

I have discussed this with our infrastructure specialists and found out that this mechanism is called Hyper-V Dynamic Memory.

I will try to find out whether Microsoft provides any API to the virtual machine. Maybe I will get lucky and someone has already developed a Slurm plugin for it.
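One guest-visible piece of that integration can be inspected already: on Linux, the hv_kvp_daemon mirrors host-provided key/value pairs into pool files under /var/lib/hyperv. A rough reader sketch (the pool number and the fixed record layout of a 512-byte key followed by a 2048-byte value are assumptions based on the kernel's KVP tooling, and reading the files may require root):

# Sketch: dump the Hyper-V KVP items that the host pushes into the guest.
POOL = "/var/lib/hyperv/.kvp_pool_3"  # assumed: pool 3 holds host-to-guest items
KEY_SIZE, VALUE_SIZE = 512, 2048      # assumed fixed, NUL-padded record layout

with open(POOL, "rb") as f:
    while True:
        record = f.read(KEY_SIZE + VALUE_SIZE)
        if len(record) < KEY_SIZE + VALUE_SIZE:
            break
        key = record[:KEY_SIZE].split(b"\x00", 1)[0].decode()
        value = record[KEY_SIZE:].split(b"\x00", 1)[0].decode()
        print("%s = %s" % (key, value))

This only exposes metadata (host name, VM name and so on), not a control API, but it shows that a guest-host channel exists that a hypothetical Slurm plugin could build on.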

Tags: hyper-v slurm
1 Answer

2 votes

Change the FastSchedule parameter to 0 or 2.

Here is an excerpt from the slurm.conf documentation:

   FastSchedule
          Controls how a node's configuration specifications in slurm.conf are used. If the number of node configuration entries in the configuration file is significantly lower than the number of nodes, setting FastSchedule to 1 will permit much faster scheduling decisions to be made. (The scheduler can just check the values in a few configuration records instead of possibly thousands of node records.) Note that on systems with hyper-threading, the processor count reported by the node will be twice the actual processor count. Consider which value you want to be used for scheduling purposes.

          0    Base scheduling decisions upon the actual configuration of each individual node except that the node's processor count in Slurm's configuration must match the actual hardware configuration if PreemptMode=suspend,gang or SelectType=select/cons_res are configured (both of those plugins maintain resource allocation information using bitmaps for the cores in the system and must remain static, while the node's memory and disk space can be established later).

          1 (default)
               Consider the configuration of each node to be that specified in the slurm.conf configuration file and any node with less than the configured resources will be set to DRAIN.

          2    Consider the configuration of each node to be that specified in the slurm.conf configuration file and any node with less than the configured resources will not be set DRAIN. This option is generally only useful for testing purposes.
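Applied to the configuration above, that would mean, for example (a sketch, keeping the node definition with RealMemory=96000 unchanged):

FastSchedule=2

With SelectType=select/cons_res, FastSchedule=0 still requires the configured CPU count to match the actual hardware (here both are 48), while FastSchedule=2 simply trusts the configured RealMemory=96000 and never drains the node, which fits a guest whose visible RAM fluctuates with Hyper-V Dynamic Memory.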