Task scaling problem: too many open files when increasing the number of workers

Problem description

I start an SSH cluster from the command line. Each node has 32 CPUs.

dask ssh --hostfile $PBS_NODEFILE --nworkers 32 --nthreads 1 &

Code:

from dask.distributed import Client, as_completed

# connect to the scheduler started by dask ssh (address as reported in the logs below)
dask_client = Client("158.194.103.68:8786")

# items are individual molecules
# mol_dock is the function to process them (takes 1-20 min each)
for future, res in as_completed(dask_client.map(mol_dock, items), with_results=True):
    ...  # process res

The mol_dock function runs a command in a subprocess shell; that command takes two input files and creates an output JSON file, which mol_dock then parses to produce the returned result.
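For context, a minimal sketch of what such a function could look like (the docking command name, file layout, and input contents are assumptions, not the author's actual code):

import json
import os
import subprocess
import tempfile

def mol_dock(mol):
    # Hypothetical sketch: write the two input files, run the external
    # docking command in a subprocess shell, then parse the JSON it produces.
    with tempfile.TemporaryDirectory() as tmp:
        inp1 = os.path.join(tmp, "ligand.mol")  # hypothetical input file 1
        inp2 = os.path.join(tmp, "config.txt")  # hypothetical input file 2
        out = os.path.join(tmp, "result.json")
        with open(inp1, "w") as f:
            f.write(mol)                        # molecule record as text
        with open(inp2, "w") as f:
            f.write("placeholder settings\n")   # docking parameters
        # "dock_cmd" stands in for the real docking executable
        subprocess.run(f"dock_cmd {inp1} {inp2} -o {out}", shell=True, check=True)
        with open(out) as f:
            return json.load(f)

Each call briefly holds several file descriptors (the input and output files plus the pipes of the child process), so with hundreds of concurrent workers the per-node open-file limit gets exercised quickly.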

If I run the code on 14 nodes it works fine; if I request more nodes, it starts producing "too many open files" errors like the one below. This causes many calculations to fail and restart. In the end all calculations finish successfully, but the overhead caused by the restarts is enormous.

[ scheduler 158.194.103.68:8786 ] : 2023-04-24 17:26:08,591 - tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 20 memory: 254 MB fds: 2048>>
[ scheduler 158.194.103.68:8786 ] : Traceback (most recent call last):
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 443, in wrapper
[ scheduler 158.194.103.68:8786 ] : KeyError: <function Process._parse_stat_file at 0x7f7f37502820>
[ scheduler 158.194.103.68:8786 ] :
[ scheduler 158.194.103.68:8786 ] : During handling of the above exception, another exception occurred:
[ scheduler 158.194.103.68:8786 ] :
[ scheduler 158.194.103.68:8786 ] : Traceback (most recent call last):
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/distributed/system_monitor.py", line 128, in update
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/__init__.py", line 999, in cpu_percent
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1645, in wrapper
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1836, in cpu_times
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1645, in wrapper
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 450, in wrapper
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_pslinux.py", line 1687, in _parse_stat_file
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 776, in bcat
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 764, in cat
[ scheduler 158.194.103.68:8786 ] :   File "/home/pavlop/anaconda3/envs/vina_cache/lib/python3.9/site-packages/psutil/_common.py", line 728, in open_binary
[ scheduler 158.194.103.68:8786 ] : OSError: [Errno 24] Too many open files: '/proc/18692/stat'

We increased the soft and hard limits to 1000000, but this did not help. We did it as suggested in the FAQ, by raising the limits for the specific user in /etc/security/limits.conf.
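The relevant entries in /etc/security/limits.conf look roughly like this (a sketch; the username is assumed from the paths in the traceback above):

# raise the open-file limit for the user running the workers
pavlop  soft  nofile  1000000
pavlop  hard  nofile  1000000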

It seems I am missing something. Are there any other settings worth tweaking or checking? Is there some other limit on the number of open files on Linux? In fact, the same behavior was observed with a soft limit of 100000, so the setting appears to have no effect.

dask dask-distributed
1 Answer

The problem was that the limit had been increased only on the head node of the cluster, not on the individual compute nodes. After fixing that, everything started to work.
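A quick way to verify this is to query the limit on every worker through Dask itself rather than only on the node where the client runs; a minimal sketch, assuming the scheduler address from the logs above:

import resource
from dask.distributed import Client

client = Client("158.194.103.68:8786")  # scheduler address from the logs above

# client.run executes the function on every worker process and returns
# a dict mapping worker address -> (soft, hard) open-file limits.
limits = client.run(lambda: resource.getrlimit(resource.RLIMIT_NOFILE))
for worker, (soft, hard) in limits.items():
    print(worker, soft, hard)

If any worker still reports a low default (often 1024), the limits.conf change has not taken effect on that node.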
