将SIGTERM发送到正在运行的任务,dask已分发

问题描述 投票:0回答:1

当我提交一个小的Tensorflow培训作为单个任务时,它会启动其他线程。当我按下Ctrl+C并提升KeyboardInterrupt时,我的任务被关闭但是基础线程没有被清理并且训练继续进行。

最初,我认为这是Tensorflow的问题(不是清理线程),但经过测试,我明白问题来自Dask方面,可能不会将SIGTERM信号进一步填充到任务函数中。我的问题是,如何设置Dask以将SIGTERM信号填充到正在运行的任务中?

期望流量的示例:

本地进程 - >按Ctrl + C - > Dask调度程序 - > Dask worker - > SIGTERM信号 - >使用Tensorflow培训运行单个任务。

谢谢。

P.S如果您需要其他信息,请询问。

更新:

代码示例:

c = Client('<remote-scheduler>')

def task():
  # tensorflow training
  model = ...
  model.fit(x_train, y_train)

training = c.submit(task)
training.result()

现在,在训练期间,当我按下Ctrl+C任务被取消,但是tensorflow线程/进程仍然存在。

更新2:ps -f -u [username]命令输出。

Dask集群(1个调度程序,1个worker,同一个服务器),没有正在运行的任务:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16547     1  0 12:40 ?        00:00:00 /lib/systemd/systemd --user
vladysl+ 16550 16547  0 12:40 ?        00:00:00 (sd-pam)
vladysl+ 16805 16311  0 12:40 ?        00:00:00 sshd: vladyslav@pts/45
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:24 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22284 22175  0 12:46 ?        00:00:00 sshd: vladyslav@pts/38
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:03 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:03:48 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 26150 22285  0 12:51 pts/38   00:00:00 ps -f -u vladyslav

在任务运行期间:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16547     1  0 12:40 ?        00:00:00 /lib/systemd/systemd --user
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:30 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:06 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:07:55 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 27079 22285  0 12:54 pts/38   00:00:00 ps -f -u vladyslav

按下Ctrl+C后,任务被取消,但tensorflow继续工作:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:31 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:06 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:09:32 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 27117 22285  0 12:54 pts/38   00:00:00 ps -f -u vladyslav

你可以看到没有新的东西出现。

python-3.x tensorflow dask concurrent.futures dask-distributed
1个回答
0
投票

Dask不支持将信号从客户端传播到运行任务的工作人员。

© www.soinside.com 2019 - 2024. All rights reserved.