Easy way to use the parallel options of scikit-learn functions on HPC

Question

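Many scikit-learn functions offer user-friendly parallelization. For example, in sklearn.cross_validation.cross_val_score you just pass the desired number of computational jobs in the n_jobs argument, and on a PC with a multi-core processor this works very nicely. But what if I want to use this option on a high-performance cluster (with the OpenMPI package installed and SLURM for resource management)? As far as I know, sklearn uses joblib for parallelization, which in turn uses multiprocessing. And, as far as I know (from this, for example: Python multiprocessing within mpi), Python programs parallelized with multiprocessing are easy to scale across a whole MPI architecture with the mpirun utility. Can I spread the computation of sklearn functions across several computational nodes just by using mpirun and the n_jobs argument?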

Answer 1

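SKLearn manages its parallelism with Joblib. Joblib can swap out the multiprocessing backend for other distributed systems such as dask.distributed or IPython Parallel. See this issue on the sklearn GitHub page for details.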
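The backend swap itself can be tried on a single machine before any cluster is involved. The following is a minimal sketch, assuming only that joblib is installed; it uses joblib's built-in 'threading' backend as a stand-in for a distributed one, and the `square` function is a hypothetical placeholder for real work.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    """Hypothetical stand-in for an expensive computation."""
    return x * x


# Every Parallel call inside the context uses the selected backend.
# 'threading' ships with joblib; a distributed backend is selected the same way.
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(square)(i) for i in range(5))

print(results)
```

The calling code does not change when the backend name changes, which is why scikit-learn code that relies on Joblib can be redirected without edits to the estimator itself.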

Example using Joblib with Dask.distributed

Code taken from the issue page linked above.

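# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
from joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV

# model, param_space and digits are placeholders defined elsewhere
search = RandomizedSearchCV(model, param_space, cv=10, n_iter=1000, verbose=1)

# requires dask.distributed; point scheduler_host at your running scheduler
with parallel_backend('dask', scheduler_host='your_scheduler_host:your_port'):
    search.fit(digits.data, digits.target)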

This requires that you set up a dask.distributed scheduler and workers on your cluster. General instructions are available here: http://distributed.readthedocs.io/en/latest/setup.html

Example using Joblib with `ipyparallel`

Code taken from the same issue page.

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
from joblib import Parallel, parallel_backend, register_parallel_backend
from sklearn.datasets import load_digits

from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend

digits = load_digits()

# connect to a running ipyparallel cluster started under this profile
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()

# this is taken from the ipyparallel source code
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))

...

with parallel_backend('ipyparallel'):
    search.fit(digits.data, digits.target)

Note: in both of the examples above, the n_jobs parameter no longer seems to matter.

Set up dask.distributed with SLURM

For SLURM, the easiest way to do this is probably the dask-jobqueue project:

>>> from dask_jobqueue import SLURMCluster
>>> cluster = SLURMCluster(project='...', queue='...', ...)
>>> cluster.scale(20)

You can also use dask-mpi or any of the several other methods mentioned in Dask's setup documentation.

Use dask.distributed directly

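Alternatively, you can set up a dask.distributed or IPyParallel cluster and then use those interfaces directly to parallelize your SKLearn code. Here is a video of SKLearn and Joblib developer Olivier Grisel doing exactly that at PyData Berlin: https://youtu.be/Ll6qWDbRTD0?t=1561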
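As a rough illustration of the direct approach, the sketch below submits one cross-validation run per candidate parameter value through the dask.distributed client. It is not taken from the talk; it assumes dask.distributed and scikit-learn are installed, uses an in-process local cluster in place of a real one, and the helper `score_for_C` and the candidate values are hypothetical.

```python
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def score_for_C(C, X, y):
    """Hypothetical helper: mean 3-fold CV accuracy of an SVC with the given C."""
    return float(cross_val_score(SVC(C=C), X, y, cv=3).mean())


# Assumption: an in-process, thread-based local cluster stands in for the real one.
# On a cluster you would pass the scheduler address to Client instead.
client = Client(processes=False)

digits = load_digits()
# Ship the data to the workers once instead of with every task.
X_remote = client.scatter(digits.data, broadcast=True)
y_remote = client.scatter(digits.target, broadcast=True)

candidates = [0.1, 1.0, 10.0]
futures = [client.submit(score_for_C, C, X_remote, y_remote) for C in candidates]
scores = client.gather(futures)

for C, s in zip(candidates, scores):
    print(f"C={C}: mean accuracy {s:.3f}")

client.close()
```

Each submitted task is independent, so the scheduler is free to place the runs on whichever workers are available.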

Try Dask-ML

You can also try the Dask-ML package, which has a RandomizedSearchCV object that is API-compatible with scikit-learn but is implemented on top of Dask.

https://github.com/dask/dask-ml

pip install dask-ml
