OpenMPI over InfiniBand Pkey


I am trying to use openmpi over an InfiniBand pkey network, and I can't find any documentation on how to do this. I am using openmpi 4.1.2 and a simple ring_c.

mpiexec -n 8 -host cn0001-pkey,cn0002-pkey /home/ring_c/ring_c

Network:

cn0001
ib1 (default pkey) 10.1.0.101
ib1.00a0 (pkey 0xa0) 10.2.0.101

cn0002
ib1 (default pkey) 10.1.0.102
ib1.00a0 (pkey 0xa0) 10.2.0.102

I set up the /etc/hosts file like this:

10.2.0.101  cn0001-pkey
10.2.0.102  cn0002-pkey

This appears to execute on cn0001 but hangs on cn0002. I don't think it is actually going over the pkey network. What am I doing wrong?
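For context, the names passed to -host mainly decide which nodes are launched on. They do not by themselves pin Open MPI's TCP traffic to a particular interface. A minimal sketch of one way to restrict it, assuming ib1.00a0 is the pkey interface on every node (as in the table above) and that the TCP-based components are the ones in use: oob_tcp_if_include limits the runtime's out-of-band connections, and btl_tcp_if_include limits the TCP BTL. The helper name build_cmd is hypothetical, and it only prints the command so it can be inspected before being run.

```shell
#!/bin/sh
# Assumption: ib1.00a0 is the IPoIB child interface for pkey 0xa0 on all nodes.
PKEY_IF="ib1.00a0"

# build_cmd (hypothetical helper) prints an mpiexec command line that
# restricts Open MPI's TCP-based components to a single interface:
#   oob_tcp_if_include  -> out-of-band runtime traffic (mpiexec <-> orted)
#   btl_tcp_if_include  -> TCP BTL, relevant only if MPI traffic uses TCP
build_cmd() {
    echo mpiexec -n 8 -host cn0001-pkey,cn0002-pkey \
        --mca oob_tcp_if_include "$1" \
        --mca btl_tcp_if_include "$1" \
        /home/ring_c/ring_c
}

build_cmd "$PKEY_IF"
```

If the job is using a verbs or UCX transport rather than TCP, these two parameters only affect the launch and control traffic, so they would not be enough on their own to prove that MPI messages use the partition.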

mpiexec -n 8 -host cn0001-pkey,cn0002-pkey -mca plm_base_verbose 99 /home/ring_c/ring_c
[cn0001:05516] mca: base: components_register: registering framework plm components
[cn0001:05516] mca: base: components_register: found loaded component isolated
[cn0001:05516] mca: base: components_register: component isolated has no register or open function
[cn0001:05516] mca: base: components_register: found loaded component rsh
[cn0001:05516] mca: base: components_register: component rsh register function successful
[cn0001:05516] mca: base: components_register: found loaded component slurm
[cn0001:05516] mca: base: components_register: component slurm register function successful
[cn0001:05516] mca: base: components_open: opening plm components
[cn0001:05516] mca: base: components_open: found loaded component isolated
[cn0001:05516] mca: base: components_open: component isolated open function successful
[cn0001:05516] mca: base: components_open: found loaded component rsh
[cn0001:05516] mca: base: components_open: component rsh open function successful
[cn0001:05516] mca: base: components_open: found loaded component slurm
[cn0001:05516] mca: base: components_open: component slurm open function successful
[cn0001:05516] mca:base:select: Auto-selecting plm components
[cn0001:05516] mca:base:select:(  plm) Querying component [isolated]
[cn0001:05516] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[cn0001:05516] mca:base:select:(  plm) Querying component [rsh]
[cn0001:05516] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[cn0001:05516] mca:base:select:(  plm) Querying component [slurm]
[cn0001:05516] mca:base:select:(  plm) Selected component [rsh]
[cn0001:05516] mca: base: close: component isolated closed
[cn0001:05516] mca: base: close: unloading component isolated
[cn0001:05516] mca: base: close: component slurm closed
[cn0001:05516] mca: base: close: unloading component slurm
[cn0001:05516] [[33746,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template> ( test ! -r ./.profile || . ./.profile;           PATH=/opt/modules/openmpi/gcc/4.1.2/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/modules/openmpi/gcc/4.1.2/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/modules/openmpi/gcc/4.1.2/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/modules/openmpi/gcc/4.1.2/bin/orted -mca ess "env" -mca ess_base_jobid "2211577856" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "cn[4:1],cn[4:2]-pkey@0(2)" -mca orte_hnp_uri "2211577856.0;tcp://16.1.15.2,10.1.0.201,10.2.0.101:38393" -mca plm_base_verbose "99" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2211577856.0;tcp://16.1.15.2,10.1.0.201,10.2.0.101:38393" -mca pmix "^s1,s2,cray,isolated" )
[cn0002:05935] mca: base: components_register: registering framework plm components
[cn0002:05935] mca: base: components_register: found loaded component rsh
[cn0002:05935] mca: base: components_register: component rsh register function successful
[cn0002:05935] mca: base: components_open: opening plm components
[cn0002:05935] mca: base: components_open: found loaded component rsh
[cn0002:05935] mca: base: components_open: component rsh open function successful
[cn0002:05935] mca:base:select: Auto-selecting plm components
[cn0002:05935] mca:base:select:(  plm) Querying component [rsh]
[cn0002:05935] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[cn0002:05935] mca:base:select:(  plm) Selected component [rsh]
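The orte_hnp_uri value in the log above lists every address that mpiexec advertises to the remote daemon, not only the pkey one. A small sketch for pulling those addresses out of such a string, so they can be compared against the interface table; the function name hnp_addrs is hypothetical and the URI is copied from the log.

```shell
#!/bin/sh
# URI copied verbatim from the plm:rsh template line in the log above.
uri='2211577856.0;tcp://16.1.15.2,10.1.0.201,10.2.0.101:38393'

# hnp_addrs (hypothetical helper): print one advertised IPv4 address per line.
# It drops everything up to "tcp://", strips the trailing ":port",
# then splits the comma-separated list.
hnp_addrs() {
    printf '%s\n' "$1" \
        | sed -e 's|.*tcp://||' -e 's|:[0-9]*$||' \
        | tr ',' '\n'
}

hnp_addrs "$uri"
```

Here three addresses are advertised, and only the last one (10.2.0.101) is on the pkey subnet, which is consistent with the suspicion that the launch is not confined to the pkey network.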
1 Answer

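I think I have figured it out. It worked on the default pkey because the MTU set in the subnet manager for it was lower than the MTU for pkeyA. For some reason, on that IB partition it works with MTU=2048 but not with MTU=4096. I guess that is a problem for another day.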
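A rough way to compare the MTUs seen by the two interfaces, assuming Linux nodes and the interface names from the question. The helper names show_mtu and ping_payload are hypothetical. With IPoIB in datagram mode the interface MTU is normally the IB MTU minus 4 bytes, so 2048 shows up as 2044 and 4096 as 4092.

```shell
#!/bin/sh
# show_mtu (hypothetical helper): print "<interface> <mtu>" for each name,
# reading the value Linux exposes under /sys/class/net/<interface>/mtu.
show_mtu() {
    for ifname in "$@"; do
        f="/sys/class/net/$ifname/mtu"
        if [ -r "$f" ]; then
            echo "$ifname $(cat "$f")"
        else
            echo "$ifname missing"
        fi
    done
}

# ping_payload (hypothetical helper): largest ICMP payload that fits in one
# unfragmented IPv4 packet, i.e. MTU minus 20 (IPv4 header) minus 8 (ICMP).
ping_payload() {
    echo $(( $1 - 28 ))
}

# Assumption: interface names as listed in the question.
show_mtu ib1 ib1.00a0

# Possible follow-up on a node, with fragmentation prohibited (Linux ping):
#   ping -M do -c 3 -s "$(ping_payload 4092)" cn0002-pkey
```

If the large ping fails on the pkey interface while a 2044-sized one succeeds, that would point at an MTU mismatch somewhere along the partition rather than at Open MPI itself.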
