MPI: "for UD mlx5 connect on mlx5_0 failed: No such device"

Problem description (votes: 0, answers: 1)

The MPI error is as follows:

[1689646357.071467] [05af046533e9:124545:0]       ib_device.c:1466 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::1270:fdff:fe44:5170 sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_0 failed: No such device
[1689646357.072612] [05af046533e9:124545:0]      ucp_worker.c:2657 UCX  WARN  worker 0x55741a624b40: 1 pending operations were not flushed
Abort(138006287) on node 0 (rank 0 in comm 0): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(60)......: MPI_Init_thread(argc=0x7ffeb4a07608, argv=0x7ffeb4a07610, required=1, provided=0x7ffeb4a0760c) failed
MPII_Init_thread(232).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(722).........: 
MPIR_Comm_commit_internal(510): 
MPID_Comm_commit_pre_hook(158): 
MPIDI_UCX_init_world(288).....: 
initial_address_exchange(145).:  ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Address not valid)

The error occurs when I use MPI inside Docker. I wrote a hello-world C++ file, compiled it, and ran:

mpirun -np 2 ./hello

docker mpi docker-ucp
1 answer
0 votes

You may not have installed the IB/RoCE NIC driver.
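One way to test that guess from inside the container is to look at which RDMA devices the kernel exposes there. The sketch below assumes the usual Linux locations, /sys/class/infiniband for device names and /dev/infiniband for the verbs device nodes. The function name list_ib_devices is made up for this example. An empty result would support the missing-driver (or device not passed into the container) explanation. A listed mlx5_0 would suggest the cause lies elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print the entries found in a directory, one per line.
# With no argument it reads /sys/class/infiniband, the usual sysfs location
# for InfiniBand/RoCE devices. A missing directory yields empty output.
list_ib_devices() {
    dir="${1:-/sys/class/infiniband}"
    [ -d "$dir" ] || return 0
    ls -1 "$dir"
}

devices="$(list_ib_devices)"
if [ -z "$devices" ]; then
    echo "no RDMA devices visible under /sys/class/infiniband"
else
    echo "RDMA devices visible to the kernel:"
    echo "$devices"
fi

# The verbs device nodes (assumed to live under /dev/infiniband) must also
# be present inside the container for user-space libraries to open them.
nodes="$(list_ib_devices /dev/infiniband)"
if [ -z "$nodes" ]; then
    echo "no device nodes under /dev/infiniband"
else
    echo "device nodes under /dev/infiniband:"
    echo "$nodes"
fi
```

Running the same script on the host and inside the container, then comparing the two outputs, shows whether the device is missing everywhere or only inside Docker.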
