I have been doing some work with mpi4py arrays, and I recently came across a performance boost after using the Scatterv() function. I have developed code that checks the data type of the input object: if it is a numeric numpy array, the scatter is performed with Scatterv(); otherwise it falls back to the generic comm.scatter.

The code looks as follows:
import numpy as np
from mpi4py import MPI
import cProfile
from line_profiler import LineProfiler

def ScatterV(object, comm, root=0):
    optimize_scatter, object_type = np.zeros(1), None

    if rank == root:
        if isinstance(object, np.ndarray):
            if object.dtype in [np.float64, np.float32, np.float16, np.float,
                                np.int, np.int8, np.int16, np.int32, np.int64]:
                optimize_scatter = 1
                object_type = object.dtype
            else:
                optimize_scatter, object_type = 0, None
        else:
            optimize_scatter, object_type = 0, None

    optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()

    comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    object_type = comm.bcast(object_type, root=root)

    if int(optimize_scatter) == 1:
        if rank == root:
            displs = [int(i) * object.shape[1] for i in
                      np.linspace(0, object.shape[0], comm.size + 1)]
            counts = [displs[i + 1] - displs[i] for i in range(len(displs) - 1)]
            lens = [int((displs[i + 1] - displs[i]) / object.shape[1])
                    for i in range(len(displs) - 1)]
            displs = displs[:-1]
            shape = object.shape
            object = object.ravel().astype(np.float64, copy=False)
        else:
            object, counts, displs, shape, lens = None, None, None, None, None

        counts = comm.bcast(counts, root=root)
        displs = comm.bcast(displs, root=root)
        lens = comm.bcast(lens, root=root)
        shape = list(comm.bcast(shape, root=root))
        shape[0] = lens[rank]
        shape = tuple(shape)

        x = np.zeros(counts[rank])

        comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

        return np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
    else:
        return comm.scatter(object, root=root)

comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()

if rank == 0:
    arra = (np.random.rand(10000000, 10) * 100).astype(np.float64, copy=False)
else:
    arra = None

lp = LineProfiler()
lp_wrapper = lp(ScatterV)
lp_wrapper(arra, comm)
if rank == 4:
    lp.print_stats()

pr = cProfile.Profile()
pr.enable()
f2 = ScatterV(arra, comm)
pr.disable()
if rank == 4:
    pr.print_stats()
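The partitioning arithmetic inside ScatterV can be sanity-checked without MPI at all. A minimal sketch (the helper name `scatterv_layout` is hypothetical; it simply mirrors the counts/displacements logic of the function above for a 2-D array):

```python
import numpy as np

def scatterv_layout(shape, size):
    # Hypothetical helper mirroring ScatterV's partitioning: split `shape[0]`
    # rows across `size` ranks and express the split in flat-element terms.
    rows, cols = shape
    displs = [int(i) * cols for i in np.linspace(0, rows, size + 1)]
    counts = [displs[i + 1] - displs[i] for i in range(len(displs) - 1)]
    lens = [c // cols for c in counts]   # rows received by each rank
    return counts, displs[:-1], lens

counts, displs, lens = scatterv_layout((10, 3), 4)
# counts sum to the total element count, lens to the total row count
```

Since `np.linspace` endpoints are truncated with `int()`, rank loads can differ by one row, but no element is dropped or duplicated.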
Profiling with LineProfiler yields the following results [only the conflicting lines are shown]:
Timer unit: 1e-06 s

Total time: 2.05001 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...
    41         1    1708453.0 1708453.0     83.3      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    42         1        148.0     148.0      0.0      object_type = comm.bcast(object_type, root=root)
...
    76         1        264.0     264.0      0.0          counts = comm.bcast(counts, root=root)
    77         1         16.0      16.0      0.0          displs = comm.bcast(displs, root=root)
    78         1         14.0      14.0      0.0          lens = comm.bcast(lens, root=root)
    79         1          9.0       9.0      0.0          shape = list(comm.bcast(shape, root=root))
...
    86         1     340971.0  340971.0     16.6          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
Profiling with cProfile yields the following results:
17 function calls in 0.462 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.127    0.127    0.127    0.127 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
     1    0.335    0.335    0.335    0.335 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
     5    0.000    0.000    0.000    0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
In both cases, the Bcast and Scatterv methods consume large amounts of time. What's more, with LineProfiler the Bcast method is 5 times slower than the Scatterv method, which seems completely incoherent to me, since Bcast only broadcasts an array of 10 elements.
If I swap lines 41 and 42, the results are as follows:
LineProfiler:

    41         1    1666718.0 1666718.0     83.0      object_type = comm.bcast(object_type, root=root)
    42         1         47.0      47.0      0.0      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    87         1     341728.0  341728.0     17.0          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

cProfile:

     1    0.000    0.000    0.000    0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
     1    0.339    0.339    0.339    0.339 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
     5    0.129    0.026    0.129    0.026 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
If I change the size of the array being scattered, the time consumed by Scatterv and Bcast changes at the same rate. For example, if I increase the size 10-fold (to 100000000 elements), the results are:
LineProfiler:

    41         1   16304301.0 16304301.0     82.8      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    42         1        235.0      235.0      0.0      object_type = comm.bcast(object_type, root=root)
    87         1    3393658.0  3393658.0     17.2          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

cProfile:

     1    1.348    1.348    1.348    1.348 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
     1    4.517    4.517    4.517    4.517 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
     5    0.000    0.000    0.000    0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
The same thing happens if, instead of selecting the results for rank 4, I select them for any rank > 1. However, for rank = 0 the results are different:
LineProfiler:

    41         1        186.0      186.0      0.0      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    42         1        244.0      244.0      0.0      object_type = comm.bcast(object_type, root=root)
    87         1    4722349.0  4722349.0    100.0          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)

cProfile:

     1    0.000    0.000    0.000    0.000 {method 'Bcast' of 'mpi4py.MPI.Comm' objects}
     1    5.921    5.921    5.921    5.921 {method 'Scatterv' of 'mpi4py.MPI.Comm' objects}
     5    0.000    0.000    0.000    0.000 {method 'bcast' of 'mpi4py.MPI.Comm' objects}
In this case, the Bcast method has a computation time similar to the other bcast calls.
I also tried using Bcast and bcast instead of scatter on line 41, and it produces the same results.
Given this, I assume that the inflated time consumption is being wrongly attributed to the first broadcast, which would mean that both profilers produce erroneous timings of the parallelized code. I am fairly sure the profilers' internals cannot account for parallelized functions, but I am posting this question to find out whether anyone has experienced similar results.
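As a toy illustration of how this attribution can come about, with no MPI involved: below, threads stand in for ranks and a `threading.Barrier` stands in for the first blocking collective. This is only a sketch under the assumption that what dominates is the blocking wait for the slowest participant, not the payload being communicated: a "rank" that arrives early gets the other rank's head start charged to its own call.

```python
import threading
import time

barrier = threading.Barrier(2)   # stands in for the first MPI collective
time_in_collective = {}

def worker(name, pre_work):
    time.sleep(pre_work)                   # e.g. the root building a big array
    t0 = time.perf_counter()
    barrier.wait()                         # both "ranks" must arrive here
    time_in_collective[name] = time.perf_counter() - t0

threads = [threading.Thread(target=worker, args=("root", 0.5)),
           threading.Thread(target=worker, args=("rank4", 0.0))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# "rank4" arrives early, so ~0.5 s of waiting is charged to its barrier call,
# while "root" spends almost no time in it
```

Any per-line profiler will report that waiting time on the line containing the blocking call, regardless of how cheap the call itself is.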
In response to Gilles Gouaillardet, I included comm.Barrier() calls before and after each bcast line, and most of the time indeed accumulates in those comm.Barrier() calls. Here is an example with LineProfiler.
Timer unit: 1e-06 s

Total time: 2.17248 s
File: /media/SETH_DATA/SETH_Alex/BigMPI4py/prueba.py
Function: ScatterV at line 26

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    26                                           def ScatterV(object, comm, root = 0):
    27         1          7.0       7.0      0.0      optimize_scatter, object_type = np.zeros(1), None
    28
    29         1          2.0       2.0      0.0      if rank == root:
    30                                                    if isinstance(object, np.ndarray):
    31                                                        if object.dtype in [np.float64, np.float32, np.float16, np.float,
    32                                                                            np.int, np.int8, np.int16, np.int32, np.int64]:
    33                                                            optimize_scatter = 1
    34                                                            object_type = object.dtype
    35
    36                                                        else: optimize_scatter, object_type = 0, None
    37                                                    else: optimize_scatter, object_type = 0, None
    38
    39                                                optimize_scatter = np.array(optimize_scatter, dtype=np.float64).ravel()
    40
    41         1    1677662.0 1677662.0     77.2      comm.Barrier()
    42         1         76.0      76.0      0.0      comm.Bcast([optimize_scatter, 1, MPI.DOUBLE], root=root)
    43         1        345.0     345.0      0.0      comm.Barrier()
    44         1        111.0     111.0      0.0      object_type = comm.bcast(object_type, root=root)
    45         1        166.0     166.0      0.0      comm.Barrier()
    46
    47
    48
    49         1          7.0       7.0      0.0      if int(optimize_scatter) == 1:
    50
    51         1          2.0       2.0      0.0          if rank == root:
    52                                                        if object.ndim > 1:
    53                                                            displs = [int(i)*object.shape[1] for i in
    54                                                                      np.linspace(0, object.shape[0], comm.size + 1)]
    55                                                        else:
    56                                                            displs = [int(i) for i in np.linspace(0, object.shape[0], comm.size + 1)]
    57
    58                                                        counts = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
    59
    60                                                        if object.ndim > 1:
    61                                                            lens = [int((displs[i+1] - displs[i])/(object.shape[1]))
    62                                                                    for i in range(len(displs)-1)]
    63                                                        else:
    64                                                            lens = [displs[i+1] - displs[i] for i in range(len(displs)-1)]
    65
    66                                                        displs = displs[:-1]
    67
    68
    69                                                        shape = object.shape
    70
    71
    72
    73                                                        if object.ndim > 1:
    74                                                            object = object.ravel().astype(np.float64, copy=False)
    75
    76
    77                                                    else:
    78         1          2.0       2.0      0.0              object, counts, displs, shape, lens = None, None, None, None, None
    79
    80         1        295.0     295.0      0.0          counts = comm.bcast(counts, root=root)
    81         1         66.0      66.0      0.0          displs = comm.bcast(displs, root=root)
    82         1          6.0       6.0      0.0          lens = comm.bcast(lens, root=root)
    83         1          9.0       9.0      0.0          shape = list(comm.bcast(shape, root=root))
    84
    85         1          2.0       2.0      0.0          shape[0] = lens[rank]
    86         1          3.0       3.0      0.0          shape = tuple(shape)
    87
    88         1         33.0      33.0      0.0          x = np.zeros(counts[rank])
    89
    90         1         76.0      76.0      0.0          comm.Barrier()
    91         1     351187.0  351187.0     16.2          comm.Scatterv([object, counts, displs, MPI.DOUBLE], x, root=root)
    92         1     142352.0  142352.0      6.6          comm.Barrier()
    93
    94         1          5.0       5.0      0.0          if len(shape) > 1:
    95         1         66.0      66.0      0.0              return np.reshape(x, (-1,) + shape[1:]).astype(object_type, copy=False)
    96                                                    else:
    97                                                        return x.view(object_type)
    98
    99
   100                                                else:
   101                                                    return comm.scatter(object, root=root)
77.2% of the time is spent in the first comm.Barrier() call, so I can safely assume that none of the bcast calls is taking that huge amount of time. I will consider adding comm.Barrier() calls for future profiling.