I run numerical simulations. I have 50,000 particles that crystallize in a box containing several crystals (with periodic boundary conditions), and I am trying to find the largest crystal in the box. The code works well, but it takes about 100 [s] on my computer. The rate-determining step is the function find_neighbors() below.

Is there a way to use parallel computing to shorten the computation time? My computer has 12 cores; maybe we can split the array (center) into 12 parts?

The output of my function is a two-dimensional list. Here is the find_neighbors() function, written without parallel computing:
# Calculate if two particles are neighbors (Periodic Boundary Condition)
# center is the coordinates of the particles
# d is the squared distance threshold for two particles to be neighbors
# Maxima_Range and Lattice_Range are for periodic boundary conditions
# Pos_3d is a list. Suppose particle i has neighbors j, k, l.
# Then Pos_3d[i] = [j, k, l]
def find_neighbors(center, d, Maxima_Range, Lattice_Range):
    a, b = np.shape(center)          # get the dimensions of the array
    flag_3d = np.zeros(a)            # count the number of neighbors of particle i
    Pos_3d = [[] for i in range(a)]  # Pos_3d is a two-dimensional list;
                                     # Pos_3d[i] holds the indices of all
                                     # the neighbors of particle i
    diff_sqrt = d ** 0.5
    for i in range(a):
        if i % 6 == 5:               # for some reason, when i % 6 == 5 it is
                                     # the center coordinates; this cannot
                                     # be changed
            for j in range(i + 1, a):
                if j % 6 == 5:
                    diff_x = center[i][0] - center[j][0]
                    diff_y = center[i][1] - center[j][1]
                    diff_z = center[i][2] - center[j][2]
                    # if one of the x, y, z components already exceeds
                    # sqrt(d), we don't need to calculate their distance
                    # in 3-D -- they are not neighbors
                    if (    diff_x <  diff_sqrt
                        and diff_y <  diff_sqrt
                        and diff_z <  diff_sqrt
                        and diff_x > -diff_sqrt
                        and diff_y > -diff_sqrt
                        and diff_z > -diff_sqrt):
                        diff = (  diff_x * diff_x
                                + diff_y * diff_y
                                + diff_z * diff_z)
                        if diff <= d:    # without periodic boundary condition
                            flag_3d[i] += 1
                            flag_3d[j] += 1
                            Pos_3d[i].append(j)
                            Pos_3d[j].append(i)
                            continue
                        # With periodic boundary condition
                        # in x
                        # do something ...
    return Pos_3d
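To illustrate the split I have in mind, here is a rough, untested sketch (the helper names find_neighbors_chunk, find_neighbors_parallel and N_CORES are mine, and periodic boundary handling is omitted):

```python
import numpy as np
from multiprocessing import get_context

N_CORES = 12   # my machine has 12 cores

def find_neighbors_chunk(args):
    """Scan one chunk of i-indices; return the (i, j) neighbor pairs found."""
    i_indices, center, d = args
    a = center.shape[0]
    pairs = []
    for i in i_indices:
        if i % 6 != 5:                      # same "center row" filter as above
            continue
        for j in range(i + 1, a):
            if j % 6 != 5:
                continue
            diff = center[i] - center[j]
            if np.dot(diff, diff) <= d:     # squared distance; PBC omitted here
                pairs.append((i, j))
    return pairs

def find_neighbors_parallel(center, d, n_cores=N_CORES):
    # split the outer-loop indices into n_cores chunks, one per worker
    chunks = np.array_split(np.arange(center.shape[0]), n_cores)
    with get_context("fork").Pool(n_cores) as pool:   # "fork" is POSIX-only
        results = pool.map(find_neighbors_chunk,
                           [(c, center, d) for c in chunks])
    # merge the per-chunk pair lists back into one two-dimensional list
    Pos_3d = [[] for _ in range(center.shape[0])]
    for pairs in results:
        for i, j in pairs:
            Pos_3d[i].append(j)
            Pos_3d[j].append(i)
    return Pos_3d
```

One concern with this split: the outer loop's work is not uniform (early i-s scan more j-s), so equal-sized chunks would be load-imbalanced.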
Yes, there is. Q: "Is there a way to use parallel computing to shorten the computation time?"

Yet, before anyone starts to try it, fix your original code, so as to avoid any and all overheads:

Best avoid lists; rather use data structures that numpy can vectorise and process efficiently:
aVectorOfDIFFs = center[i] - center[j]  # aVectorOfDIFFs.shape == ( 3, ) : _x, _y, _z

if d >= np.dot( aVectorOfDIFFs,         # diff == (   ( _x * _x )
                aVectorOfDIFFs.T        #           + ( _y * _y )
                ):                      #           + ( _z * _z ) )
    # makes sense to do anything ... with ( i, j )
    ...
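As one hedged sketch of what such fully vectorised processing could look like (the function name find_neighbors_vectorised and the use of plain numpy broadcasting are my own choices, not code from the question; periodic images are ignored here, and d is assumed to be the squared distance threshold, as in the original code):

```python
import numpy as np

def find_neighbors_vectorised(center, d):
    """All-pairs squared distances via broadcasting over the "center" rows.

    Trades an (m, m) distance matrix in memory for speed;
    periodic images are NOT handled in this sketch."""
    idx = np.arange(center.shape[0])[5::6]        # only rows with i % 6 == 5
    pts = center[idx]                             # shape (m, 3)
    diff = pts[:, None, :] - pts[None, :, :]      # (m, m, 3) pairwise diffs
    dist2 = np.einsum('ijk,ijk->ij', diff, diff)  # (m, m) squared distances
    ii, jj = np.where(np.triu(dist2 <= d, k=1))   # keep upper triangle, i < j
    Pos_3d = [[] for _ in range(center.shape[0])]
    for i, j in zip(idx[ii].tolist(), idx[jj].tolist()):
        Pos_3d[i].append(j)
        Pos_3d[j].append(i)
    return Pos_3d
```

For 50,000 particles the ~8,333 "center" rows give an (8333, 8333) float matrix, roughly 0.5 GB, so a chunked or tree-based variant may be preferable if memory is tight.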
Besides a smart, top-down re-engineering of the code into best-performing vectorised processing (which we can discuss later, if interested in the ultimate performance), the plain serial original for-loops are very inefficient; most of the time is spent right there, and python has been known for years for its slow loop iteration. Even as-is, the looping can run ~ 20x faster:

>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> def aPairOfNaiveFORs( a = 1E6 ):
...     intA = int( a )                       # a gets fused into (int)
...     for i in range( intA ):               # original, keep looping all i-s
...         if i % 6 == 5:                    # if i-th "center":
...             for j in range( i+1, intA ):  # keep looping all j-s
...                 if j % 6 == 5:            # if j-th "center":
...                     # NOP,                # do some useful work,
...                     # not important here  # for the testing, only the
...                                           # overhead comparison matters
...                     continue              # loop next
...     return

>>> aClk.start(); _ = aPairOfNaiveFORs( 1E3 ); aClk.stop()
14770 [us] for 1E3 ~ 13+ [us] per one item of a[:1000]
14248
13430
13252

>>> aClk.start(); _ = aPairOfNaiveFORs( 1E5 ); aClk.stop()
64559866 [us] for 1E5 ~ 645+ [us] per one item of a[:100000]
Now, avoid all the absurd looping over indices that do not match 5 == <index> modulo 6:

>>> def aPairOfLessNaiveFORs( a = 1E6 ):
...     intA = int( a )
...     for i in range( 5, intA, 6 ):         # yes, start with the 5-th item, the first matching "center"
...         for j in range( i+6, intA, 6 ):   # yes, keep the matching ( modulo 6 ) steps
...             continue
...     return
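A quick sanity check (my own snippet, not part of the original answer) that the strided loops visit exactly the same ( i, j ) pairs as the modulo tests do:

```python
def pairs_modulo(n):
    # the original double loop: test every index against modulo 6
    return [(i, j) for i in range(n) if i % 6 == 5
                   for j in range(i + 1, n) if j % 6 == 5]

def pairs_strided(n):
    # the strided double loop: step straight from "center" to "center"
    return [(i, j) for i in range(5, n, 6)
                   for j in range(i + 6, n, 6)]

print(pairs_modulo(100) == pairs_strided(100))   # → True
print(len(pairs_strided(100)))                   # → 120 pairs from 16 "centers"
```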
The tests show the faster code (avoiding the 5-of-6 non-matching loops) runs ~ +13x ~ +17x faster:

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 1E3 ); aClk.stop()
1171 [us] for 1E3 ~ 1 [us] per one item of a[:1000]
 976
1157
1181
1185

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 1E5 ); aClk.stop()
3725167 [us] for 1E5 ~ 37 [us] per one item of a[:100000]
RESUME:

>>> aClk.start(); _ = aPairOfLessNaiveFORs( 5E5 ); aClk.stop()
 100104334 [us] for 5E5 ~ 17.6 x FASTER
>>> aClk.start(); _ = aPairOfNaiveFORs(     5E5 ); aClk.stop()
1759420300 [us] for 5E5 ~ ^^^^^^^^^^^^^
Even then, it may easily not get any "positive" boost from moving into a parallel computing model (remember the overhead-strict, corrected Amdahl's Law: the add-on overheads of doing so must not be neglected or ignored either).
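As a toy illustration of how that add-on overhead term enters the overhead-strict formulation (the numbers below are invented for illustration, not measured on the code above):

```python
def amdahl_speedup(p, n, setup_overhead=0.0):
    """Overhead-strict Amdahl speedup.

    p              : parallelisable fraction of the serial runtime
    n              : number of workers
    setup_overhead : add-on spawn/IPC cost, expressed as a fraction of
                     the serial runtime (the term the classic formula
                     silently ignores)"""
    return 1.0 / ((1.0 - p) + p / n + setup_overhead)

# classic Amdahl: 95 % parallel work on 12 cores looks promising ...
print(round(amdahl_speedup(0.95, 12), 2))                       # → 7.74
# ... but a 25 % add-on spawn/IPC overhead eats most of it:
print(round(amdahl_speedup(0.95, 12, setup_overhead=0.25), 2))  # → 2.64
```

This is why shrinking the work first (strided loops, vectorisation) pays off before any process-spawning is even attempted.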