My function computes a Lorentzian profile for a given frequency, FWHM, and amplitude. I want to vectorize it so that it can handle lists of frequencies, FWHMs, and amplitudes:
import numpy as np

def lorz1(freq_series, freq, fwhm, amp):
    numerator = fwhm
    denominator = (2*np.pi) * ((freq_series[:,None] - freq)**2 + fwhm**2/4)
    lor = numerator / denominator
    main_peak = amp*(lor/np.linalg.norm(lor, axis=0))
    return np.sum(main_peak, axis=1)
def lorz2(freq_series, freq, fwhm, amp):
    numerator = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor = numerator / denominator
    main_peak = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)
def lorz3(freq_series, freq, fwhm, amp):
    numerator = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor = numerator / denominator
    main_peak = amp*(lor/np.linalg.norm(lor))
    return main_peak
series = np.linspace(0,100,50000)
freq = np.random.uniform(5,50,50)
fwhm = np.random.uniform(0.01,0.05,50)
amps = np.random.uniform(5,500,50)
Timings:
%timeit lorz1(series, freq, fwhm, amps)
38.4 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit lorz2(series, freq, fwhm, amps)
29.8 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(np.array([lorz3(series, item1, item2, item3)
for (item1,item2,item3) in zip(freq, fwhm, amps)]), axis=0)
24.1 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
What went wrong with the vectorization in lorz1 and lorz2? Shouldn't they be faster than lorz3?
I did some further profiling with two versions.

Version 1:
#!/usr/bin/env python3
import numpy as np
def lorz2(freq_series, freq, fwhm, amp):
    numerator = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor = numerator / denominator
    main_peak = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)
series = np.linspace(0,100,50000)
freq = np.random.uniform(5,50,50)
fwhm = np.random.uniform(0.01,0.05,50)
amps = np.random.uniform(5,500,50)
for _ in range(100):
    lorz2(series, freq, fwhm, amps)
And version 2:
#!/usr/bin/env python3
import numpy as np
def lorz3(freq_series, freq, fwhm, amp):
    numerator = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor = numerator / denominator
    main_peak = amp*(lor/np.linalg.norm(lor))
    return main_peak
series = np.linspace(0,100,50000)
freq = np.random.uniform(5,50,50)
fwhm = np.random.uniform(0.01,0.05,50)
amps = np.random.uniform(5,500,50)
for _ in range(100):
    sum(lorz3(series, item1, item2, item3)
        for (item1,item2,item3) in zip(freq, fwhm, amps))
Note how I changed the summation over the lorz3 results to a plain old Python sum. In my tests this was faster because it avoids constructing a temporary array.
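To make that concrete (a minimal sketch with random rows, not the original benchmark): `np.sum(np.array([...]))` first materializes one big 2-D array, while feeding the rows to the builtin `sum` adds them pairwise, one small temporary at a time:

```python
import numpy as np

rows = [np.random.rand(50_000) for _ in range(50)]

# Variant A: stack all rows into one (50, 50000) temporary, then reduce.
total_a = np.sum(np.array(rows), axis=0)

# Variant B: plain Python sum adds the rows one at a time;
# no (50, 50000) intermediate is ever allocated.
total_b = sum(rows)

print(np.allclose(total_a, total_b))  # True
```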
Here are the results of some profiling I did:
perf stat -ddd ./lorz2.py
Performance counter stats for './lorz2.py':
2729.16 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
217114 page-faults:u # 79.554 K/sec
8192141440 cycles:u # 3.002 GHz (38.43%)
3178961202 instructions:u # 0.39 insn per cycle (46.12%)
426575242 branches:u # 156.303 M/sec (53.81%)
2177628 branch-misses:u # 0.51% of all branches (61.51%)
42020185035 slots:u # 15.397 G/sec (69.20%)
323473974 topdown-retiring:u # 0.6% Retiring (69.20%)
33616148028 topdown-bad-spec:u # 67.1% Bad Speculation (69.20%)
371211166 topdown-fe-bound:u # 0.7% Frontend Bound (69.20%)
15767347418 topdown-be-bound:u # 31.5% Backend Bound (69.20%)
813550722 L1-dcache-loads:u # 298.096 M/sec (69.19%)
546814255 L1-dcache-load-misses:u # 67.21% of all L1-dcache accesses (69.21%)
82889242 LLC-loads:u # 30.372 M/sec (69.22%)
67633317 LLC-load-misses:u # 81.59% of all LL-cache accesses (69.24%)
<not supported> L1-icache-loads:u
9705348 L1-icache-load-misses:u # 0.00% of all L1-icache accesses (30.81%)
864895659 dTLB-loads:u # 316.909 M/sec (30.79%)
117310 dTLB-load-misses:u # 0.01% of all dTLB cache accesses (30.78%)
<not supported> iTLB-loads:u
85530 iTLB-load-misses:u # 0.00% of all iTLB cache accesses (30.76%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
2.729696014 seconds time elapsed
1.932708000 seconds user
0.796504000 seconds sys
And here is the faster version:
Performance counter stats for './lorz3.py':
878.49 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
52869 page-faults:u # 60.182 K/sec
3704170220 cycles:u # 4.217 GHz (38.22%)
3735225800 instructions:u # 1.01 insn per cycle (45.96%)
568575253 branches:u # 647.221 M/sec (53.70%)
2580294 branch-misses:u # 0.45% of all branches (61.43%)
18355798588 slots:u # 20.895 G/sec (69.17%)
3328525030 topdown-retiring:u # 17.5% Retiring (69.17%)
6982401815 topdown-bad-spec:u # 36.6% Bad Speculation (69.17%)
1297505291 topdown-fe-bound:u # 6.8% Frontend Bound (69.17%)
7459691283 topdown-be-bound:u # 39.1% Backend Bound (69.17%)
858082535 L1-dcache-loads:u # 976.773 M/sec (69.28%)
430569310 L1-dcache-load-misses:u # 50.18% of all L1-dcache accesses (69.40%)
15723297 LLC-loads:u # 17.898 M/sec (69.49%)
73709 LLC-load-misses:u # 0.47% of all LL-cache accesses (69.50%)
<not supported> L1-icache-loads:u
38705486 L1-icache-load-misses:u # 0.00% of all L1-icache accesses (30.72%)
860276161 dTLB-loads:u # 979.270 M/sec (30.60%)
86213 dTLB-load-misses:u # 0.01% of all dTLB cache accesses (30.51%)
<not supported> iTLB-loads:u
91069 iTLB-load-misses:u # 0.00% of all iTLB cache accesses (30.50%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
0.878946776 seconds time elapsed
0.852205000 seconds user
0.026744000 seconds sys
Note that the instruction count in the faster code is actually slightly higher, which makes sense since it is less vectorized, but it achieves more instructions per cycle and is faster overall. The slower version does twice as many LLC loads, most of which miss, while here almost all of them hit. I don't know how to interpret the topdown-bad-spec counter; maybe someone else can comment on that.

The CPU even clocks down (this is reproducible), which supports the idea that it is simply waiting for memory.
Also note the sys time in the last lines: lorz2 spends 28% of its runtime in kernel space. Since it does nothing IO-related, that is all memory allocation and deallocation overhead.
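A back-of-the-envelope calculation (assuming float64 throughout) shows why: every broadcasted intermediate in lorz2 is a full 2-D array, while lorz3 only ever touches one row at a time:

```python
import numpy as np

n_freqs, n_series = 50, 50_000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes

# lorz2 broadcasts to (50, 50000) for every intermediate
# (subtraction, square, denominator, lor, main_peak, ...).
per_temp_2d = n_freqs * n_series * itemsize
print(f"lorz2 temporary: {per_temp_2d / 2**20:.1f} MiB")  # 19.1 MiB

# lorz3 works one row at a time: each temporary is only (50000,).
per_temp_1d = n_series * itemsize
print(f"lorz3 temporary: {per_temp_1d / 2**10:.1f} KiB")  # 390.6 KiB
```

A ~19 MiB working set (times several live temporaries) blows through typical L2/L3 cache sizes and is large enough that the allocator hands the pages back to the kernel each time, while ~400 KiB fits comfortably in L2 — consistent with both the LLC miss rates and the sys time above.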
We can learn a bit more about the stall reasons:
perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz2.py
Performance counter stats for './lorz2.py':
8446540078 cycles:u
1953955881 l1d_pend_miss.l2_stall:u
3050292324 cycle_activity.stalls_l3_miss:u
2.748141433 seconds time elapsed
1.959570000 seconds user
0.788443000 seconds sys
perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz3.py
Performance counter stats for './lorz3.py':
3674547216 cycles:u
303870088 l1d_pend_miss.l2_stall:u
16939496 cycle_activity.stalls_l3_miss:u
0.869909182 seconds time elapsed
0.848122000 seconds user
0.021752000 seconds sys
So the lorz2 version simply stalls continuously on level 2 and level 3 cache misses.
We can also take a look at a simple perf report:
perf record ./lorz2.py
perf report
# Overhead Command Shared Object Symbol
# ........ ....... ................................................. ...............................................
#
32.44% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_multiply
27.03% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_divide
17.34% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_add
6.93% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_subtract
6.12% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_pairwise_sum
5.32% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_square
0.51% python3 [unknown] [k] 0xffffffffb2800fe7
0.46% python3 libpython3.11.so.1.0 [.] _PyEval_EvalFrameDefault
0.19% python3 libpython3.11.so.1.0 [.] unicodekeys_lookup_unicode
0.18% python3 libpython3.11.so.1.0 [.] gc_collect_main
...
perf record ./lorz3.py
perf report
# Overhead Command Shared Object Symbol
# ........ ....... ................................................. .............................................
#
27.56% python3 libblas.so.3.11.0 [.] ddot_
27.47% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_divide
8.64% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_subtract
8.34% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_add
5.84% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_multiply
3.70% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_square
1.47% python3 libpython3.11.so.1.0 [.] _PyEval_EvalFrameDefault
1.38% python3 libgcc_s.so.1 [.] execute_cfa_program
0.89% python3 libgcc_s.so.1 [.] uw_update_context_1
0.88% python3 libgcc_s.so.1 [.] _Unwind_Find_FDE
0.64% python3 libgcc_s.so.1 [.] uw_frame_state_for
0.61% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] ufunc_generic_fastcall
...
Huh, interesting. Where does the dot product come from? I assume that is how linalg.norm is implemented for a plain vector.
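That guess is easy to check numerically (a small sketch; which BLAS routine actually runs depends on the NumPy build): for a 1-D array, `np.linalg.norm` computes the 2-norm, which is the square root of the vector's dot product with itself — hence the `ddot_` in the perf report.

```python
import numpy as np

x = np.random.rand(100_000)

# The 2-norm of a 1-D vector equals sqrt(x . x), so a norm over a
# plain vector can be delegated to a single BLAS dot product.
print(np.allclose(np.linalg.norm(x), np.sqrt(x @ x)))  # True
```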
By the way, we can speed up both the lorz2 and lorz3 versions a bit with three tweaks:
def lorz2a(freq_series, freq, fwhm, amp):
    numerator = fwhm[:,None] * (0.5 / np.pi)
    denominator = (freq_series - freq[:,None])**2 + 0.25 * fwhm[:,None]**2
    lor = numerator / denominator
    return (amp / np.linalg.norm(lor, axis=1)) @ lor
def lorz3a(freq_series, freq, fwhm, amp):
    numerator = fwhm * (0.5 / np.pi)
    denominator = (freq_series - freq)**2 + 0.25 * fwhm**2
    lor = numerator / denominator
    main_peak = amp / np.linalg.norm(lor) * lor
    return main_peak
But this doesn't change the overall trend.

Numpy vectorization mainly helps reduce per-call overhead. Once the arrays are large enough, we don't gain much more from it, because the remaining interpreter overhead is small compared to the computation itself. At the same time, larger arrays make the memory accesses less efficient. There is usually a sweet spot somewhere around the L2 or L3 cache size, and the lorz3 implementation hits it better than the others.
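One way to aim for that sweet spot explicitly (a sketch of my own, not from the measurements above) is to split the frequency axis into chunks: each broadcasted intermediate is then only `(chunk, len(freq_series))`, small enough to stay cache-resident, while the interpreter overhead is still amortized over whole chunks rather than single rows:

```python
import numpy as np

def lorz_chunked(freq_series, freq, fwhm, amp, chunk=8):
    """lorz2-style broadcasting, limited to `chunk` rows at a time so
    each (chunk, len(freq_series)) temporary stays near cache size."""
    out = np.zeros_like(freq_series)
    for i in range(0, len(freq), chunk):
        f = freq[i:i+chunk, None]
        w = fwhm[i:i+chunk, None]
        a = amp[i:i+chunk, None]
        lor = (w * (0.5 / np.pi)) / ((freq_series - f)**2 + 0.25 * w**2)
        # Normalization is per row, so chunking does not change the result.
        out += np.sum(a * (lor / np.linalg.norm(lor, axis=1)[:, None]), axis=0)
    return out

series = np.linspace(0, 100, 50_000)
freq = np.random.uniform(5, 50, 50)
fwhm = np.random.uniform(0.01, 0.05, 50)
amps = np.random.uniform(5, 500, 50)
result = lorz_chunked(series, freq, fwhm, amps)
```

The best `chunk` value is machine-dependent; the point is only that it bounds the temporary size independently of how many peaks there are.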
For smaller series sizes and larger other arrays, we can expect lorz2 to perform better. For example, this dataset makes my lorz2a faster than my lorz3a:
series = np.linspace(0,100,1000)
freq = np.random.uniform(5,50,2000)
fwhm = np.random.uniform(0.01,0.05,2000)
amps = np.random.uniform(5,500,2000)
Numpy's simple, eager evaluation scheme shifts the responsibility for this kind of tuning onto the user. Other libraries such as NumExpr try to avoid it.
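Within plain NumPy, some of that eager-evaluation cost can be clawed back by hand, at the price of readability: the `out=` parameter of the ufuncs lets us reuse one buffer instead of allocating a fresh 2-D temporary per operator. A sketch, equivalent to the lorz2a denominator:

```python
import numpy as np

series = np.linspace(0, 100, 50_000)
freq = np.random.uniform(5, 50, 50)
fwhm = np.random.uniform(0.01, 0.05, 50)

# Eager style: each operator allocates a fresh (50, 50000) temporary.
denom_eager = (series - freq[:, None])**2 + 0.25 * fwhm[:, None]**2

# Manual fusion: one (50, 50000) allocation, then in-place updates.
buf = np.subtract(series, freq[:, None])  # the single big allocation
np.multiply(buf, buf, out=buf)            # square in place
buf += 0.25 * fwhm[:, None]**2            # broadcast the (50, 1) term in place
print(np.allclose(denom_eager, buf))  # True
```

This is essentially what NumExpr does automatically by compiling the whole expression and streaming it through cache-sized blocks.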