I ran the Intel MKL LINPACK benchmark on an Intel Core i7-14700K processor and got a peak performance of 557 GFLOPS, which seems quite unrealistic.
Size    LDA     Align.  Average (GFLOPS)  Maximal (GFLOPS)
1000    1000    4       155.1099          216.8890
2000    2000    4       425.5128          459.9769
5000    5008    4       379.0532          393.7132
10000   10000   4       427.9537          435.6706
15000   15000   4       426.8314          427.5827
18000   18008   4       545.7857          549.8816
20000   20016   4       553.3485          553.5723
22000   22008   4       548.1379          552.2941
25000   25000   4       549.4231          555.0353
26000   26000   4       550.3011          554.8746
27000   27000   4       542.6011          542.6011
30000   30000   1       532.8780          532.8780
35000   35000   1       534.7904          534.7904
40000   40000   1       557.7524          557.7524
45000   45000   1       557.3916          557.3916
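The 155 GFLOPS value for size 1000 seems reasonable, but 557 GFLOPS is too high. Does anyone know how this happens?
I used the following package:
http://registrationcenter-download.intel.com/akdlm/irc_nas/9752/l_mklb_p_2018.3.011.tgz
I started the test with the following command:
./runme_xeon64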
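One way to judge whether a measured figure is unrealistic is to compare it with a theoretical peak, computed as cores × clock × floating-point operations per cycle per core. The sketch below is only an illustration: `peak_gflops` and the values in `hypothetical_groups` are assumptions made up for this example, not specifications of the i7-14700K. The one firm fact used is that a core with two 256-bit FMA units can retire 2 units × 4 doubles × 2 operations = 16 double-precision FLOPs per cycle. On a hybrid processor, each core type would get its own entry in the list.

```python
from typing import Iterable, Tuple


def peak_gflops(core_groups: Iterable[Tuple[int, float, int]]) -> float:
    """Sum cores * clock_GHz * FLOPs_per_cycle over every group of identical cores."""
    return sum(cores * ghz * flops_per_cycle
               for cores, ghz, flops_per_cycle in core_groups)


# Hypothetical values for illustration only (NOT the real 14700K specification):
# 8 cores sustaining 5.0 GHz, each doing 16 double-precision FLOPs per cycle.
hypothetical_groups = [(8, 5.0, 16)]

estimate = peak_gflops(hypothetical_groups)
measured = 557.7524  # highest "Maximal" value in the table above

print(f"Illustrative peak: {estimate:.1f} GFLOPS")
print(f"Measured / illustrative peak: {measured / estimate:.2f}")
```

If the ratio comes out below 1 for the real core counts, sustained clocks, and vector widths of the chip, the measured number is at least physically possible.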
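I can verify these results for the 14700K. Using the Intel oneAPI Math Kernel Library and numpy, I was able to reach 550-650 GFLOPS in Python, even though Python carries significant overhead. To be clear, this runs on all cores, since the Intel BLAS library is very well optimized.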
import numpy as np
from time import time_ns


def benchCPU(A, B, C):
    # 20 iterations, each performing two matrix multiplications
    for i in range(0, 20):
        print("Iteration: " + "%d" % i)
        C = np.matmul(C, A)
        C = np.matmul(C, B)
        C = C/np.max(C)  # rescale so the values stay bounded
    return 0


if __name__ == '__main__':
    samples = 7000
    A = np.random.rand(samples, samples).astype(np.float32)
    B = np.random.rand(samples, samples).astype(np.float32)
    C = np.random.rand(samples, samples).astype(np.float32)

    # Measure the cost of the timer calls themselves
    t1 = time_ns()
    t2 = time_ns()
    tdly = t2 - t1

    # Untimed multiplication before the benchmark starts
    C = np.matmul(A, B)

    print("CPU Test")
    t1 = time_ns()
    benchCPU(A, B, C)
    t2 = time_ns()
    t_cpu = t2 - t1 - tdly

    # A matrix multiplication takes 2n^3 - n^2 operations; there are 20 iterations
    # with 2 multiplications each. The max/rescale step is considered negligible.
    operations = 2*20*(2*samples**3 - samples**2)
    print("CPU Throughput: " + "%.3f" % ((operations/(t_cpu*1e-9))*1e-12) + " TFLOPS")
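The operation count used in the script above can be checked on its own, without running the benchmark. The sketch below factors the arithmetic into small helpers. The names `matmul_flops`, `total_flops`, and `throughput_tflops` are introduced here for illustration, and the 45-second elapsed time is a made-up placeholder, not a measurement.

```python
def matmul_flops(n: int) -> int:
    """Operations for one n x n by n x n product: n^2 outputs, each n multiplies and n-1 adds."""
    return 2 * n**3 - n**2


def total_flops(n: int, iterations: int, matmuls_per_iteration: int) -> int:
    """Total operations for the whole benchmark loop (rescaling step ignored)."""
    return iterations * matmuls_per_iteration * matmul_flops(n)


def throughput_tflops(operations: int, elapsed_ns: int) -> float:
    """Convert an operation count and an elapsed time in nanoseconds to TFLOPS."""
    return operations / (elapsed_ns * 1e-9) * 1e-12


# Same parameters as the script: 7000 x 7000 matrices, 20 iterations, 2 products each.
ops = total_flops(7000, 20, 2)
print(f"Total operations: {ops}")

# Placeholder elapsed time of 45 s, for illustration only.
hypothetical_elapsed_ns = 45 * 10**9
print(f"Throughput: {throughput_tflops(ops, hypothetical_elapsed_ns):.3f} TFLOPS")
```

Note that the script multiplies `float32` matrices, while LINPACK solves a double-precision system, so the two throughput figures are not directly comparable.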