Details worth noting:
First is the attempt to use guvectorize with the function below. I am passing in a bunch of numpy arrays and trying to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error:
File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", line 82, in jitwrapper
debug=debug)
TypeError: __init__() got an unexpected keyword argument 'debug'
After following up on everything I could find about this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing, but oh well. It should also be said that this error only occurs after running the code once and having it crash from memory overload.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...
All of the arrays are numpy arrays
@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount(ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0, ed):
        for ID in range(0, rowCount):
            for num in range(0, n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]
Attempting to run the code with nvprof in the command line results in the following error:
Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
I realized I am using SLI-enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit things to one card, but I was left with the same result.
I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) with prange. By using a smaller data size we can run the code, but it is slower than with target='parallel' and target='cpu'.
Why is cuda so much slower, and what do these errors mean?
Thanks for the help!
Edit: here is a working example of the code:
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount(countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0, countRow):
        for ID in range(0, countCol):
            for num in range(0, n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
I am able to run the code in cuda using a gtx 1080ti; however, it is much slower than running it in parallel or on the cpu. I have looked at other posts relating to guvectorize, but none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there a way to make this code 'cuda friendly', or is simply doing multiplication across arrays too simple to see any benefit?
First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a third array, and pass that third array back to the host.
It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing. This is because the arithmetic intensity per byte of data transferred between host and device is just too low. This operation is too simple.
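To put a rough number on that claim, here is a back-of-envelope sketch (my own arithmetic, not part of the original discussion), treating all three arrays as full-size float64 arrays, as in the vectorize version shown further below:

elements = 100 * 4000 * 5        # 2,000,000 multiplies, one per output element
bytes_moved = 3 * elements * 8   # two float64 inputs in, one float64 output back
print(elements / bytes_moved)    # ~0.042 multiplies per byte transferred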
Secondly, I believe it's important to understand what numba vectorize and guvectorize are intended to do. The basic premise is to write the ufunc definition from the standpoint of "what will one worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signature you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data, but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here as well.
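As a minimal illustration of that premise (a sketch of my own, not code from the question), the ufunc body below describes one worker's job on a single scalar, and numba decides how many workers to run from the size of the array we pass:

import numpy as np
from numba import vectorize

@vectorize(['float64(float64)'])
def add_one(x):
    # one worker's job: process a single scalar element
    return x + 1.0

print(add_one(np.arange(8.0)))  # passing 8 elements implies 8 workers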
So a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is actually just a numba minimum threadblock size, and that in fact only 1 thread in that threadblock is doing any useful computation. That one thread is executing your entire ufunc, including all the for-loops, in a serial fashion.
That is obviously not what anyone would intend for rational use of vectorize or guvectorize.
So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a third array. We want to repeat that process many times. If all three array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of the multBy array to match cv and discount:
$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
312.36ms 1.2160us - - - - - 8B 6.2742MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.81ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.52ms 5.8696ms - - - - - 15.259MB 2.5387GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.74ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
319.93ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
321.40ms 1.22538s (1 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s 7.1816ms - - - - - 15.259MB 2.0749GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount(a, b):
    return a * b
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
193.92ms 6.2729ms - - - - - 15.259MB 2.3755GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
201.09ms 5.7101ms - - - - - 15.259MB 2.6096GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
364.92ms 842.49us (15625 1 1) (128 1 1) 13 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms 7.1528ms - - - - - 15.259MB 2.0833GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a run-time of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel durations in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly really only 1 thread) in action with your original code.
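The grid arithmetic is easy to verify (my own check, not part of the profiler output):

blocks, threads_per_block = 15625, 128
assert blocks * threads_per_block == 100 * 4000 * 5 == 2000000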
Given the simplicity of the above vectorize approach, if what you really want to do is this elementwise multiplication, then you might consider just replicating multBy so that it matches the other two arrays dimensionally, and being done with it.
But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).
To tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.
In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (the last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or countRow dimension, we will have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However, for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see the worked example below for how you might handle the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:
$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount(multBy, discount, n, countCol, cv):
    for ID in range(0, countCol):
        for num in range(0, n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
307.05ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
307.79ms 5.9293ms - - - - - 15.259MB 2.5131GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.34ms 1.3440us - - - - - 8B 5.6766MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
314.54ms 896ns - - - - - 8B 8.5149MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
317.27ms 47.398ms (2 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms 7.3799ms - - - - - 15.259MB 2.0192GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Observations:

1. The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified the 3-dimensional shapes in the signature to be two-dimensional. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker operate on a slice.
2. The kernel time reported by the profiler (~47ms) now falls in between the original (~1.2s) and the massively parallel vectorize version (~0.001s). So going from 1 worker to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.
3. Some additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels for the original, vectorize, and improved guvectorize versions is different. If we modify the t15.py code to do a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing:
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer
@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount(a, b):
    return a * b
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real 0m2.522s
user 0m1.572s
sys 0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real 0m1.050s
user 0m0.473s
sys 0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real 0m1.252s
user 0m0.680s
sys 0m0.441s
$
Now, responding to a question in the comments which effectively asked: "How do I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"
We can be guided by the approach that worked for slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove it, similar to what we did in the previous guvectorize treatment. Here's a worked example:
$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount(countCol, multBy, discount, n, cv):
    for ID in range(0, countCol):
        for num in range(0, n):
            cv[ID][num] = multBy[num] * discount[ID][num]
countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
304.92ms 1.1840us - - - - - 8B 6.4437MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
305.36ms 27.392us - - - - - 156.25KB 5.4400GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
306.08ms 6.0208ms - - - - - 15.259MB 2.4749GB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
312.44ms 1.0880us - - - - - 8B 7.0123MB/s Pageable Device Quadro K2000 (0 1 7 [CUDA memcpy HtoD]
313.59ms 8.9961ms (63 1 1) (64 1 1) 63 0B 0B - - - - Quadro K2000 (0 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms 7.2772ms - - - - - 15.259MB 2.0476GB/s Device Pageable Quadro K2000 (0 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
Somewhat predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we slice into 4000 workers. Similarly, we observe that numba chooses to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".
This is still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but it is quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.
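As a side note, the block count numba chose follows directly from the worker count (my own arithmetic, assuming workers are simply rounded up to whole 64-thread blocks):

import math
workers, block_size = 4000, 64
blocks = math.ceil(workers / block_size)
print(blocks, blocks * block_size)  # 63 blocks, 4032 threads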
As a final observation, let's revisit the statement I made earlier (similar to statements made in the comments and in the other answer):
"I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing."
We now have convenient test cases in t16.py and t17.py that we can use to test this. For simplicity I'll choose t16.py. We can "convert it back to a CPU code" simply by removing the target designation from the guvectorize ufunc:
$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount(multBy, discount, n, countCol, cv):
    for ID in range(0, countCol):
        for num in range(0, n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741
real 0m0.528s
user 0m0.474s
sys 0m0.047s
$
So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization and copying of data to/from the GPU. The overall walltime is also our best measurement, at about 0.5s, compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.
The reason why the gufunc Numba emits and runs is so slow becomes immediately apparent on profiling (numba 0.38.1 with CUDA 8.0):
==24691== Profiling application: python slowvec.py
==24691== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
271.33ms 1.2800us - - - - - 8B 5.9605MB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
271.65ms 14.591us - - - - - 156.25KB 10.213GB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
272.09ms 2.5868ms - - - - - 15.259MB 5.7605GB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
274.98ms 992ns - - - - - 8B 7.6909MB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
275.17ms 640ns - - - - - 8B 11.921MB/s GeForce GTX 970 1 7 [CUDA memcpy HtoD]
276.33ms 657.28ms (1 1 1) (64 1 1) 40 0B 0B - - GeForce GTX 970 1 7 cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms 3.5128ms - - - - - 15.259MB 4.2419GB/s GeForce GTX 970 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
The resulting kernel launch for the generated code uses a single block of 64 threads. On a GPU which can theoretically have up to 2048 threads per MP, and with 23 MPs, that means roughly 99.9% of the theoretical processing capacity of the GPU is not being used. This looks like a ridiculous design choice by the numba developers, and I would report it as a bug if you are being hindered by it (and it seems you are).
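The 99.9% figure follows from simple occupancy arithmetic (my own check of the numbers quoted above):

max_resident = 2048 * 23      # theoretical resident threads: 2048 per MP x 23 MPs
print(1 - 64 / max_resident)  # ~0.9986, i.e. roughly 99.9% unused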
The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of the execution parameters. That way you can at least ensure that the code runs with enough threads to potentially use all the capacity of your hardware. It is still a very memory-bound operation, so the speed-up you can achieve is probably limited to considerably less than the ratio of your GPU's memory bandwidth to your CPU's. And that may well be insufficient to amortize the cost of the host-to-device memory transfers, so there may be no performance gains even in the best possible case, and this is far from that.
In short, beware the perils of automagic compiler-generated parallelism....
Postscript: I managed to work out how to get the PTX of the code numba emits, and suffice it to say, it is absolutely craptulacular (and so long that I can't actually post all of it):
{
.reg .pred %p<9>;
.reg .b32 %r<8>;
.reg .f64 %fd<4>;
.reg .b64 %rd<137>;
ld.param.u64 %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
ld.param.u64 %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
ld.param.u64 %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
ld.param.u64 %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
ld.param.u64 %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
ld.param.u64 %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
ld.param.u64 %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
ld.param.u64 %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
ld.param.u64 %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
ld.param.u64 %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
ld.param.u64 %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
ld.param.u64 %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
ld.param.u64 %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
ld.param.u64 %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
ld.param.u64 %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
ld.param.u64 %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
ld.param.u64 %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
ld.param.u64 %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
ld.param.u64 %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
ld.param.u64 %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
ld.param.u64 %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
ld.param.u64 %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
ld.param.u64 %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
ld.param.u64 %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
ld.param.u64 %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
ld.param.u64 %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
ld.param.u64 %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
mov.u32 %r1, %tid.x;
mov.u32 %r3, %ctaid.x;
mov.u32 %r2, %ntid.x;
mad.lo.s32 %r4, %r3, %r2, %r1;
min.s64 %rd62, %rd32, %rd29;
min.s64 %rd63, %rd39, %rd62;
min.s64 %rd64, %rd48, %rd63;
min.s64 %rd65, %rd51, %rd64;
min.s64 %rd66, %rd54, %rd65;
cvt.s64.s32 %rd1, %r4;
setp.le.s64 %p2, %rd66, %rd1;
@%p2 bra BB0_8;
ld.param.u64 %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
ld.param.u64 %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
ld.param.u64 %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
ld.param.u64 %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
ld.param.u64 %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
ld.param.u64 %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
cvt.u32.u64 %r5, %rd1;
setp.lt.s32 %p1, %r5, 0;
selp.b64 %rd67, %rd29, 0, %p1;
add.s64 %rd68, %rd67, %rd1;
mul.lo.s64 %rd69, %rd68, %rd121;
add.s64 %rd70, %rd69, %rd122;
selp.b64 %rd71, %rd48, 0, %p1;
add.s64 %rd72, %rd71, %rd1;
mul.lo.s64 %rd73, %rd72, %rd123;
add.s64 %rd74, %rd73, %rd124;
ld.u64 %rd2, [%rd74];
selp.b64 %rd75, %rd51, 0, %p1;
add.s64 %rd76, %rd75, %rd1;
mul.lo.s64 %rd77, %rd76, %rd125;
add.s64 %rd78, %rd77, %rd126;
ld.u64 %rd3, [%rd78];
ld.u64 %rd4, [%rd70];
setp.lt.s64 %p3, %rd4, 1;
@%p3 bra BB0_8;
ld.param.u64 %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
ld.param.u64 %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
selp.b64 %rd80, %rd127, 0, %p1;
mov.u64 %rd79, 0;
min.s64 %rd81, %rd128, %rd79;
min.s64 %rd82, %rd34, %rd79;
selp.b64 %rd83, %rd39, 0, %p1;
min.s64 %rd84, %rd40, %rd79;
min.s64 %rd85, %rd41, %rd79;
min.s64 %rd86, %rd42, %rd79;
selp.b64 %rd87, %rd54, 0, %p1;
min.s64 %rd88, %rd55, %rd79;
min.s64 %rd89, %rd56, %rd79;
min.s64 %rd90, %rd57, %rd79;
mul.lo.s64 %rd91, %rd90, %rd61;
add.s64 %rd92, %rd53, %rd91;
mul.lo.s64 %rd93, %rd89, %rd60;
add.s64 %rd94, %rd92, %rd93;
mul.lo.s64 %rd95, %rd88, %rd59;
add.s64 %rd96, %rd94, %rd95;
add.s64 %rd98, %rd87, %rd1;
mul.lo.s64 %rd99, %rd58, %rd98;
add.s64 %rd5, %rd96, %rd99;
mul.lo.s64 %rd100, %rd86, %rd46;
add.s64 %rd101, %rd38, %rd100;
mul.lo.s64 %rd102, %rd85, %rd45;
add.s64 %rd103, %rd101, %rd102;
mul.lo.s64 %rd104, %rd84, %rd44;
add.s64 %rd105, %rd103, %rd104;
add.s64 %rd106, %rd83, %rd1;
mul.lo.s64 %rd107, %rd43, %rd106;
add.s64 %rd6, %rd105, %rd107;
mul.lo.s64 %rd108, %rd82, %rd37;
add.s64 %rd109, %rd31, %rd108;
mul.lo.s64 %rd110, %rd81, %rd36;
add.s64 %rd111, %rd109, %rd110;
add.s64 %rd112, %rd80, %rd1;
mul.lo.s64 %rd113, %rd35, %rd112;
add.s64 %rd7, %rd111, %rd113;
add.s64 %rd8, %rd2, 1;
mov.u64 %rd131, %rd79;
BB0_3:
mul.lo.s64 %rd115, %rd59, %rd131;
add.s64 %rd10, %rd5, %rd115;
mul.lo.s64 %rd116, %rd44, %rd131;
add.s64 %rd11, %rd6, %rd116;
setp.lt.s64 %p4, %rd3, 1;
mov.u64 %rd130, %rd79;
mov.u64 %rd132, %rd3;
@%p4 bra BB0_7;
BB0_4:
mov.u64 %rd13, %rd132;
mov.u64 %rd12, %rd130;
mul.lo.s64 %rd117, %rd60, %rd12;
add.s64 %rd136, %rd10, %rd117;
mul.lo.s64 %rd118, %rd45, %rd12;
add.s64 %rd135, %rd11, %rd118;
mul.lo.s64 %rd119, %rd36, %rd12;
add.s64 %rd134, %rd7, %rd119;
setp.lt.s64 %p5, %rd2, 1;
mov.u64 %rd133, %rd8;
@%p5 bra BB0_6;
BB0_5:
mov.u64 %rd17, %rd133;
ld.f64 %fd1, [%rd135];
ld.f64 %fd2, [%rd134];
mul.f64 %fd3, %fd2, %fd1;
st.f64 [%rd136], %fd3;
add.s64 %rd136, %rd136, %rd61;
add.s64 %rd135, %rd135, %rd46;
add.s64 %rd134, %rd134, %rd37;
add.s64 %rd24, %rd17, -1;
setp.gt.s64 %p6, %rd24, 1;
mov.u64 %rd133, %rd24;
@%p6 bra BB0_5;
BB0_6:
add.s64 %rd25, %rd13, -1;
add.s64 %rd26, %rd12, 1;
setp.gt.s64 %p7, %rd13, 1;
mov.u64 %rd130, %rd26;
mov.u64 %rd132, %rd25;
@%p7 bra BB0_4;
BB0_7:
sub.s64 %rd120, %rd4, %rd131;
add.s64 %rd131, %rd131, 1;
setp.gt.s64 %p8, %rd120, 1;
@%p8 bra BB0_3;
BB0_8:
ret;
}
All of that integer arithmetic to perform exactly one double-precision multiply!