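I have an NVIDIA GeForce GTX 770 and would like to use its CUDA capabilities for a project I am working on. My machine is running Windows 10 64-bit.
I have followed the provided CUDA Toolkit installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/.
After installing the drivers, I opened the samples solution (using Visual Studio 2019) and built the deviceQuery and bandwidthTest samples. Here is the output:
deviceQuery: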
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.3\bin\win64\Debug\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 770"
CUDA Driver Version / Runtime Version 11.3 / 11.3
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147483648 bytes)
(008) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1137 MHz (1.14 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS
bandwidthTest:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA GeForce GTX 770
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 3.1
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 3.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 161.7
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
However, when I try to run any other sample, for example the starter code provided by the CUDA 11.3 Runtime template:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int* c, const int* a, const int* b) {
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main() {
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int* c, const int* a, const int* b, unsigned int size) {
    int* dev_a = 0;
    int* dev_b = 0;
    int* dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}
I get the following error:
addKernel launch failed: no kernel image is available for execution on the device
addWithCuda failed!
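From this table: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#support-hardware__table-hardware-support you can see that my GPU's compute capability version (3.0) is in fact compatible with the installed driver (465.19.01+), so why can't I run any code other than the query and bandwidth tests?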
I had a similar problem. My laptop has a GeForce 940MX card with CUDA compute capability 5.0, and the CUDA driver version is 11.7.
The way I solved it was to include
compute_50,sm_50
in the Properties > CUDA C/C++ > Device > Code Generation
field. Hope this helps.
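To illustrate why adding a matching target makes the "no kernel image is available" error go away, here is a minimal host-only C++ sketch of the compatibility rule as I understand it from the CUDA documentation: SASS built for sm_XY runs on devices with the same major version and an equal or higher minor version, while PTX built for compute_XY can be JIT-compiled for any device at or above XY. The names `KernelImage` and `hasUsableImage` are hypothetical and only model the rule; the real selection happens inside the CUDA driver.

```cpp
#include <vector>

// One entry embedded in a fat binary: either SASS for a concrete GPU
// (sm_XY) or PTX for a virtual architecture (compute_XY).
// Hypothetical type, for illustration only.
struct KernelImage {
    bool isPtx;  // true: compute_XY (PTX), false: sm_XY (SASS)
    int major;
    int minor;
};

// Returns true if at least one embedded image is usable on a device with
// compute capability deviceMajor.deviceMinor (simplified model).
bool hasUsableImage(const std::vector<KernelImage>& images,
                    int deviceMajor, int deviceMinor) {
    for (const KernelImage& img : images) {
        if (img.isPtx) {
            // PTX can be JIT-compiled for any device at or above its target.
            if (deviceMajor > img.major ||
                (deviceMajor == img.major && deviceMinor >= img.minor)) {
                return true;
            }
        } else {
            // SASS runs only within the same major version, on an equal
            // or higher minor version.
            if (deviceMajor == img.major && deviceMinor >= img.minor) {
                return true;
            }
        }
    }
    return false;
}
```

Under this model, a binary containing only compute_52 and sm_52 images (an assumed project setting, used here purely as an example) has nothing a 5.0 device can use, whereas one built with compute_50,sm_50 does.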
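Your GTX 770 GPU is a "Kepler" architecture compute capability 3.0 device. These devices were deprecated during the CUDA 10 release cycle, and support for them was dropped from CUDA 11.0 onwards.
The CUDA 10.2 release is the last toolkit with support for compute 3.0 devices. You will not be able to make CUDA 11.0 or newer work with your GPU. The query and bandwidth tests use APIs which do not attempt to run code on your GPU, which is why they work where any other example will not.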
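As a rough way to check a toolkit/GPU pairing before installing anything, the rule above can be written as a small lookup. This is only a sketch: the table reflects my reading of the CUDA Toolkit release notes for major versions 8 through 12 (the lowest compute capability nvcc can still target), and the names `kMinComputeCapability` and `toolkitCanTarget` are made up for this example.

```cpp
#include <map>

// Lowest compute capability (major * 10 + minor) that each toolkit major
// version can still compile for. Assumed from the release notes; versions
// outside 8..12 are not modeled.
static const std::map<int, int> kMinComputeCapability = {
    {8, 20},   // Fermi still supported
    {9, 30},   // Fermi removed
    {10, 30},  // Kepler sm_30 still supported (10.2 is the last such release)
    {11, 35},  // sm_30 removed
    {12, 50},  // remaining Kepler targets removed
};

// Returns true if the given toolkit major version can generate code for a
// device of compute capability ccMajor.ccMinor. Unmodeled toolkit versions
// are reported as false.
bool toolkitCanTarget(int toolkitMajor, int ccMajor, int ccMinor) {
    auto it = kMinComputeCapability.find(toolkitMajor);
    if (it == kMinComputeCapability.end()) {
        return false;
    }
    return ccMajor * 10 + ccMinor >= it->second;
}
```

For the GTX 770 in the question, `toolkitCanTarget(11, 3, 0)` is false and `toolkitCanTarget(10, 3, 0)` is true, which matches the advice to stay on CUDA 10.2.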
I also have a GeForce 940MX, but in my case I am using Kubuntu 22.04, and I solved the problem by adding support for this platform to the compile command:
nvcc TestCUDA.cu -o testcu.bin --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52
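After that, the code worked correctly. However, you do need error-handling code to find out what went wrong at run time. In my case, I had not included
cudaPeekAtLastError()
and no error was shown.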
Below are the supported SM variants and sample cards for each generation (source: Medium - Matching SM architectures (CUDA arch and CUDA gencode) for various NVIDIA cards).
Supported on CUDA 7 and later
Fermi (CUDA 3.2 through CUDA 8; removed in CUDA 9):
SM20 or SM_20, compute_20: older cards such as GeForce 400, 500, 600, GT-630
Kepler (CUDA 5 and later; SM30 dropped in CUDA 11):
SM30 or SM_30, compute_30: Kepler architecture (generic: Tesla K40/K80, GeForce 700, GT-730). Adds support for unified memory programming.
SM35 or SM_35, compute_35: more specific to Tesla K40. Adds support for dynamic parallelism. Shows no real benefit over SM30 in my experience.
SM37 or SM_37, compute_37: more specific to Tesla K80. Adds a few more registers. Shows no real benefit over SM30 in my experience.
Maxwell (CUDA 6 and later):
SM50 or SM_50, compute_50: Tesla/Quadro M series
SM52 or SM_52, compute_52: Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X
SM53 or SM_53, compute_53: Tegra (Jetson) TX1 / Tegra X1
Pascal (CUDA 8 and later):
SM60 or SM_60, compute_60: Quadro GP100, Tesla P100, DGX-1 (generic Pascal)
SM61 or SM_61, compute_61: GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4, discrete GPU on the NVIDIA Drive PX2
SM62 or SM_62, compute_62: integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
Volta (CUDA 9 and later):
SM70 or SM_70, compute_70: DGX-1 with Volta, Tesla V100, Titan V, Quadro GV100
SM72 or SM_72, compute_72: Jetson AGX Xavier
Turing (CUDA 10 and later):
SM75 or SM_75, compute_75: Turing GTX/RTX cards such as GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000
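If you only know the compute capability that deviceQuery reports, the list above can be turned into a small lookup. The following is a sketch covering only the generations listed here; `architectureName` is a hypothetical helper and not part of any CUDA API.

```cpp
#include <string>

// Maps a compute capability (as printed by deviceQuery) to the architecture
// name used in the list above. Only Fermi through Turing are covered;
// anything else is reported as "unknown".
std::string architectureName(int major, int minor) {
    switch (major) {
        case 2: return "Fermi";
        case 3: return "Kepler";
        case 5: return "Maxwell";
        case 6: return "Pascal";
        case 7: return minor >= 5 ? "Turing" : "Volta";  // 7.0/7.2 vs 7.5
        default: return "unknown";
    }
}
```

For example, the GTX 770 from the question (3.0) maps to Kepler, and the GeForce 940MX mentioned in the answers (5.0) maps to Maxwell.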
I ran into the same problem on my laptop with a GeForce 930MX card and got this message:
Calling whisper (CPP cuBLAS) with: C:\Users\Waseem\Downloads\Compressed\SubtitleEditBeta\Whisper\CPP cuBLAS\main.exe --language en --model "C:\Users\Waseem\Downloads\Compressed\ SubtitleEditBeta\Whisper\Cpp\Models ase.en.bin" --output-srt --print-progress "C:\Users\Waseem\AppData\Local\Temp e5e63ec-f329-4552-82dd-21b2508d5ef3.wav"
Does anyone know how to solve this, and which CUDA Toolkit version is suitable?