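I've been struggling to get PyTorch to recognize my Nvidia Tesla P40. I run Ubuntu 24.04 as my daily driver on a Dell Precision 7910 Rack (I should probably be running 22.04).
I've tried manually installing the Nvidia Datacenter driver as well as the Ubuntu-packaged driver. The card is recognized by the OS: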
lspci | grep -i nvidia
83:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
and
sudo lshw -c video
*-display
description: VGA compatible controller
product: G200eR2
vendor: Matrox Electronics Systems Ltd.
physical id: 0
bus info: pci@0000:0b:00.0
logical name: /dev/fb0
version: 01
width: 32 bits
clock: 33MHz
capabilities: pm vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=mgag200 latency=64 maxlatency=32 mingnt=16 resolution=1600,1200
resources: irq:19 memory:90000000-90ffffff memory:91800000-91803fff memory:91000000-917fffff memory:c0000-dffff
*-display
description: 3D controller
product: GP102GL [Tesla P40]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:83:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:3f00-3eff iomemory:3f80-3f7f irq:116 memory:c8000000-c8ffffff memory:3f000000000-3f7ffffffff memory:3f800000000-3f801ffffff
When I try to get the CUDA device count with PyTorch:
>>> import torch
>>> print(torch.cuda.device_count())
0
If I check whether CUDA is available:
>>> import torch
>>> print(torch.cuda.is_available())
/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
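For anyone trying to reproduce this, it may help to check the driver layer and the PyTorch layer separately. Below is a minimal sketch, not a definitive diagnosis: the helper name `collect_cuda_diagnostics` is made up for illustration, and it assumes `nvidia-smi` (the utility that ships with the NVIDIA driver) would be on `PATH` if the driver installed correctly.

```python
import shutil
import subprocess


def collect_cuda_diagnostics():
    """Gather basic facts about the NVIDIA driver and the PyTorch build.

    Hypothetical helper for illustration; returns a plain dict.
    """
    info = {}

    # Driver layer: nvidia-smi talks to the kernel driver directly.
    # If it is missing or exits non-zero, the problem likely sits below PyTorch.
    smi = shutil.which("nvidia-smi")
    info["nvidia_smi_path"] = smi
    if smi is not None:
        try:
            result = subprocess.run(
                [smi], capture_output=True, text=True, timeout=30
            )
            info["nvidia_smi_returncode"] = result.returncode
            info["nvidia_smi_output"] = result.stdout or result.stderr
        except (OSError, subprocess.SubprocessError) as exc:
            info["nvidia_smi_error"] = repr(exc)

    # PyTorch layer: import lazily so the script still runs without torch.
    try:
        import torch
    except ImportError:
        info["torch_version"] = None
        return info

    info["torch_version"] = torch.__version__
    # torch.version.cuda is None on CPU-only builds of PyTorch.
    info["torch_built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If `nvidia-smi` itself fails, the driver is the first thing to look at; if it lists the P40 but `cuda_available` is still False, the mismatch is more likely between the driver and the CUDA runtime bundled with the installed PyTorch wheel.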
The end goal is to run some OpenAI Whisper translations, but Whisper fails even when I pass it the CUDA device:
(.venv) $ whisper --model base --device cuda:0 tests/jfv.flac
/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
File "/home/brett/Documents/whisper/.venv/bin/whisper", line 8, in <module>
sys.exit(cli())
^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/whisper/transcribe.py", line 577, in cli
model = load_model(model_name, device=device, download_root=model_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/whisper/__init__.py", line 146, in load_model
checkpoint = torch.load(fp, map_location=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 1422, in _load
result = unpickler.load()
^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 1392, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 1366, in load_tensor
wrap_storage=restore_location(storage, location),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 1296, in restore_location
return default_restore_location(storage, map_location)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 381, in default_restore_location
result = fn(storage, location)
^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
device = validate_cuda_device(location)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brett/Documents/whisper/.venv/lib/python3.11/site-packages/torch/serialization.py", line 258, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
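The RuntimeError above is a consequence of `torch.cuda.is_available()` returning False rather than a separate problem. As a stopgap while the GPU issue is unresolved, a script can fall back to the CPU instead of crashing. This is only a sketch: `pick_device` and `cuda_is_available` are hypothetical helper names, and CPU inference will be much slower than the P40 would be.

```python
def cuda_is_available() -> bool:
    """Return True only if torch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_device(requested: str, cuda_available: bool) -> str:
    """Return the requested device, or "cpu" if a CUDA device was
    requested but CUDA is not usable."""
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    device = pick_device("cuda:0", cuda_is_available())
    # The resulting string can be passed wherever a device is expected,
    # e.g. the --device option of the whisper command shown above.
    print(f"Using device: {device}")
```

This does not fix the underlying detection problem; it only keeps the workload running on the CPU until the P40 is visible to CUDA.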
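I tried a spare GTX 1650 and it works fine... except that the 4 GB VRAM limit is too small.
Any help would be greatly appreciated! Tomorrow I'll try running Windows 11.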
From the NVIDIA PyTorch release page.