I installed WSL2 in order to set up RAPIDS.
However, importing cudf still fails with the following error:
/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions:
stdout:
stderr:
Traceback (most recent call last):
File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
self.cuInit(0)
File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 331, in safe_cuda_api_call
self._check_ctypes_error(fname, retcode)
File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 399, in _check_ctypes_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
self.ensure_initialized()
File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
raise CudaSupportError(f"Error at driver init: {description}")
...
Not patching Numba
warnings.warn(msg, UserWarning)
---------------------------------------------------------------------------
CudaSupportError Traceback (most recent call last)
/mnt/d/learn-rapids/Untitled.ipynb Cell 4 line 1
----> 1 import cudf
File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/__init__.py:26
20 from cudf.api.extensions import (
21 register_dataframe_accessor,
22 register_index_accessor,
23 register_series_accessor,
24 )
25 from cudf.api.types import dtype
---> 26 from cudf.core.algorithms import factorize
27 from cudf.core.cut import cut
28 from cudf.core.dataframe import DataFrame, from_dataframe, from_pandas, merge
File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/core/algorithms.py:10
8 from cudf.core.copy_types import BooleanMask
9 from cudf.core.index import RangeIndex, as_index
---> 10 from cudf.core.indexed_frame import IndexedFrame
11 from cudf.core.scalar import Scalar
12 from cudf.options import get_option
File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/core/indexed_frame.py:59
57 from cudf.core.dtypes import ListDtype
...
302 if USE_NV_BINDING:
303 return self._cuda_python_wrap_fn(fname)
CudaSupportError: Error at driver init:
Call to cuInit results in CUDA_ERROR_NO_DEVICE (100):
I used the latest install command below:
conda create --solver=libmamba -n rapids-23.12 -c rapidsai-nightly -c conda-forge -c nvidia \
cudf=23.12 cuml=23.12 python=3.10 cuda-version=12.0 \
jupyterlab
NVIDIA-SMI 545.23.05 Driver Version: 545.84 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:01:00.0 On | Off |
| 30% 53C P3 54W / 300W | 1783MiB / 49140MiB | 10% Default |
| | | N/A
In addition, cudf is present in the conda environment:
cudf 23.12.00a cuda12_py310_231028_g2a923dfff8_124 rapidsai-nightly
cuml 23.12.00a cuda12_py310_231028_gff635fc25_31 rapidsai-nightly
I also ran numba -s inside the WSL environment and got the following:
__CUDA Information__
CUDA Device Initialized : False
CUDA Driver Version : ?
CUDA Runtime Version : ?
CUDA NVIDIA Bindings Available : ?
CUDA NVIDIA Bindings In Use : ?
CUDA Minor Version Compatibility Available : ?
CUDA Minor Version Compatibility Needed : ?
CUDA Minor Version Compatibility In Use : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None
__Warning log__
Warning (cuda): CUDA device initialisation problem. Message:Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_quota_us
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_period_us
It looks like CUDA is not initializing inside WSL; however, when I run the same command from a Windows prompt, it returns:
__CUDA Information__
CUDA Device Initialized : True
CUDA Driver Version : ?
CUDA Runtime Version : ?
CUDA NVIDIA Bindings Available : ?
CUDA NVIDIA Bindings In Use : ?
CUDA Minor Version Compatibility Available : ?
CUDA Minor Version Compatibility Needed : ?
CUDA Minor Version Compatibility In Use : ?
CUDA Detect Output:
Found 1 CUDA devices
id 0 b'NVIDIA RTX A6000' [SUPPORTED]
Compute Capability: 8.6
PCI Device ID: 0
PCI Bus ID: 1
UUID: GPU-17e7be94-251e-a2d9-3924-d167c0e59a56
Watchdog: Enabled
Compute Mode: WDDM
FP32/FP64 Performance Ratio: 32
Summary:
1/1 devices are supported
CUDA Libraries Test Output:
None
__Warning log__
Warning (cuda): Probing CUDA failed (device and driver present, runtime problem?)
(cuda) <class 'FileNotFoundError'>: Could not find module 'cudart.dll' (or one of its dependencies). Try using the full path with constructor syntax.
The problem is solved. In the WSL instance, open ~/.bashrc in an editor:
nano ~/.bashrc
and insert the following lines:
export LD_LIBRARY_PATH="/usr/lib/wsl/lib/"
export NUMBA_CUDA_DRIVER="/usr/lib/wsl/lib/libcuda.so.1"
Then reload it:
source .bashrc
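As a sanity check before importing cudf again, you can confirm from Python that the two exports actually reached the interpreter's environment (a minimal sketch; the variable names match the exports above, but the helper function itself is hypothetical):

```python
import os

def missing_wsl_cuda_env(env=os.environ):
    """Return the names of the WSL CUDA variables that are not set."""
    wanted = ("LD_LIBRARY_PATH", "NUMBA_CUDA_DRIVER")
    return [name for name in wanted if not env.get(name)]

# An empty list means the exports from .bashrc reached this process;
# otherwise the shell that launched Python (or Jupyter) never sourced them.
print(missing_wsl_cuda_env())
```

A common pitfall is starting JupyterLab from a shell that was opened before .bashrc was edited, in which case both variables will show up as missing.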
In case this helps anyone else: I got a similar error,
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_OUT_OF_MEMORY (2)
with the following setup:
conda create --solver=libmamba -n rapids-23.12 -c rapidsai -c conda-forge -c nvidia rapids=23.12 python=3.10 cuda-version=12.0
In my case, things got confusing inside WSL because I have 4 discrete GPUs.
My error is specific to those using WSL2 with multiple GPUs in their setup. I recall reading that WSL2 only supports a single GPU (https://docs.rapids.ai/install#wsl2-conda: "Only single GPU is supported" and "GPU Direct Storage is not supported"), but it is not well documented that you may need to help Python locate the specific supported GPU.
To get past this error, you need to set the CUDA_VISIBLE_DEVICES environment variable explicitly. I suggest doing so in ~/.bashrc by adding the line:
export CUDA_VISIBLE_DEVICES=0
Note that this is zero-indexed and refers to the GPU's ID.
However, after some experimentation, I found that a RAPIDS install on WSL2 via Conda does in fact support multiple GPUs; in my case GPU ID 2 was the one causing the error, perhaps because it was fully in use by the host OS or something similar. Given my 4 GPUs, if I export CUDA_VISIBLE_DEVICES=0,1,2,3 and then run
import cudf
in Python, it errors out as above. But with export CUDA_VISIBLE_DEVICES=0,1,3 everything works.
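One caveat worth spelling out (a sketch of the ordering requirement, not a cudf-specific API): CUDA_VISIBLE_DEVICES must be set before the first CUDA library initializes, because the driver enumerates devices once. If you prefer setting it from inside Python rather than ~/.bashrc, it has to happen before the import:

```python
import os

# Hide the problematic GPU (ID 2 in my case). This must run before the
# first `import cudf` / `import numba.cuda` in the process, because the
# CUDA driver enumerates visible devices once, at initialization time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"

# ...only now import cudf / numba; changing the variable afterwards has no effect.
```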
In fact, when running
numba -s
it identifies all 3 GPUs as 0, 1, 2, so the indices appear to be renumbered according to the GPUs exposed through the environment variable. Likewise, when using XGBoost, I can target all 3 exposed GPUs with IDs 0, 1, 2 respectively.
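The renumbering can be illustrated like this (a toy sketch with a hypothetical helper name): the logical IDs that numba or XGBoost see are simply positions within the CUDA_VISIBLE_DEVICES list, not the physical GPU IDs:

```python
def logical_to_physical(visible_devices):
    """Map each logical GPU index (as seen by CUDA libraries) to the
    physical GPU ID listed in CUDA_VISIBLE_DEVICES."""
    physical = [int(x) for x in visible_devices.split(",")]
    return dict(enumerate(physical))

# With physical GPU 2 hidden, logical device 2 is actually physical GPU 3.
print(logical_to_physical("0,1,3"))  # {0: 0, 1: 1, 2: 3}
```

This is why numba -s reports devices 0, 1, 2 even though the hidden GPU was ID 2 in nvidia-smi's numbering.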