RAPIDS cannot import cudf: Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)

Question (votes: 0, answers: 2)

To install RAPIDS, I have installed WSL2.

However, I still get the following error when importing cudf:

/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions:

stdout:



stderr:

Traceback (most recent call last):
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    self.cuInit(0)
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 331, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 399, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 296, in __getattr__
    self.ensure_initialized()
  File "/home/zy-wsl/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 262, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
...


Not patching Numba
  warnings.warn(msg, UserWarning)
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...
---------------------------------------------------------------------------
CudaSupportError                          Traceback (most recent call last)
/mnt/d/learn-rapids/Untitled.ipynb Cell 4 line 1
----> 1 import cudf

File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/__init__.py:26
     20 from cudf.api.extensions import (
     21     register_dataframe_accessor,
     22     register_index_accessor,
     23     register_series_accessor,
     24 )
     25 from cudf.api.types import dtype
---> 26 from cudf.core.algorithms import factorize
     27 from cudf.core.cut import cut
     28 from cudf.core.dataframe import DataFrame, from_dataframe, from_pandas, merge

File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/core/algorithms.py:10
      8 from cudf.core.copy_types import BooleanMask
      9 from cudf.core.index import RangeIndex, as_index
---> 10 from cudf.core.indexed_frame import IndexedFrame
     11 from cudf.core.scalar import Scalar
     12 from cudf.options import get_option

File ~/miniconda3/envs/rapids-23.12/lib/python3.10/site-packages/cudf/core/indexed_frame.py:59
     57 from cudf.core.dtypes import ListDtype
...
    302 if USE_NV_BINDING:
    303     return self._cuda_python_wrap_fn(fname)

CudaSupportError: Error at driver init: 
Call to cuInit results in CUDA_ERROR_NO_DEVICE (100):

I tried the latest install command below:

conda create --solver=libmamba -n rapids-23.12 -c rapidsai-nightly -c conda-forge -c nvidia  \
    cudf=23.12 cuml=23.12 python=3.10 cuda-version=12.0 \
    jupyterlab
 NVIDIA-SMI 545.23.05              Driver Version: 545.84       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:01:00.0  On |                  Off |
| 30%   53C    P3              54W / 300W |   1783MiB / 49140MiB |     10%      Default |
|                                         |                      |                  N/A

In addition, cudf is present in the conda environment:

cudf                      23.12.00a       cuda12_py310_231028_g2a923dfff8_124    rapidsai-nightly
cuml                      23.12.00a       cuda12_py310_231028_gff635fc25_31    rapidsai-nightly

I also ran numba -s inside the WSL environment and got the following:

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Minor Version Compatibility Available    : ?
CUDA Minor Version Compatibility Needed       : ?
CUDA Minor Version Compatibility In Use       : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__Warning log__
Warning (cuda): CUDA device initialisation problem. Message:Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_quota_us
Warning (no file): /sys/fs/cgroup/cpuacct/cpu.cfs_period_us

It seems CUDA is not being initialized inside WSL, but when I run the same command from the Windows prompt, it returns:

__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Minor Version Compatibility Available    : ?
CUDA Minor Version Compatibility Needed       : ?
CUDA Minor Version Compatibility In Use       : ?
CUDA Detect Output:
Found 1 CUDA devices
id 0     b'NVIDIA RTX A6000'                              [SUPPORTED]
                      Compute Capability: 8.6
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-17e7be94-251e-a2d9-3924-d167c0e59a56
                                Watchdog: Enabled
                            Compute Mode: WDDM
             FP32/FP64 Performance Ratio: 32
Summary:
        1/1 devices are supported

CUDA Libraries Test Output:
None
__Warning log__
Warning (cuda): Probing CUDA failed (device and driver present, runtime problem?)
(cuda) <class 'FileNotFoundError'>: Could not find module 'cudart.dll' (or one of its dependencies). Try using the full path with constructor syntax.
python machine-learning cuda rapids
2 Answers
0 votes

The problem is solved. Inside the WSL instance, open .bashrc in nano:

sudo nano .bashrc

Insert the following:

export LD_LIBRARY_PATH="/usr/lib/wsl/lib/"  
export NUMBA_CUDA_DRIVER="/usr/lib/wsl/lib/libcuda.so.1"

Then:

source .bashrc
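
For anyone who would rather check this from Python, here is a minimal sketch of the same two settings. The helper names `wsl_cuda_env` and `driver_present` are hypothetical (they are not part of Numba or RAPIDS), and the default directory is simply the one used in the exports above. Unlike the plain export, this version prepends to an existing LD_LIBRARY_PATH instead of replacing it.

```python
import os
from pathlib import Path

# Assumed location of the WSL driver libraries, taken from the exports above.
WSL_LIB_DIR = "/usr/lib/wsl/lib"


def wsl_cuda_env(current_env, lib_dir=WSL_LIB_DIR):
    """Return a copy of current_env with LD_LIBRARY_PATH and NUMBA_CUDA_DRIVER set.

    Hypothetical helper: lib_dir is prepended to any existing LD_LIBRARY_PATH
    rather than overwriting it.
    """
    lib_dir = lib_dir.rstrip("/")
    env = dict(current_env)
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    env["NUMBA_CUDA_DRIVER"] = lib_dir + "/libcuda.so.1"
    return env


def driver_present(lib_dir=WSL_LIB_DIR):
    """True if libcuda.so.1 exists in lib_dir."""
    return (Path(lib_dir) / "libcuda.so.1").exists()


if __name__ == "__main__":
    print("libcuda.so.1 found:", driver_present())
    env = wsl_cuda_env(os.environ)
    for key in ("LD_LIBRARY_PATH", "NUMBA_CUDA_DRIVER"):
        print(key, "=", env[key])
```

If `driver_present()` prints False inside WSL, the exports above will probably not help, since the driver library itself is missing from that directory.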

0 votes

In case this helps anyone else, I ran into a similar error:

numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_OUT_OF_MEMORY (2)

with the following system configuration:

  • Host OS: Microsoft Windows 11 Pro, Version 10.0.22621 Build 22621
  • Latest NVIDIA driver (546.33) running on the host
  • Fresh WSL2 install of Ubuntu 22.04.3 LTS
  • Installed Miniconda3-py310_23.11.0-2-Linux-x86_64.sh
  • Installed RAPIDS via the WSL2 Conda install (preferred method)
  • The specific command executed in WSL2:
    conda create --solver=libmamba -n rapids-23.12 -c rapidsai -c conda-forge -c nvidia  rapids=23.12 python=3.10 cuda-version=12.0
  • Activated the newly created rapids-23.12 Conda environment

In my case, because I have 4 discrete GPUs, things got confused inside WSL.

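My error only applies to people using WSL2 with multiple GPUs in their setup. I recall reading that WSL2 supports only 1 GPU (https://docs.rapids.ai/install#wsl2-conda: "Only single GPU is supported" and "GPU Direct Storage is not supported"). However, nothing is documented in detail about needing to help Python locate the specific GPU that is supported.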

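To get past this error, you need to set the CUDA_VISIBLE_DEVICES environment variable explicitly. I suggest making it permanent by adding the following line to ~/.bashrc: export CUDA_VISIBLE_DEVICES=0

Note that this is zero-indexed and refers to the ID of the GPU.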
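
If you would rather not touch ~/.bashrc, the same variable can be set from Python, as long as it happens before anything in the process initializes CUDA. A minimal sketch follows; `restrict_visible_gpus` is a hypothetical helper name, not a RAPIDS or Numba function.

```python
import os


def restrict_visible_gpus(device_ids):
    """Set CUDA_VISIBLE_DEVICES for this process and return the value written.

    Must be called before the first import that initializes CUDA
    (for example cudf), otherwise it has no effect on that process.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_visible_gpus([0])
# import cudf  # uncomment on a machine with RAPIDS installed
```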

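However, after some experimentation, I found that RAPIDS installed via Conda on WSL2 does support multiple GPUs. In my case, GPU ID 2 was the one causing the error, possibly because it is fully used by the host OS or something similar. With my 4 GPUs, if I export CUDA_VISIBLE_DEVICES=0,1,2,3 and then try import cudf in Python, it fails as described above. But if I export CUDA_VISIBLE_DEVICES=0,1,3, everything works.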
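
To find the problematic GPU without trial and error, something along these lines could work. It is only a sketch: `find_failing_gpus` is a hypothetical helper, and it assumes that running `import cudf` in a fresh process with a single GPU visible is enough to reproduce the driver init failure.

```python
import os
import subprocess
import sys


def find_failing_gpus(device_ids, probe_cmd=None):
    """Run probe_cmd once per GPU with only that GPU visible.

    Returns the IDs for which the probe exits with a non-zero status.
    """
    if probe_cmd is None:
        # Assumption: importing cudf triggers CUDA driver initialization.
        probe_cmd = [sys.executable, "-c", "import cudf"]
    failing = []
    for dev in device_ids:
        # Each probe runs in its own process with a single visible GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(dev))
        result = subprocess.run(probe_cmd, env=env, capture_output=True)
        if result.returncode != 0:
            failing.append(dev)
    return failing
```

On my kind of setup you would call `find_failing_gpus([0, 1, 2, 3])` inside the activated rapids environment and then leave the returned IDs out of CUDA_VISIBLE_DEVICES.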

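In fact, when running numba -s, it identifies all 3 GPUs as 0, 1, 2, so the indices appear to be renumbered based on the GPUs exposed by the environment variable. Also, when using XGBoost, I can target all 3 GPUs exposed through the environment variable using IDs 0, 1 and 2 respectively.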
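
The renumbering can be illustrated with a small sketch. `renumbered_devices` is a hypothetical helper that assumes CUDA_VISIBLE_DEVICES contains only numeric IDs (not GPU UUIDs); it maps the index a process sees to the physical GPU ID.

```python
def renumbered_devices(cuda_visible_devices):
    """Map in-process device index -> physical GPU ID.

    Assumes a comma-separated list of numeric IDs, e.g. "0,1,3".
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {logical: int(physical) for logical, physical in enumerate(ids)}


print(renumbered_devices("0,1,3"))
```

So with CUDA_VISIBLE_DEVICES=0,1,3, the device that numba and XGBoost call ID 2 is physical GPU 3.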
