TensorFlow GPU not working on Linux - TF-TRT Warning: Could not find TensorRT

Question

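I have been struggling to install all the drivers required by the tensorflow-gpu library. I want to use the GPU instead of the CPU for my models. I am using Linux Mint. Here is my neofetch output: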

             ...-:::::-...                 
          .-MMMMMMMMMMMMMMM-.              ----------- 
      .-MMMM`..-:::::::-..`MMMM-.          OS: Linux Mint 21.3 x86_64 
    .:MMMM.:MMMMMMMMMMMMMMM:.MMMM:.        Kernel: 5.15.0-101-generic 
   -MMM-M---MMMMMMMMMMMMMMMMMMM.MMM-       Uptime: 2 hours, 33 mins 
 `:MMM:MM`  :MMMM:....::-...-MMMM:MMM:`    Packages: 3307 (dpkg), 13 (flatpak) 
 :MMM:MMM`  :MM:`  ``    ``  `:MMM:MMM:    Shell: bash 5.1.16 
.MMM.MMMM`  :MM.  -MM.  .MM-  `MMMM.MMM.   Resolution: 1920x1080 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   DE: Cinnamon 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM:MMM:   WM: Mutter (Muffin) 
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   WM Theme: WhiteSur-Dark (Sweet-Dark-v40) 
.MMM.MMMM`  :MM:--:MM:--:MM:  `MMMM.MMM.   Theme: Sweet-Dark-v40 [GTK2/3] 
 :MMM:MMM-  `-MMMMMMMMMMMM-`  -MMM-MMM:    Icons: candy-icons [GTK2/3] 
  :MMM:MMM:`                `:MMM:MMM:     Terminal: gnome-terminal 
   .MMM.MMMM:--------------:MMMM.MMM.      CPU: Intel i5-3570 (4) @ 3.800GHz 
     '-MMMM.-MMMMMMMMMMMMMMM-.MMMM-'       GPU: NVIDIA GeForce GTX 1060 6GB 
       '.-MMMM``--:::::--``MMMM-.'         Memory: 2070MiB / 7883MiB 
            '-MMMMMMMMMMMMM-'
               ``-:::::-`` 

                                    
                                                               

Here is my nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off |   00000000:01:00.0  On |                  N/A |
| 25%   40C    P8              7W /  120W |     316MiB /   6144MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1142      G   /usr/lib/xorg/Xorg                            148MiB |
|    0   N/A  N/A      1881      G   cinnamon                                       45MiB |
|    0   N/A  N/A      9746      G   /app/extra/viber/Viber                         27MiB |
|    0   N/A  N/A     14935      G   ...seed-version=20240322-165906.502000         90MiB |
+-----------------------------------------------------------------------------------------+

I also installed tensorrt, cuDNN, and tensorflow-gpu. Most of the installations were done with pip. Here is my TensorRT version:

import tensorrt
print(tensorrt.__version__)  # prints 8.6.1
assert tensorrt.Builder(tensorrt.Logger())

The error I get is as follows:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2024-03-25 12:49:24.151959: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-25 12:49:24.939265: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-25 12:49:24.973806: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

I am not sure whether the conflict is caused by a version mismatch or by the library paths, but when I echo the path I get:

echo $LD_LIBRARY_PATH
:/home/vuk/miniconda3/lib/python3.1/site-packages/tensorrt_libs

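This has been bothering me for months...

I have repeatedly installed and uninstalled the libraries, configured paths, and installed TensorFlow GPU through Docker, but nothing has worked so far. The problem may be a mismatch between the library versions I am using, but I am not sure...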
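
One way to tell a path problem from a version mismatch is to ask the dynamic loader directly which GPU libraries it can open from the current environment. The following is a minimal sketch, not part of the original post. The sonames in CANDIDATE_LIBS are assumptions for a CUDA 12 / cuDNN 8 / TensorRT 8 stack and should be adjusted to whatever your TensorFlow build expects; probe_libraries is a hypothetical helper name.

```python
import ctypes

# Assumed sonames for a CUDA 12 / cuDNN 8 / TensorRT 8 stack.
# These are illustrative; adjust them to match your TensorFlow build.
CANDIDATE_LIBS = [
    "libcudart.so.12",
    "libcublas.so.12",
    "libcufft.so.11",
    "libcusolver.so.11",
    "libcusparse.so.12",
    "libcudnn.so.8",
    "libnvinfer.so.8",
]


def probe_libraries(names):
    """Try to dlopen each library; map name -> None on success, error text on failure."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in probe_libraries(CANDIDATE_LIBS).items():
        status = "OK" if error is None else "NOT LOADABLE (" + error + ")"
        print(name + ": " + status)
```

If a library is reported as not loadable even though pip installed it, the loader is probably not searching the directory that contains it, which points to LD_LIBRARY_PATH rather than to a version conflict.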

linux tensorflow nvidia tensorrt numa
1 Answer

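I solved this by creating two shell scripts for my conda environment. Conda sources every script in the environment's etc/conda/activate.d and etc/conda/deactivate.d directories when the environment is activated or deactivated. In the conda environment directory, navigate to those two directories; if they do not exist, create them. Then create a script file (for example set_env_vars.sh) in each of them.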

The first one, in activate.d, looks like this:

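#!/bin/sh
# Locate site-packages/nvidia, the parent directory of the pip-installed nvidia.cudnn package.
export NVIDIA_DIR="$(dirname "$(dirname "$(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)")")")"
# Prepend every */lib directory under it to LD_LIBRARY_PATH, joined with colons.
export LD_LIBRARY_PATH=$(echo ${NVIDIA_DIR}/*/lib/ | sed -r 's/\s+/:/g')${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}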

The second one, in deactivate.d, contains:

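#!/bin/sh
# Undo the activate.d script. Note: this clears LD_LIBRARY_PATH entirely, including any value set before activation.
unset NVIDIA_DIR
unset LD_LIBRARY_PATH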

Then I gave both scripts execute permission:

chmod +x /path/to/your/conda/env/etc/conda/activate.d/set_env_vars.sh
chmod +x /path/to/your/conda/env/etc/conda/deactivate.d/set_env_vars.sh
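
To make it easier to see what the activate.d one-liner computes, here is a rough Python equivalent of its glob-and-join step. It is only an illustrative sketch: nvidia_lib_dirs and build_ld_library_path are hypothetical helper names, and unlike the shell version the directories are returned sorted and without trailing slashes.

```python
import glob
import os


def nvidia_lib_dirs(nvidia_dir):
    """Return every <nvidia_dir>/*/lib directory, sorted by path."""
    pattern = os.path.join(nvidia_dir, "*", "lib")
    return sorted(path for path in glob.glob(pattern) if os.path.isdir(path))


def build_ld_library_path(nvidia_dir, existing=""):
    """Join the lib directories with ':' and append the existing value if it is non-empty."""
    parts = nvidia_lib_dirs(nvidia_dir)
    if existing:
        parts.append(existing)
    return ":".join(parts)
```

Calling build_ld_library_path with the directory that the script stores in NVIDIA_DIR, plus the current value of LD_LIBRARY_PATH, should list the same directories the script exports, which is a convenient way to confirm that the cuDNN and other CUDA wheels were actually found.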
