TensorFlow cannot select the GPU even though the GPU is recognized

Question (votes: 0, answers: 1)

I am trying to set up TensorFlow to run on my GPU (GTX 1070).

  1. I installed the latest NVIDIA driver:
    546.29-desktop-win10-win11-64bit-international-dch-whql.exe

Output of nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.29                 Driver Version: 546.29       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070      WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   61C    P0              37W / 230W |   1714MiB /  8192MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  2. To use the GPU, I followed the TensorFlow installation guide for Docker.

Here is my Dockerfile:

FROM tensorflow/tensorflow:latest-gpu

WORKDIR /tf-gpu

COPY requirements.txt requirements.txt

RUN pip install -r requirements.txt

EXPOSE 8888

ENTRYPOINT [ "jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser" ]

Here is my docker-compose.yaml:

version: '1.0'
services:
  jupyter-lab:
    build: .
    ports:
      - 8888:8888
    volumes:
      - ./tf-gpu:/tf-gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Here is requirements.txt (I want JupyterLab to run in the container):

jupyterlab
pandas
matplotlib

First sign of a problem (maybe?)

After starting this container with docker-compose up, I imported tensorflow in an .ipynb file and got the following errors:

import tensorflow as tf

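2023-12-07 23:53:31.948645: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-07 23:53:31.948714: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-07 23:53:31.949559: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-07 23:53:31.955580: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.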

I am running a Ryzen 5 3600. I don't know whether that is relevant.

The real problem

Then I checked whether the GPU is recognized.

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_logical_devices()
Num GPUs Available:  1

2023-12-08 00:04:28.072720: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.105405: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.105450: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107896: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107935: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107952: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307162: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307210: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-12-08 00:04:28.307249: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6731 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1

[LogicalDevice(name='/device:CPU:0', device_type='CPU'),
 LogicalDevice(name='/device:GPU:0', device_type='GPU')]

Clearly the GPU is recognized.

However, when I run this:

with tf.device('/device:GPU:1'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

    # Run on the GPU
    c = tf.matmul(a, b)
    print(c)

No runtime error is raised??
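
One possible explanation, offered as an assumption rather than a confirmed diagnosis: TensorFlow 2 normally runs with soft device placement enabled, so a request for a device that does not exist (here /device:GPU:1, while only GPU:0 is listed) may silently fall back to an available device instead of raising an error. Below is a minimal sketch for checking where an op actually ran, using tf.debugging.set_log_device_placement and the result tensor's device attribute. matmul_on is a hypothetical helper name.

```python
import tensorflow as tf

# Log the device that each op is placed on.
tf.debugging.set_log_device_placement(True)


def matmul_on(device_name):
    """Run a small matmul under tf.device and report where the result lives.

    matmul_on is a hypothetical helper, not part of TensorFlow.
    """
    with tf.device(device_name):
        a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
        b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        c = tf.matmul(a, b)
    # c.device is the full name of the device that holds the result.
    return c, c.device
```

In the notebook, matmul_on('/device:GPU:1') returns the tensor together with the device that actually holds it. tf.config.get_soft_device_placement() reports the current setting, and tf.config.set_soft_device_placement(False) should make a request for a missing device fail instead of falling back.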

Moreover, when I train any kind of model, training runs at roughly the same speed as on my laptop, which has no GPU and an Intel i7-1165G7 @ 2.80 GHz.

For example, the MNIST example:

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.utils import to_categorical

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

# Load MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocessing the data
train_images = train_images.reshape((train_images.shape[0], 28, 28, 1))
test_images = test_images.reshape((test_images.shape[0], 28, 28, 1))

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Convert labels to one-hot encoding
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Build the model
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, validation_data=(test_images, test_labels), epochs=5)

# Evaluate the model
model.evaluate(test_images, test_labels)

This produces the following output and takes the same time as on my laptop:

Num GPUs Available:  1

Epoch 1/5
2023-12-08 00:16:20.796334: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-08 00:16:21.312037: I external/local_xla/xla/service/service.cc:168] XLA service 0x7efafe8c24a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-08 00:16:21.312088: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1070, Compute Capability 6.1
2023-12-08 00:16:21.328170: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701994581.439258     134 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1875/1875 [==============================] - 15s 7ms/step - loss: 0.1586 - accuracy: 0.9532 - val_loss: 0.0587 - val_accuracy: 0.9805
Epoch 2/5
1875/1875 [==============================] - 13s 7ms/step - loss: 0.0553 - accuracy: 0.9826 - val_loss: 0.0558 - val_accuracy: 0.9809
Epoch 3/5
1875/1875 [==============================] - 14s 7ms/step - loss: 0.0379 - accuracy: 0.9883 - val_loss: 0.0408 - val_accuracy: 0.9862
Epoch 4/5
1875/1875 [==============================] - 15s 8ms/step - loss: 0.0258 - accuracy: 0.9919 - val_loss: 0.0450 - val_accuracy: 0.9844
Epoch 5/5
1875/1875 [==============================] - 14s 8ms/step - loss: 0.0174 - accuracy: 0.9945 - val_loss: 0.0446 - val_accuracy: 0.9864
313/313 [==============================] - 2s 6ms/step - loss: 0.0446 - accuracy: 0.9864

[0.044628895819187164, 0.9864000082015991]
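
To check whether the GPU gives any speedup independently of this small model, one option is to time the same large matrix multiplication on each device. This is a rough sketch that assumes TensorFlow is importable in the notebook. time_matmul and its default sizes are hypothetical choices, not something taken from the setup above.

```python
import time

import tensorflow as tf


def time_matmul(device_name, n=2048, repeats=10):
    """Return the mean seconds per n x n matmul on the given device.

    time_matmul is a hypothetical helper; n and repeats are arbitrary.
    """
    with tf.device(device_name):
        x = tf.random.uniform((n, n))
        # Warm-up run so that one-time setup cost is not measured.
        tf.matmul(x, x).numpy()
        start = time.perf_counter()
        for _ in range(repeats):
            y = tf.matmul(x, x)
        # Copying the last result to host memory waits for the queued
        # device work to finish before the clock is stopped.
        y.numpy()
        elapsed = time.perf_counter() - start
    return elapsed / repeats
```

Compare time_matmul('/device:CPU:0') with time_matmul('/device:GPU:0'). If the GPU is clearly faster here, the similar MNIST epoch times are more likely a result of the model being small (1875 steps per epoch at about 7 ms each, i.e. the default batch size of 32) than of the GPU being unused.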

I have no idea how to solve this. Thanks in advance for any help.

python docker tensorflow deep-learning cuda
1 answer
0 votes

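Based on the information given, I am not entirely sure what is causing the problem.

A few things I can think of:

  1. Your cuDNN and TensorFlow versions may be incompatible. Check the compatibility matrix at https://www.tensorflow.org/install/source

  2. Enable NUMA support. A kernel without NUMA support may slow down machine learning tasks because memory has to be accessed more times; NUMA handles this efficiently.

    To check for NUMA support, run cat /boot/config-$(uname -r) | grep CONFIG_NUMA and, if it is not enabled, recompile the kernel.

    In some cases you can work around the lack of native NUMA support with NUMA emulation: numactl --interleave=all your_application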
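
The NUMA check above can also be run from inside the notebook. This is a minimal sketch using only the Python standard library; the helper names are hypothetical. It assumes the kernel config is stored at /boot/config-<release>, which may not be the case inside a Docker container, so the code reports when the file is missing.

```python
import platform
from pathlib import Path


def numa_config_lines(config_text):
    """Return every line mentioning CONFIG_NUMA, like the grep above."""
    return [line for line in config_text.splitlines() if "CONFIG_NUMA" in line]


def kernel_has_numa(config_text):
    """True if the kernel config text enables NUMA (CONFIG_NUMA=y)."""
    return "CONFIG_NUMA=y" in config_text.splitlines()


def read_kernel_config():
    """Read /boot/config-<kernel release>; return None if it is unavailable."""
    path = Path("/boot") / f"config-{platform.release()}"
    try:
        return path.read_text()
    except OSError:
        return None


config = read_kernel_config()
if config is None:
    print("Kernel config not found under /boot; cannot check CONFIG_NUMA here.")
else:
    print("\n".join(numa_config_lines(config)))
    print("NUMA enabled:", kernel_has_numa(config))
```

Note that the NUMA lines in the question's log are logged at the I (info) level and TensorFlow reports that it defaults to NUMA node 0, so this check is mainly a way to confirm what the kernel supports.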
