TensorFlow fails to select the GPU even though it is recognized

Question

I am trying to set up TensorFlow to run on my GPU (GTX 1070).

I installed the latest NVIDIA driver

546.29-desktop-win10-win11-64bit-international-dch-whql.exe

Output of nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.29                 Driver Version: 546.29       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070      WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   61C    P0              37W / 230W |   1714MiB /  8192MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

To use the GPU, I followed the TensorFlow installation guide for Docker.

Here is my Dockerfile

FROM tensorflow/tensorflow:latest-gpu

WORKDIR /tf-gpu

COPY requirements.txt requirements.txt

RUN pip install -r requirements.txt

EXPOSE 8888

ENTRYPOINT [ "jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser" ]

Here is my docker-compose.yaml

version: '1.0'
services:
  jupyter-lab:
    build: .
    ports:
      - 8888:8888
    volumes:
      - ./tf-gpu:/tf-gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Here is requirements.txt (I want JupyterLab to run in the container)

jupyterlab
pandas
matplotlib

First indication of a problem (maybe?)

After starting this container with

docker-compose up

I imported tensorflow in an .ipynb file and got the following error

import tensorflow as tf

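2023-12-07 23:53:31.948645: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-07 23:53:31.948714: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-07 23:53:31.949559: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-07 23:53:31.955580: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.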

I am running a Ryzen 5 3600. I don't know whether that is relevant.

The real problem

Then I checked whether the GPU is recognized.

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_logical_devices()

Num GPUs Available:  1

2023-12-08 00:04:28.072720: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.105405: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.105450: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107896: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107935: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107952: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307162: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307210: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-12-08 00:04:28.307249: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6731 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1

[LogicalDevice(name='/device:CPU:0', device_type='CPU'),
 LogicalDevice(name='/device:GPU:0', device_type='GPU')]

Clearly the GPU is recognized.

However, when I run this

with tf.device('/device:GPU:1'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

    # Run on the GPU
    c = tf.matmul(a, b)
    print(c)

no runtime error occurs??
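One possible explanation, stated as an assumption rather than a confirmed diagnosis: TensorFlow 2 enables soft device placement by default, so a request for a device that does not exist (here /device:GPU:1, while only GPU:0 appears in the logical device list above) can silently fall back to an available device instead of raising. The sketch below compares a requested device name against that list and, inside run_strict, turns soft placement off so that an invalid request should surface as an error. device_exists and run_strict are hypothetical helper names, not TensorFlow APIs.

```python
def device_exists(requested, logical_names):
    """Return True if the requested device (e.g. '/device:GPU:1') is listed.

    Only the trailing 'TYPE:INDEX' part is compared, so fully qualified names
    such as '/job:localhost/replica:0/task:0/device:GPU:0' also match.
    """
    def key(name):
        return tuple(name.upper().split(":")[-2:])

    return key(requested) in {key(n) for n in logical_names}


def run_strict(device):
    """Run a small matmul on `device` with soft placement disabled.

    TensorFlow is imported here so the helper above stays usable without it.
    """
    import tensorflow as tf

    tf.config.set_soft_device_placement(False)   # do not fall back silently
    tf.debugging.set_log_device_placement(True)  # log where each op runs
    with tf.device(device):
        a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
        b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        c = tf.matmul(a, b)
    return c.device  # device string the result actually lives on


# Device names copied from the list_logical_devices() output above.
listed = ['/device:CPU:0', '/device:GPU:0']
print(device_exists('/device:GPU:1', listed))  # False
print(device_exists('/device:GPU:0', listed))  # True
```

Calling run_strict('/device:GPU:0') in the notebook should report a GPU device string, while run_strict('/device:GPU:1') is expected to fail on a single-GPU machine once the fallback is disabled.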

Also, when I train any kind of model, training takes about as long as it does on my laptop, which has no GPU and an Intel i7-1165G7 @ 2.80 GHz.

For example, the MNIST example

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.utils import to_categorical

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

# Load MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocessing the data
train_images = train_images.reshape((train_images.shape[0], 28, 28, 1))
test_images = test_images.reshape((test_images.shape[0], 28, 28, 1))

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# Convert labels to one-hot encoding
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Build the model
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, validation_data=(test_images, test_labels), epochs=5)

# Evaluate the model
model.evaluate(test_images, test_labels)

produces this output and takes the same time as on my laptop

Num GPUs Available:  1

Epoch 1/5
2023-12-08 00:16:20.796334: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-08 00:16:21.312037: I external/local_xla/xla/service/service.cc:168] XLA service 0x7efafe8c24a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-08 00:16:21.312088: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1070, Compute Capability 6.1
2023-12-08 00:16:21.328170: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701994581.439258     134 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1875/1875 [==============================] - 15s 7ms/step - loss: 0.1586 - accuracy: 0.9532 - val_loss: 0.0587 - val_accuracy: 0.9805
Epoch 2/5
1875/1875 [==============================] - 13s 7ms/step - loss: 0.0553 - accuracy: 0.9826 - val_loss: 0.0558 - val_accuracy: 0.9809
Epoch 3/5
1875/1875 [==============================] - 14s 7ms/step - loss: 0.0379 - accuracy: 0.9883 - val_loss: 0.0408 - val_accuracy: 0.9862
Epoch 4/5
1875/1875 [==============================] - 15s 8ms/step - loss: 0.0258 - accuracy: 0.9919 - val_loss: 0.0450 - val_accuracy: 0.9844
Epoch 5/5
1875/1875 [==============================] - 14s 8ms/step - loss: 0.0174 - accuracy: 0.9945 - val_loss: 0.0446 - val_accuracy: 0.9864
313/313 [==============================] - 2s 6ms/step - loss: 0.0446 - accuracy: 0.9864

[0.044628895819187164, 0.9864000082015991]

I have no idea how to solve this. Thanks in advance for any help.

Answer 1

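Based on the information given, I am not entirely sure what is causing the problem.

A few things I can think of:

Your cuDNN and TensorFlow versions may be incompatible. Check the compatibility matrix: https://www.tensorflow.org/install/source
Enable NUMA support. A kernel without NUMA support may slow down machine-learning tasks because it has to access memory more often; NUMA handles this efficiently.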
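To make the compatibility check concrete, the sketch below prints the TensorFlow version together with the CUDA and cuDNN versions the binary was built against, so they can be compared with the matrix. It assumes tf.sysconfig.get_build_info() exposes cuda_version and cudnn_version keys on CUDA builds; summarize_build and report are hypothetical helper names.

```python
def summarize_build(tf_version, build_info):
    """Format the versions relevant to the compatibility matrix.

    `build_info` is a dict-like object; missing keys are reported as
    'unknown' instead of raising, in case a build does not provide them.
    """
    cuda = build_info.get("cuda_version", "unknown")
    cudnn = build_info.get("cudnn_version", "unknown")
    return f"TensorFlow {tf_version} | built for CUDA {cuda}, cuDNN {cudnn}"


def report():
    """Print the versions of the TensorFlow installed in this environment."""
    import tensorflow as tf  # imported lazily; call report() inside the container

    print(summarize_build(tf.__version__, tf.sysconfig.get_build_info()))
    print("GPUs:", tf.config.list_physical_devices("GPU"))
```

Running report() in a notebook cell gives the build-time versions. The "Loaded cuDNN version 8906" line in the question's training log shows the cuDNN that was loaded at runtime, which is the other half of the comparison.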

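To check for NUMA support, run cat /boot/config-$(uname -r) | grep CONFIG_NUMA and recompile the kernel if it is not enabled.

In some cases you can use NUMA emulation to work around the lack of native NUMA support: numactl --interleave=all your_application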
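The NUMA messages in the question come from TensorFlow failing to read a sysfs file for the GPU's PCI device. As a small check that can be run from the notebook, the sketch below reads that file directly. The path layout is taken from the log lines, and read_numa_node is a hypothetical helper name. Note that those log lines are at the I (info) level, and the last ones report that TensorFlow defaulted to NUMA node 0 and still created the GPU device.

```python
from pathlib import Path


def read_numa_node(pci_bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node recorded for a PCI device.

    Returns None when the sysfs entry is missing or unreadable, which is
    the situation the log messages in the question describe.
    """
    path = Path(sysfs_root) / pci_bus_id / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


# PCI bus id copied from the log lines in the question.
print(read_numa_node("0000:05:00.0"))
```

A result of None means the file is absent in the container, matching the "could not open file to read NUMA node" messages.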
