546.29-desktop-win10-win11-64bit-international-dch-whql.exe
Output from nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.29                 Driver Version: 546.29       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070      WDDM  | 00000000:05:00.0  On |                  N/A |
|  0%   61C    P0              37W / 230W |   1714MiB /  8192MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
This is my Dockerfile:
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /tf-gpu
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
EXPOSE 8888
ENTRYPOINT [ "jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser" ]
This is my docker-compose.yaml:
version: '1.0'
services:
  jupyter-lab:
    build: .
    ports:
      - 8888:8888
    volumes:
      - ./tf-gpu:/tf-gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
This is requirements.txt (I want JupyterLab to run in the container):
jupyterlab
pandas
matplotlib
After starting the container with
docker-compose up
I imported tensorflow in a .ipynb file and got the following messages:
import tensorflow as tf
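2023-12-07 23:53:31.948645: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-07 23:53:31.948714: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-07 23:53:31.949559: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-07 23:53:31.955580: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.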
I'm running a Ryzen 5 3600. I don't know whether that is relevant.
Then I checked whether the GPU is recognized.
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_logical_devices()
Num GPUs Available: 1
2023-12-08 00:04:28.072720: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.105405: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.105450: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107896: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107935: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.107952: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307162: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307210: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2022] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-12-08 00:04:28.307249: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:05:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-12-08 00:04:28.307745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6731 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1
[LogicalDevice(name='/device:CPU:0', device_type='CPU'),
LogicalDevice(name='/device:GPU:0', device_type='GPU')]
So the GPU is clearly recognized.
However, when I run this:
with tf.device('/device:GPU:1'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Run on the GPU
c = tf.matmul(a, b)
print(c)
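no runtime error occurs, even though there is no GPU:1 on this machine??
Also, whenever I train any kind of model, training takes roughly as long as it does on my laptop, which has no GPU and an Intel i7-1165G7 @ 2.80 GHz.
For example, the MNIST example: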
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.utils import to_categorical
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
# Load MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Preprocessing the data
train_images = train_images.reshape((train_images.shape[0], 28, 28, 1))
test_images = test_images.reshape((test_images.shape[0], 28, 28, 1))
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# Convert labels to one-hot encoding
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
# Build the model
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, validation_data=(test_images, test_labels), epochs=5)
# Evaluate the model
model.evaluate(test_images, test_labels)
produces this output and takes the same time as on my laptop:
Num GPUs Available: 1
Epoch 1/5
2023-12-08 00:16:20.796334: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-12-08 00:16:21.312037: I external/local_xla/xla/service/service.cc:168] XLA service 0x7efafe8c24a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-08 00:16:21.312088: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce GTX 1070, Compute Capability 6.1
2023-12-08 00:16:21.328170: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701994581.439258 134 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
1875/1875 [==============================] - 15s 7ms/step - loss: 0.1586 - accuracy: 0.9532 - val_loss: 0.0587 - val_accuracy: 0.9805
Epoch 2/5
1875/1875 [==============================] - 13s 7ms/step - loss: 0.0553 - accuracy: 0.9826 - val_loss: 0.0558 - val_accuracy: 0.9809
Epoch 3/5
1875/1875 [==============================] - 14s 7ms/step - loss: 0.0379 - accuracy: 0.9883 - val_loss: 0.0408 - val_accuracy: 0.9862
Epoch 4/5
1875/1875 [==============================] - 15s 8ms/step - loss: 0.0258 - accuracy: 0.9919 - val_loss: 0.0450 - val_accuracy: 0.9844
Epoch 5/5
1875/1875 [==============================] - 14s 8ms/step - loss: 0.0174 - accuracy: 0.9945 - val_loss: 0.0446 - val_accuracy: 0.9864
313/313 [==============================] - 2s 6ms/step - loss: 0.0446 - accuracy: 0.9864
[0.044628895819187164, 0.9864000082015991]
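I have no idea how to fix this. Thanks in advance for any help.
Based on the information given, I'm not entirely sure what is causing the problem.
A few things I can think of: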
Your cuDNN and TensorFlow versions may be incompatible. Check the compatibility matrix: https://www.tensorflow.org/install/source
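The training log in the question already reports "Loaded cuDNN version 8906", so that number can be compared against the matrix directly. A minimal sketch, assuming the cuDNN 8.x integer encoding (major*1000 + minor*100 + patch); the helper names decode_cudnn_version and tensorflow_build_versions are made up for illustration. The second helper uses tf.sysconfig.get_build_info(), which reports the CUDA and cuDNN versions the installed TensorFlow wheel was built against.

```python
def decode_cudnn_version(code: int) -> str:
    """Decode a cuDNN 8.x version integer (assumed major*1000 + minor*100 + patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def tensorflow_build_versions():
    """Return (TF version, CUDA version, cuDNN version) of the installed build.

    Returns None when TensorFlow is not importable; the CUDA/cuDNN entries
    are None on CPU-only builds, which do not report them.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    return tf.__version__, info.get("cuda_version"), info.get("cudnn_version")


# 8906 is the value from the "Loaded cuDNN version" log line in the question.
print(decode_cudnn_version(8906))
```

Calling tensorflow_build_versions() in the notebook then gives the versions to look up in the matrix. If the loaded cuDNN matches what the wheel was built with, an incompatibility is unlikely to be the cause.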
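Enable NUMA support. A kernel without NUMA support may slow down machine learning workloads, because memory accesses cannot be kept local to the node that uses them; a NUMA-aware kernel handles this efficiently.
Check for NUMA support with: cat /boot/config-$(uname -r) | grep CONFIG_NUMA and, if it is not enabled, recompile the kernel with it.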
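The same check can be run from inside the notebook. A rough sketch, assuming a Linux environment: it reads the sysfs file named in the "could not open file to read NUMA node" log lines and greps the kernel config the way the command above does. The function names are hypothetical, and /boot/config-* may simply not exist inside a container.

```python
from pathlib import Path
import platform


def read_numa_node(pci_address: str, sysfs_root: str = "/sys/bus/pci/devices"):
    """Return the NUMA node sysfs reports for a PCI device.

    Returns None when the file is missing or unreadable, which is the
    situation the TensorFlow log lines in the question describe.
    """
    node_file = Path(sysfs_root) / pci_address / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        return None


def kernel_numa_options(config_dir: str = "/boot"):
    """Return every CONFIG_NUMA line from the running kernel's config file.

    Mirrors `cat /boot/config-$(uname -r) | grep CONFIG_NUMA`; returns None
    when the config file is not available.
    """
    config = Path(config_dir) / f"config-{platform.release()}"
    try:
        lines = config.read_text().splitlines()
    except OSError:
        return None
    return [line for line in lines if "CONFIG_NUMA" in line]


# The PCI address is the one from the log lines in the question.
print(read_numa_node("0000:05:00.0"))
print(kernel_numa_options())
```

A result of None from read_numa_node reproduces what TensorFlow logs; a value of -1 means the file exists but the device has no NUMA affinity.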
In some cases you can work around the lack of native NUMA support with NUMA emulation: numactl --interleave=all your_application
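If the workload is started from a script rather than a shell, the numactl prefix can be assembled programmatically. This is only a sketch: train.py is a placeholder script name, the helper names are made up, and numactl must be installed in the container for the launch to work. Note that --interleave only sets a memory allocation policy; whether it changes training speed in this setup is untested.

```python
import shutil
import subprocess


def interleaved_command(app_argv, nodes="all"):
    """Prefix a command with numactl so its memory is interleaved across nodes."""
    return ["numactl", f"--interleave={nodes}", *app_argv]


def run_interleaved(app_argv):
    """Run the command under numactl, or return None if numactl is not installed."""
    if shutil.which("numactl") is None:
        return None
    return subprocess.run(interleaved_command(app_argv), check=False).returncode


# train.py is a placeholder for whatever script starts the training.
print(" ".join(interleaved_command(["python", "train.py"])))
```

The print shows the command line that would be executed; run_interleaved(["python", "train.py"]) launches it only when numactl is found on PATH.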