Error: "However this machine only has: ['/cpu:0']", even though 2 GPUs are detected


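I built a rig with 2 Titan Xp cards and followed the multi-GPU training example at https://github.com/awslabs/keras-apache-mxnet/wiki/Multi-GPU-Model-Training-with-Keras-MXNet. I changed only two pieces of the code: gpus=4 in the model section and batchsize=32*2 in the batch-size section.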

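I get this strange error: the first part of the output actually lists my GPUs (compute capability and so on), but the final part of the error only recognizes my CPU: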

2019-11-19 10:43:32.935282: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-19 10:43:32.940953: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-11-19 10:43:33.115668: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-19 10:43:33.116756: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x27557f0 executing computations on platform CUDA. Devices:
2019-11-19 10:43:33.116793: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): TITAN Xp COLLECTORS EDITION, Compute Capability 6.1
2019-11-19 10:43:33.116799: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): TITAN Xp COLLECTORS EDITION, Compute Capability 6.1
2019-11-19 10:43:33.135701: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3500025000 Hz
2019-11-19 10:43:33.137115: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x277ba60 executing computations on platform Host. Devices:
2019-11-19 10:43:33.137144: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-19 10:43:33.139168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: TITAN Xp COLLECTORS EDITION major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0
2019-11-19 10:43:33.139381: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-19 10:43:33.140815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: TITAN Xp COLLECTORS EDITION major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:41:00.0
2019-11-19 10:43:33.141201: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141268: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141330: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141389: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141452: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.141512: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-11-19 10:43:33.207406: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-11-19 10:43:33.207452: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-11-19 10:43:33.207550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-19 10:43:33.207568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 
2019-11-19 10:43:33.207578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y 
2019-11-19 10:43:33.207584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N 
2019-11-19 10:43:33.229007: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  
 If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
  File "multi-gpu.py", line 42, in <module>
    model = keras.utils.multi_gpu_model(model, gpus=2)
  File "/home/gormosity/.local/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 184, in multi_gpu_model
    available_devices))
ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0']. Try reducing `gpus`.
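
The "Could not dlopen library" lines in the log can be reproduced outside TensorFlow. The following is a minimal sketch, not TensorFlow's own check: it asks the dynamic loader, through the standard-library ctypes module, whether each library named in the log can be opened. The helper name missing_libraries is made up for this illustration, and the list of library names is copied from the log above.

```python
import ctypes

# Library names copied from the "Could not dlopen library" lines in the log.
CUDA_10_0_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
]


def missing_libraries(names):
    """Return the names that the dynamic loader cannot open in this process.

    Hypothetical helper for illustration; it only mimics the dlopen step.
    """
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError when the library is not found
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_libraries(CUDA_10_0_LIBS):
        print("cannot load:", name)
```

If the script prints the same six names, the CUDA 10.0 runtime libraries are not on the loader's search path, which would be consistent with the "Skipping registering GPU devices" warning even though the driver itself sees both cards.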

nvidia-smi

| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp COLLEC...  On   | 00000000:0A:00.0 Off |                  N/A |
| 23%   24C    P8    10W / 250W |    157MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp COLLEC...  On   | 00000000:41:00.0  On |                  N/A |
| 23%   36C    P5    27W / 250W |    460MiB / 12192MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3860      C   python3                                      145MiB |
|    1      1253      G   /usr/lib/xorg/Xorg                            18MiB |
|    1      1282      G   /usr/bin/gnome-shell                          51MiB |
|    1      1650      G   /usr/lib/xorg/Xorg                           116MiB |
|    1      1781      G   /usr/bin/gnome-shell                         124MiB |
|    1      3860      C   python3                                      145MiB |
+-----------------------------------------------------------------------------+
python keras gpu-programming
1 Answer

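Your error message shows TensorFlow as the backend. CUDA 10.1 may have compatibility issues; if you did not compile TensorFlow yourself, that may be the problem here, since the log shows it failing to open the CUDA 10.0 libraries (libcudart.so.10.0 and others) and then skipping GPU registration. You may also need to install mxnet-cu101, provided you actually want MXNet as the backend; if you do not, keras-mxnet is not the right package to use. You can try changing the backend to the mxnet backend.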
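
To see which backend Keras would pick before importing it, something like the following sketch may help. It assumes the lookup order used by upstream multi-backend Keras: the KERAS_BACKEND environment variable first, then the "backend" field of keras.json in the Keras home directory (KERAS_HOME, or ~/.keras), then a default of "tensorflow". The function name configured_keras_backend is invented for this example, and the keras-mxnet fork may differ in its default.

```python
import json
import os


def configured_keras_backend(environ=None, keras_home=None):
    """Guess the backend multi-backend Keras would select, without importing Keras.

    Hypothetical helper; the lookup order is an assumption described above.
    """
    environ = os.environ if environ is None else environ
    if keras_home is None:
        keras_home = environ.get("KERAS_HOME") or os.path.join(
            os.path.expanduser("~"), ".keras"
        )

    backend = "tensorflow"  # assumed default when nothing is configured
    config_path = os.path.join(keras_home, "keras.json")
    try:
        with open(config_path) as fh:
            backend = json.load(fh).get("backend", backend)
    except (OSError, ValueError):
        # Missing or unreadable config: keep the assumed default.
        pass

    # A non-empty environment variable takes precedence over the file.
    return environ.get("KERAS_BACKEND") or backend


if __name__ == "__main__":
    print("Keras backend:", configured_keras_backend())
```

If this reports tensorflow, one way to try the suggestion is to run the script with the environment variable set, for example KERAS_BACKEND=mxnet python multi-gpu.py, or to edit the "backend" field in keras.json.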
