I want to use tf.contrib.distribute.MirroredStrategy() on my multi-GPU system, but it does not use the GPUs for training (see the output below). I am running tensorflow-gpu 1.12.
I also tried specifying the GPUs directly in MirroredStrategy, but the same problem occurred.
model = models.Model(inputs=input, outputs=y_output)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
model.compile(loss=lossFunc, optimizer=optimizer)

NUM_GPUS = 2
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model,
                                                  config=config)
This is the output I get:
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:1
WARNING:tensorflow:Not all devices in DistributionStrategy are visible to TensorFlow session.
The expected result, obviously, is training running on the multi-GPU system. Are these known issues?
I have been facing a similar problem: MirroredStrategy failing on tensorflow 1.13.1, with 2x RTX 2080, running an Estimator.
The failure seems to be in the NCCL all_reduce method (error message: no OpKernel registered for NCCL AllReduce).
I got it running by switching from NCCL to hierarchical_copy, which means using the contrib cross_device_ops methods as follows:
Failing command:
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
Successful command:
mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy"))
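For context on what this switch changes: all-reduce sums each gradient tensor across replicas and hands the identical sum back to every replica; NCCL does this with GPU-to-GPU kernels, while hierarchical_copy routes the reduction through host/device copies instead. A minimal pure-Python sketch of the all-reduce contract (illustrative only, not the TF implementation):

```python
def all_reduce_sum(per_replica_grads):
    """Sum corresponding gradient entries across replicas and
    return the identical summed result to every replica."""
    num_replicas = len(per_replica_grads)
    # Element-wise sum across replicas (each replica holds one gradient list).
    summed = [sum(vals) for vals in zip(*per_replica_grads)]
    # Every replica receives the same reduced gradients.
    return [list(summed) for _ in range(num_replicas)]

# Two replicas (e.g. /gpu:0 and /gpu:1), each holding gradients
# for three variables after its local backward pass.
grads_gpu0 = [0.1, 0.2, 0.3]
grads_gpu1 = [0.3, 0.0, 0.1]
reduced = all_reduce_sum([grads_gpu0, grads_gpu1])
# Both replicas now hold the same summed gradients.
```

Whether the reduction runs over NCCL or via hierarchical host copies only changes how these sums are computed and moved between devices, which is why swapping cross_device_ops works around the missing NCCL OpKernel without changing training semantics.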