TPU connection problem when training a TF model in Google Colab

Question

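I have built a TensorFlow neural network model that works on CPU and GPU. Because the dataset is large, I am now trying to train the model on a TPU. I initialized the TPU strategy in the usual way: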

tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # Automatically detects the TPU
tf.config.experimental_connect_to_cluster(tpu_resolver)  # Connects to the TPU cluster
tf.tpu.experimental.initialize_tpu_system(tpu_resolver)  # Initializes the TPU system
strategy = tf.distribute.TPUStrategy(tpu_resolver)
tpu_device = tpu_resolver.master()  # Retrieves the TPU device URI
print("Running on TPU:", tpu_device)

This prints the following:

Running on TPU: grpc://10.74.203.82:8470

However, when I train the model under strategy.scope(), I get the following error and training stops.

err: File "/content/SeniorHonoursProject/BaCoN-II/train.py", line 169, in my_train
err: new_history = model.fit(train_dataset.dataset, epochs=epochs,
err: File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
err: raise e.with_traceback(filtered_tb) from None
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 362, in _numpy
err: raise core._status_to_exception(e) from None  # pylint: disable=protected-access
err: tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
err: (0) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[TPUReplicate/_compile/_9902494219978988908/_4/_384]]
err: (1) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_3/_250]]
err: (2) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_15/_466]]
err: (3) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}] ... [truncated]
err: Exception ignored in atexit callback: <function async_wait at 0x7ec0f4a74790>
err: Traceback (most recent call last):
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 2833, in async_wait
err: context().sync_executors()
err: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py", line 749, in sync_executors
err: pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
err: tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
err: (0) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[TPUReplicate/_compile/_9902494219978988908/_4/_384]]
err: (1) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_3/_250]]
err: (2) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}]]
err: Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
err: [[RemoteCall]]
err: [[IteratorGetNextAsOptional]]
err: [[Pad_15/_466]]
err: (3) INTERNAL: {{function_node __inference_train_function_11526}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused
err: Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
err: :UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:50188: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2024-03-22T23:36:04.934268533+00:00"}
err: [[{{node MultiDeviceIteratorGetNextFromShard}}] ... [truncated]
out: g1D)

Does anyone have an idea how to solve this? Here is how I initialize the model and run training:

with strategy.scope():
  n_batches_eff = training_dataset.n_batches // strategy.num_replicas_in_sync
  lr_fn = tf.optimizers.schedules.ExponentialDecay(FLAGS.lr, n_batches_eff, FLAGS.decay)
  optimizer = tf.keras.optimizers.Adam(lr_fn)

with strategy.scope():
    model = make_model()  # custom model-building function; arguments omitted
    if FLAGS.bayesian:
        loss = BayesianLoss(
            n_train_examples=training_dataset.n_batches * training_dataset.batch_size,
            n_val_examples=validation_dataset.n_batches * validation_dataset.batch_size,
            TPU=FLAGS.TPU)
        loss.set_model(model)
    else:
        if FLAGS.TPU:
            loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
        else:
            loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])




with strategy.scope():
    val_steps_per_epoch = val_dataset.n_batches // strategy.num_replicas_in_sync
    train_steps_per_epoch = train_dataset.n_batches // strategy.num_replicas_in_sync
    new_history = model.fit(train_dataset.dataset, epochs=epochs,
                            validation_data=val_dataset.dataset,
                            callbacks=[callback], steps_per_epoch=train_steps_per_epoch,
                            validation_steps=val_steps_per_epoch, initial_epoch=last_epoch)
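For reference, the step counts in the snippet above can be checked with plain integer arithmetic. This is only a sketch: it assumes that n_batches counts per-replica batches (examples // batch_size), which the question does not state, and the helper name and the numbers are hypothetical.

```python
def steps_per_epoch(n_examples, per_replica_batch, num_replicas):
    """Number of global steps per epoch under a synchronous strategy.

    Assumption: every step consumes one global batch, i.e.
    per_replica_batch * num_replicas examples.
    """
    n_batches = n_examples // per_replica_batch  # per-replica batch count
    return n_batches // num_replicas             # same as n_examples // global batch


# Hypothetical numbers: 10,000 examples, batch size 32, 8 TPU cores.
steps = steps_per_epoch(10_000, 32, 8)
global_batch = 32 * 8
print(steps, 10_000 // global_batch)
```

Both printed values agree, so dividing n_batches by the replica count matches batching the dataset with the global batch size.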

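My data pipeline is fairly complex, but I believe all dataset creation should happen on the CPU. I then cache the dataset in memory so that the TPU can access it: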

with self.strategy.scope():
    if self.shuffle:
        dataset = dataset.shuffle(buffer_size=len(list_IDs))
    dataset = dataset.cache()  # cache() returns a new dataset, so the result must be assigned
    global_batchsize = self.batch_size * self.strategy.num_replicas_in_sync
    global_batchsize = tf.cast(global_batchsize, dtype=tf.int64)
    dataset = dataset.batch(global_batchsize)
    dataset = dataset.map(self.normalize_and_onehot, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    dataset = self.strategy.experimental_distribute_dataset(dataset)

Answer 1

The problem was that the dataset was being built from a generator function. After switching to from_tensor_slices, the problem went away.
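As background, and not as a confirmed diagnosis of this particular traceback: tf.data.Dataset.from_generator is documented as running the generator through tf.numpy_function, so the dataset and iterator ops have to live in the same process as the Python program. That would be consistent with a remote TPU worker failing to reach 127.0.0.1. A minimal sketch of the workaround follows, assuming the data fits in host memory; example_generator, materialize, and make_dataset are hypothetical names that do not appear in the question.

```python
import numpy as np


def example_generator(n_samples=8, n_features=4):
    """Hypothetical stand-in for the original generator function."""
    for i in range(n_samples):
        yield np.full((n_features,), i, dtype=np.float32), i % 2


def materialize(generator):
    """Run the generator once on the host and stack the samples into arrays."""
    features, labels = zip(*generator)
    return np.stack(features), np.asarray(labels, dtype=np.int64)


def make_dataset(features, labels, global_batch_size):
    """Build a tf.data pipeline from in-memory arrays (no Python callback)."""
    import tensorflow as tf  # imported here so the NumPy part runs on its own

    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.cache()
    dataset = dataset.shuffle(buffer_size=len(features))
    dataset = dataset.batch(global_batch_size, drop_remainder=True)
    return dataset.prefetch(tf.data.AUTOTUNE)


features, labels = materialize(example_generator())
print(features.shape, labels.shape)
```

Calling make_dataset(features, labels, global_batch_size) then yields a dataset that can be passed to model.fit or to strategy.experimental_distribute_dataset. The trade-off is memory: every sample is held in host RAM, so for data that does not fit, a file-based format such as TFRecord may be the better route.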
