TensorFlow 2.1 with TPUEstimator: RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor

Question (votes: 2, answers: 1)

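I just converted an existing project that uses the TPUEstimator API from TF 1.14 to TF 2.1. After the conversion, a local test run (i.e. use_tpu=False) succeeded. However, when I run it on a Google Cloud TPU (i.e. use_tpu=True), I get the error shown below.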

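Note: this is in the context of the AdaNet AutoML framework (v0.8.0), although I suspect it may be a general TPUEstimator-related error, since the errors appear to originate in the tpu_estimator.py and error_handling.py scripts seen in the traceback below.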

  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3032, in train
    rendezvous.record_error('training_loop', sys.exc_info())
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 81, in record_error
    if value and value.op and value.op.type == _CHECK_NUMERIC_OP_NAME:
  AttributeError: 'RuntimeError' object has no attribute 'op'

  During handling of the above exception, another exception occurred:  

  File "workspace/trainer/train.py", line 331, in <module>
    main(args=parsed_args)
  File "workspace/trainer/train.py", line 177, in main
    run_config=run_config)
  File "workspace/trainer/train.py", line 68, in run_experiment
    estimator.train(input_fn=train_input_fn, max_steps=total_train_steps)
  File "/usr/local/lib/python3.6/site-packages/adanet/core/estimator.py", line 853, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3186, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2226, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:1", shape=(), dtype=int64, device=/job:tpu_worker/task:0/device:CPU:0)'

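The previous version of the project, on TF 1.14, ran without problems both locally and on TPU using TPUEstimator. Am I missing something obvious in converting to TF 2.1 while still using the TPUEstimator API?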

python-3.x tensorflow2.0 tensorflow-estimator tpu
1 Answer
0 votes

Have you applied the following?

dataset = ...
# tf.contrib was removed in TF 2.x; Dataset.batch has a drop_remainder flag instead.
dataset = dataset.batch(batch_size, drop_remainder=True)

This may drop the last few samples from the file so that every batch has a static batch_size shape, which is required when training on TPU.
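To make the effect of dropping the remainder concrete, here is a minimal plain-Python sketch. It is not TensorFlow code: the `batch` helper is a hypothetical stand-in that only imitates the semantics of the `drop_remainder` option of `tf.data.Dataset.batch`, and the sample count and batch size are arbitrary.

```python
from typing import Iterator, List, Sequence


def batch(samples: Sequence[int], batch_size: int,
          drop_remainder: bool) -> Iterator[List[int]]:
    """Yield consecutive batches; optionally skip a trailing partial batch.

    Hypothetical helper that imitates drop_remainder semantics in plain Python.
    """
    for start in range(0, len(samples), batch_size):
        chunk = list(samples[start:start + batch_size])
        if drop_remainder and len(chunk) < batch_size:
            return  # discard the short final batch
        yield chunk


samples = list(range(10))  # 10 samples, not divisible by 4

# Without dropping, the last batch is smaller, so the batch dimension varies.
ragged = [len(b) for b in batch(samples, 4, drop_remainder=False)]

# With dropping, every batch has exactly batch_size elements.
static = [len(b) for b in batch(samples, 4, drop_remainder=True)]

print(ragged)  # [4, 4, 2]
print(static)  # [4, 4]
```

With the remainder dropped, every batch has the same length, which is what lets the batch dimension be known statically; the cost is that the two trailing samples are never seen in that pass.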
