Unexpected TensorFlow ResourceExhaustedError when using model.predict() with a Keras Sequential model

Question

I am using Python 3.9, and I have installed Tensorflow 2.10 together with CUDA Toolkit 11.2 and cuDNN 8.2, as this is the last configuration natively supported on Windows 10.

I am training on an NVIDIA GeForce RTX 2070 SUPER with 8 GB of VRAM, and my PC has 64 GB of RAM.

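I created a Sequential model with Keras to predict POS tags. I use the same model architecture to train models on text in several different languages. The models all train fine, and they all produce a score when I run model.evaluate(test_data). Likewise, when I run model.predict(test_data), most of the models produce the expected results, but the model for one language behaves differently.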

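This model was trained in the same way as all the others, so I would not expect any difference. When I run model.predict(test_data) with this model, it seems to work at first. It starts applying the model to the dataset: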

  6/152 [=>............................] - ETA: 19s

It even appears to complete this step successfully, although it never produces any results:

152/152 [==============================] - 20s 126ms/step

Unfortunately, at this point it hangs and produces the following traceback:

2024-01-05 23:08:38.977923: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.61GiB (rounded to 2804106240)requested by op ConcatV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
...
...
...
2024-01-05 23:08:38.998922: I tensorflow/core/common_runtime/bfc_allocator.cc:1101] Sum Total of in-use chunks: 4.04GiB
2024-01-05 23:08:38.998977: I tensorflow/core/common_runtime/bfc_allocator.cc:1103] total_region_allocated_bytes_: 6263144448 memory_limit_: 6263144448 available bytes: 0 curr_region_allocation_bytes_: 8589934592
2024-01-05 23:08:38.999071: I tensorflow/core/common_runtime/bfc_allocator.cc:1109] Stats: 
Limit:                      6263144448
InUse:                      4335309312
MaxInUse:                   4520417536
NumAllocs:                        1293
MaxAllocSize:                536870912
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2024-01-05 23:08:38.999241: W tensorflow/core/common_runtime/bfc_allocator.cc:491] ****************x*****************************************************______________________________
2024-01-05 23:08:38.999336: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at concat_op.cc:158 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[38688,18120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:\Users\admd9\PycharmProjects\codalab-sigtyp2024\generate_results.py", line 131, in <module>
    predictions = task_model.predict(test_gen)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\tensorflow\python\framework\ops.py", line 7209, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node __wrapped__ConcatV2_N_152_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[38688,18120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat

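I cannot work out why this happens only with this one model, or why memory allocation is a problem here when it works for all the other models. It also does not look as though it is trying to use a large amount of memory. So why am I getting this error message, and how can I fix it?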
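The figures in the traceback can be checked with a little arithmetic. The sketch below only uses numbers copied from the log, and assumes that "type float" means 32-bit floats (4 bytes per element). It compares the size of the tensor that ConcatV2 asked for against the memory the allocator still had free. The op name ConcatV2_N_152 suggests, though the log does not state it outright, that the outputs of all 152 prediction steps were being joined into a single tensor on the GPU after the progress bar finished.

```python
# Values copied from the allocator log above.
rows, cols = 38688, 18120       # shape of the tensor ConcatV2 tried to allocate
bytes_per_float32 = 4           # assumption: "type float" is a 32-bit float
limit_bytes = 6263144448        # "Limit" in the allocator stats
in_use_bytes = 4335309312       # "InUse" in the allocator stats

requested_bytes = rows * cols * bytes_per_float32
free_bytes = limit_bytes - in_use_bytes

print(requested_bytes)                     # 2804106240, the figure in the log
print(round(requested_bytes / 2**30, 2))   # 2.61 (GiB)
print(requested_bytes > free_bytes)        # True: the request exceeds what is left
```

Under that reading, the single concatenated result is larger than the remaining GPU memory, which would explain why the error appears only after the last step rather than during the per-batch forward passes.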

I have tried enabling memory growth, but it did not help:

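import tensorflow as tf
# Ask TensorFlow to allocate GPU memory on demand instead of reserving it up front
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)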

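I also reduced the batch size, which did not help. I even went back and retrained the model in case something was wrong with the model itself. The new model still has the same problem. As a last resort, I tried splitting the test set into smaller parts, running model.predict(test_data) on each part, and then recombining the results. It sometimes predicts the first part successfully, but it always runs out of memory and gives me the same error on the second part.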

Is there anything I can do?

Answer 1

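In case this is useful to anyone else with the same problem, here is how I got around it. If prediction on the GPU fails with an error, I fall back to the CPU. That way GPU memory is not exhausted, because system memory is used instead.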

import tensorflow as tf

# Make predictions with model
# Try to make predictions using GPU first
try:
    with tf.device('/gpu:0'):
        predictions = model.predict(test_data)
# If the GPU prediction fails (e.g. due to a memory error), attempt prediction on the CPU instead
except Exception:  # a bare except would also swallow KeyboardInterrupt
    with tf.device('/cpu:0'):
        predictions = model.predict(test_data)

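This works for me, but using the CPU can be much slower, especially when the test set is large. I also do not necessarily want to handle every possible error by switching to the CPU. I would like the except statement to be more specific, but the error I am getting appears to be a TensorFlow-specific error: tensorflow.python.framework.errors_impl.ResourceExhaustedError. I do not know how to catch it.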
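For what it is worth, the exception class named in that traceback is exposed publicly as tf.errors.ResourceExhaustedError, so the fallback can be narrowed to out-of-memory errors only. The following is a minimal sketch of that idea rather than a tested fix for this setup. The helper name predict_with_cpu_fallback is hypothetical, and model and data stand in for the model and test data above.

```python
import tensorflow as tf


def predict_with_cpu_fallback(model, data):
    """Predict on the default device; retry on the CPU only after an OOM error."""
    try:
        # With a visible GPU, TensorFlow places the work there by default.
        return model.predict(data)
    except tf.errors.ResourceExhaustedError:
        # Only out-of-memory errors trigger the fallback.
        # Any other exception propagates to the caller unchanged.
        with tf.device('/CPU:0'):
            return model.predict(data)
```

It would be called as predictions = predict_with_cpu_fallback(task_model, test_gen). One caveat: if data is a one-shot Python generator, it will already be partly consumed by the failed attempt and would need to be recreated before the retry.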

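I will not accept this answer, because I do not think it is a very good solution. If anyone can offer a better solution that does not require the CPU, I would still appreciate it, but this is the best workaround I have come up with.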
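One alternative that keeps the forward pass on the GPU might be to predict one batch at a time and collect the outputs in system RAM, so that the final concatenation never runs as a GPU op. This is an untested sketch for this particular setup, not a confirmed solution. The helper name predict_in_batches is hypothetical, batches is assumed to be an iterable that yields input arrays only (no labels), and the model is assumed to have a single output, for which Keras documents predict_on_batch as returning a NumPy array.

```python
import numpy as np


def predict_in_batches(model, batches):
    """Run the model one batch at a time and collect the outputs in host memory."""
    outputs = []
    for batch in batches:
        # Each per-batch result is converted to a NumPy array, i.e. it lives
        # in system RAM rather than GPU memory once it is appended.
        outputs.append(np.asarray(model.predict_on_batch(batch)))
    # The join happens in NumPy on the CPU, not as a ConcatV2 op on the GPU.
    return np.concatenate(outputs, axis=0)
```

The combined result still has to fit in host memory. For a float32 array of shape [38688, 18120], as in the log in the question, that is about 2.61 GiB, plus a similar amount for the per-batch pieces while they are being joined.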
