Memory leak in TensorFlow


I am creating and discarding a large number of neural network models in a loop. Somehow the discarded models accumulate in memory and eventually cause an out-of-memory crash.

The command

tf.keras.backend.clear_session()
is supposed to avoid clutter from old models (see the documentation). However, it does not work for me.

TensorFlow version: 2.8.0, Keras version: 2.8.0

Minimal example to reproduce:

import tensorflow as tf
from tensorflow import keras

# Use GPU
physical_devices = tf.config.list_physical_devices("GPU")
print("physical devices: ", physical_devices)
# Don't crash if something else is also using the GPU
tf.config.experimental.set_memory_growth(physical_devices[0], True)

def create_nn_model():
    """initialize and return a nn model"""

    Ndim = 100
    N_nodes_L1 = 1000
    N_nodes_L2 = 5000

    # construct model
    x_input = keras.Input(shape=[Ndim])
    L1 = keras.layers.Dense(N_nodes_L1, input_shape=[Ndim],
                            activation="swish")(x_input)
    L2 = keras.layers.Dense(N_nodes_L2, input_shape=[N_nodes_L1],
                            activation="swish")(L1)
    output = keras.layers.Dense(1, input_shape=[N_nodes_L2],
                                activation="linear")(L2)
    model = keras.Model(inputs=[x_input],
                        outputs=[output])
    # plot model (requires pydot and graphviz)
    keras.utils.plot_model(model, "model.png", show_shapes=True)
    return model


for ii in range(1_000):

    print(f"Training model {ii+1} of 1,000")
    nn_model = create_nn_model()

    tf.keras.backend.clear_session()

Error message:

2023-06-21 18:35:33.887623: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 19.07MiB (rounded to 20000000)requested by op AddV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
2023-06-21 18:35:33.887904: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] BFCAllocator dump for GPU_0_bfc
2023-06-21 18:35:33.888881: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (256):       Total Chunks: 66, Chunks in use: 56. 16.5KiB allocated for chunks. 14.0KiB in use in bin. 228B client-requested in use in bin.
2023-06-21 18:35:33.889243: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (512):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-06-21 18:35:33.889709: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1024):      Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2023-06-21 18:35:33.889982: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2048):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-06-21 18:35:33.890294: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4096):      Total Chunks: 199, Chunks in use: 198. 920.0KiB allocated for chunks. 912.2KiB in use in bin. 773.4KiB client-requested in use in bin.
2023-06-21 18:35:33.890487: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8192):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

...

2023-06-21 18:35:34.110374: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 33554432 totalling 64.00MiB
2023-06-21 18:35:34.110547: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 33741824 totalling 32.18MiB
2023-06-21 18:35:34.110718: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 34217728 totalling 32.63MiB
2023-06-21 18:35:34.111092: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 36870912 totalling 35.16MiB
2023-06-21 18:35:34.111263: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 3.87GiB
2023-06-21 18:35:34.111450: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 4162256896 memory_limit_: 4162256896 available bytes: 0 curr_region_allocation_bytes_: 4294967296
2023-06-21 18:35:34.111627: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit:                      4162256896
InUse:                      4160154112
MaxInUse:                   4160154368
NumAllocs:                        2972
MaxAllocSize:                 36870912
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2023-06-21 18:35:34.111869: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ****************************************************************************************************
2023-06-21 18:35:34.112024: W tensorflow/core/framework/op_kernel.cc:1733] RESOURCE_EXHAUSTED: failed to allocate memory

Apparently multiprocessing could serve as a possible workaround. However, it looks complicated, and I would prefer a simpler solution if possible.
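For reference, the kind of multiprocessing workaround I have in mind would look roughly like this (a rough, untested sketch that reuses create_nn_model from the example above):

import multiprocessing as mp

def build_one_model(ii):
    # Assumes create_nn_model() from the minimal example above is defined
    # in this module; the child process re-imports the module under "spawn".
    model = create_nn_model()
    # ... compile / train / evaluate here ...
    print(f"Finished model {ii + 1}")

if __name__ == "__main__":
    # "spawn" gives each child a fresh interpreter and a fresh CUDA context;
    # the GPU memory is released when the child process exits.
    ctx = mp.get_context("spawn")
    for ii in range(1_000):
        p = ctx.Process(target=build_one_model, args=(ii,))
        p.start()
        p.join()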

keras memory-leaks out-of-memory tensorflow2.0
1 Answer

The memory leak has been a known issue on GitHub since July 2021, which is two years now.

Possible solutions:

  1. Wait for the issue to be fixed. It has been partially, but not completely, fixed in TensorFlow 2.12.
  2. Downgrade to TensorFlow 2.5, which does not have this issue.
  3. Periodically save everything, restart the program, load everything back, and resume training.
  4. Switch to PyTorch.
  5. Switch to Julia and use Flux.

Edit: As an example of solution 3, create an external script. This script calls your leaking program (referred to below as main_leaky_script.py) and automatically restarts it whenever it crashes, with no manual intervention needed. Launch the external script from an environment that has TensorFlow installed:

import os
import time

exit_code = 1   # non-zero so the loop starts
i = 0
MAX_RUNS = 100

# Relaunch the leaky script until it exits cleanly (exit code 0) or until
# MAX_RUNS is reached; os.system returns the exit status of the command.
while exit_code and i < MAX_RUNS:
    i += 1
    print(f"Run {i} of at most {MAX_RUNS}")
    exit_code = os.system("python3 main_leaky_script.py")
    print("Exit code: ", exit_code)
    time.sleep(1)

print(f"Ended on run number {i} ({i + bool(exit_code) - 1} crashes)")

At the beginning of main_leaky_script.py, try to load the model, and create it if it does not exist yet:

import tensorflow as tf

try:
    # Resume from the last saved model if one exists
    model = tf.keras.models.load_model("model.keras")
except (OSError, ValueError):
    # No saved model yet: initialize it from scratch
    model = create_nn_model()

# Compile the model, then do the training inside a loop, stopping and saving
# your model every few epochs. To save, use the command:
model.save("model.keras")
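For instance, the training part of main_leaky_script.py could be written in chunks, saving after each one. This is only a sketch: x_train, y_train, N_CHUNKS, and EPOCHS_PER_CHUNK are placeholders, not values from the question.

import numpy as np

# Placeholder data so the sketch runs; replace with your real training set
x_train = np.random.rand(256, 100).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

model.compile(optimizer="adam", loss="mse")

N_CHUNKS = 50          # hypothetical number of save points
EPOCHS_PER_CHUNK = 5   # hypothetical epochs between saves

for chunk in range(N_CHUNKS):
    model.fit(x_train, y_train, epochs=EPOCHS_PER_CHUNK, batch_size=32)
    # Save after every chunk so a crash loses at most one chunk of progress
    model.save("model.keras")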

Setting up checkpoints and saving/loading only the weights, rather than the whole model, may work even better. See the official documentation for more information.
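A weights-only checkpoint setup could look roughly like the following (a sketch; the file path, save frequency, and the commented usage lines are just examples):

import tensorflow as tf

# Sketch: write only the weights to disk via a callback during training.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="weights.ckpt",   # example path (TensorFlow checkpoint format)
    save_weights_only=True,
    save_freq="epoch",         # save after every epoch
)

# model.fit(x_train, y_train, epochs=5, callbacks=[checkpoint_cb])

# After a restart, rebuild the architecture first, then restore the weights:
# model = create_nn_model()
# model.load_weights("weights.ckpt")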
