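Introduction:
I am currently developing an application that uses PyTorch, and I have run into some interesting behavior related to memory management. Specifically, when I load a model and move it from the CPU to the GPU, only part of the model is transferred to the GPU (which seems normal). However, when I move the model from the GPU back to the CPU, the entire model size comes back as well, which increases RAM usage. Even explicitly calling the garbage collector or using torch functions to free memory does not seem to release the RAM (only the GPU memory is freed).
Reproducing the problem: The following code snippet demonstrates the issue: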
import gc
import torch
import torch.nn as nn
from memory_profiler import profile

INT_ITERATION = 5

class LargeNet(nn.Module):
    def __init__(self):
        super(LargeNet, self).__init__()
        self.fc1 = nn.Linear(10000, 5000)
        self.fc2 = nn.Linear(5000, 1000)
        self.fc3 = nn.Linear(1000, 500)
        self.fc4 = nn.Linear(500, 100)
        self.fc5 = nn.Linear(100, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = self.fc5(x)
        return x


@profile
def run_test():
    # Create the network and move it to the GPU
    model = LargeNet()
    model = model.to('cuda')

    model = model.to('cpu')
    del model

    gc.collect()
    torch.cuda.empty_cache()


if __name__ == "__main__":
    print("PyTorch version:", torch.__version__)
    if torch.cuda.is_available():
        for i in range(INT_ITERATION):
            print(f'******* Iteration num: {i+1} *********** \n')
            run_test()
        input("Press Enter to continue...")
    else:
        print('CUDA is not available')
To run the code and reproduce the problem, you need the torch and memory_profiler packages installed in your Python environment.
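If you want to cross-check the memory_profiler numbers without any extra dependency, a minimal sketch like the one below reads the process's resident set size directly. It assumes Linux (it relies on /proc/self/statm), and the helper names rss_mib and report_delta are made up for illustration; they are not part of torch or memory_profiler.

```python
import os


def rss_mib() -> float:
    """Return the current resident set size of this process in MiB (Linux only)."""
    # The second field of /proc/self/statm is the number of resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


def report_delta(label, step):
    """Run step() and print how much the RSS changed across the call."""
    before = rss_mib()
    result = step()
    after = rss_mib()
    print(f"{label}: {before:.1f} -> {after:.1f} MiB ({after - before:+.1f})")
    return result


if __name__ == "__main__":
    # Stand-in workload: build a 64 MiB bytes object.
    buf = report_delta("allocate", lambda: b"\x01" * (64 * 2**20))
```

In the reproduction script, the same helper could wrap each step, for example report_delta("to cuda", lambda: model.to('cuda')), to see whether the per-step changes agree with what memory_profiler reports.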
Output and observations: On my Ubuntu 20.04 machine with Torch 2.2.2 and CUDA 12.1 (I ran into the same problem on a Windows PC with Torch 2.1.0 and CUDA 12.1), I observe the following behavior:
******* Iteration num: 1 ***********
Filename: test_torch_memory_leak.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    26    332.7 MiB    332.7 MiB           1   @profile
    27                                         def run_test():
    28                                             # Create the network and move it to the GPU
    29    546.9 MiB    214.1 MiB           1       model = LargeNet()
    30    451.2 MiB    -95.6 MiB           1       model = model.to('cuda')
    31
    32    662.9 MiB    211.7 MiB           1       model = model.to('cpu')
    33    472.4 MiB   -190.5 MiB           1       del model
    34
    35    472.4 MiB      0.0 MiB           1       gc.collect()
    36    472.4 MiB      0.0 MiB           1       torch.cuda.empty_cache()
******* Iteration num: 2 ***********
Filename: test_torch_memory_leak.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    26    472.4 MiB    472.4 MiB           1   @profile
    27                                         def run_test():
    28                                             # Create the network and move it to the GPU
    29    682.0 MiB    209.6 MiB           1       model = LargeNet()
    30    491.5 MiB   -190.5 MiB           1       model = model.to('cuda')
    31
    32    682.0 MiB    190.5 MiB           1       model = model.to('cpu')
    33    491.5 MiB   -190.5 MiB           1       del model
    34
    35    491.5 MiB      0.0 MiB           1       gc.collect()
    36    491.5 MiB      0.0 MiB           1       torch.cuda.empty_cache()
******* Iteration num: 3 ***********
Filename: test_torch_memory_leak.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    26    491.5 MiB    491.5 MiB           1   @profile
    27                                         def run_test():
    28                                             # Create the network and move it to the GPU
    29    701.1 MiB    209.6 MiB           1       model = LargeNet()
    30    510.6 MiB   -190.5 MiB           1       model = model.to('cuda')
    31
    32    720.2 MiB    209.6 MiB           1       model = model.to('cpu')
    33    529.6 MiB   -190.5 MiB           1       del model
    34
    35    529.6 MiB      0.0 MiB           1       gc.collect()
    36    529.6 MiB      0.0 MiB           1       torch.cuda.empty_cache()
******* Iteration num: 4 ***********
Filename: test_torch_memory_leak.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    26    529.6 MiB    529.6 MiB           1   @profile
    27                                         def run_test():
    28                                             # Create the network and move it to the GPU
    29    720.2 MiB    190.5 MiB           1       model = LargeNet()
    30    529.7 MiB   -190.5 MiB           1       model = model.to('cuda')
    31
    32    682.4 MiB    152.7 MiB           1       model = model.to('cpu')
    33    491.6 MiB   -190.7 MiB           1       del model
    34
    35    491.6 MiB      0.0 MiB           1       gc.collect()
    36    491.6 MiB      0.0 MiB           1       torch.cuda.empty_cache()
******* Iteration num: 5 ***********
Filename: test_torch_memory_leak.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    26    491.6 MiB    491.6 MiB           1   @profile
    27                                         def run_test():
    28                                             # Create the network and move it to the GPU
    29    701.2 MiB    209.6 MiB           1       model = LargeNet()
    30    510.6 MiB   -190.6 MiB           1       model = model.to('cuda')
    31
    32    720.2 MiB    209.6 MiB           1       model = model.to('cpu')
    33    529.7 MiB   -190.5 MiB           1       del model
    34
    35    529.7 MiB      0.0 MiB           1       gc.collect()
    36    529.7 MiB      0.0 MiB           1       torch.cuda.empty_cache()
Press Enter to continue...
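Interestingly, after 3 to 4 iterations the memory usage stabilizes and does not increase any further. Still, this initial behavior is particularly annoying, because the first time the model is loaded I can run it with less memory than in later iterations.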
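As a rough sanity check on these numbers, the sketch below works out how much memory the parameters of LargeNet should occupy. It assumes float32 parameters (PyTorch's default dtype) and counts only weights and biases; the layer sizes are copied from the snippet, while the helper names are made up for illustration.

```python
# (in_features, out_features) for fc1..fc5, copied from LargeNet.
LAYERS = [(10000, 5000), (5000, 1000), (1000, 500), (500, 100), (100, 10)]
BYTES_PER_FLOAT32 = 4  # assumption: default float32 parameters


def linear_param_count(in_features: int, out_features: int) -> int:
    """Number of parameters in an nn.Linear layer: weight matrix plus bias."""
    return in_features * out_features + out_features


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / 2**20


total_params = sum(linear_param_count(i, o) for i, o in LAYERS)
total_mib = to_mib(total_params * BYTES_PER_FLOAT32)
fc1_weight_mib = to_mib(10000 * 5000 * BYTES_PER_FLOAT32)
fc2_weight_mib = to_mib(5000 * 1000 * BYTES_PER_FLOAT32)

print(f"total parameters: {total_params:,}")
print(f"total size:       {total_mib:.1f} MiB")
print(f"fc1 weight only:  {fc1_weight_mib:.1f} MiB")
print(f"fc2 weight only:  {fc2_weight_mib:.1f} MiB")
```

Under that assumption the whole model is about 211.9 MiB, which is close to most of the increments reported on the LargeNet() and .to('cpu') lines, and the fc1 weight alone is about 190.7 MiB, which is close to the recurring -190.5 MiB drops. The fc2 weight is about 19.1 MiB, the same size as the step from 472.4 MiB to 491.5 MiB between iterations, though that may be a coincidence. This is only arithmetic on the sizes; it does not by itself explain why part of the memory is not returned.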
Question:
I have noticed the same issue. I hope someone can help.