Loading a Python pickle in a for loop gets slower

I have a 25 GB dictionary of numpy arrays. The dictionary looks like this:

  • 668,956 key-value pairs.
  • Keys are strings, for example:
    "109c3708-3b0c-4868-a647-b9feb306c886_1"
  • Values are numpy arrays of shape 200x23 and dtype float64.

When I repeatedly load the data with pickle in a loop, the load times get slower and slower (see the code and results below). What could be causing this?

Code:

import pickle
import time


def load_pickle(file: int) -> dict:
    with open(f"D:/data/batched/{file}.pickle", "rb") as handle:
        return pickle.load(handle)


for i in range(0, 9):
    print(f"\nIteration {i}")

    start_time = time.time()
    file = None          # drop the reference to the previously loaded dict
    print(f"Unloaded file in {time.time() - start_time:.2f} seconds")

    start_time = time.time()
    file = load_pickle(0)
    print(f"Loaded file in {time.time() - start_time:.2f} seconds")

Results:

Iteration 0
Unloaded file in 0.00 seconds
Loaded file in 18.80 seconds

Iteration 1
Unloaded file in 14.78 seconds
Loaded file in 30.51 seconds

Iteration 2
Unloaded file in 28.67 seconds
Loaded file in 30.21 seconds

Iteration 3
Unloaded file in 35.38 seconds
Loaded file in 40.25 seconds

Iteration 4
Unloaded file in 39.91 seconds
Loaded file in 41.24 seconds

Iteration 5
Unloaded file in 43.25 seconds
Loaded file in 45.57 seconds

Iteration 6
Unloaded file in 46.94 seconds
Loaded file in 48.19 seconds

Iteration 7
Unloaded file in 51.67 seconds
Loaded file in 51.32 seconds

Iteration 8
Unloaded file in 55.25 seconds
Loaded file in 56.11 seconds

Notes:

  • During the loop, RAM usage gradually drops (I assume this is the previous
    data held by the file variable being dereferenced) and then rises again.
    Both the unload and the load part seem to slow down over time. What
    surprised me is how slowly the RAM drops during the unload part.
  • The peak RAM usage it climbs to stays the same (it does not look like a
    memory leak).
  • I have tried adding del file and gc.collect() inside the loop, but that
    does not speed anything up.
  • If I change return pickle.load(handle) to return handle.read(), the
    unload time is consistently 0.45 seconds and the load time is
    consistently 4.85 seconds (see the sketch just after this list).
  • I am on Windows with SSD storage, using Python 3.9.13
    (Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:51:29) [MSC v.1929 64 bit (AMD64)]).
  • I have 64 GB of RAM, which never seems to be maxed out.
  • Why am I doing this? During training of an ML model I have 10 files of
    25 GB each. They cannot all fit in memory at the same time, so I have to
    load and unload them every epoch.
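
A minimal sketch of that last handle.read() experiment, with the raw read and the deserialization timed separately (the helper name is hypothetical; the path is the same one used above). Given the handle.read() numbers, this suggests the slowdown sits in pickle's object construction rather than in file I/O:

import pickle
import time


def load_pickle_split(file: int) -> dict:
    # Hypothetical variant of load_pickle() that times file I/O and
    # deserialization separately.
    start = time.time()
    with open(f"D:/data/batched/{file}.pickle", "rb") as handle:
        raw = handle.read()               # pure file I/O
    print(f"read     in {time.time() - start:.2f} seconds")

    start = time.time()
    data = pickle.loads(raw)              # object construction only
    print(f"unpickle in {time.time() - start:.2f} seconds")
    return data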

Any ideas? I am also open to moving away from pickle if there is an alternative with comparable read speed that does not run into the issue above (I am not worried about compression).

Edit: I have run the load/unload loop above against pickles of different sizes. The results show the relative change in speed over time; for anything above about 3 GB, the unload time starts to increase significantly.
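
A minimal sketch of one such alternative, assuming every value really is a 200x23 float64 array (the helper names and file layout are hypothetical, not from the post): stack all values into a single .npy file and pickle only the key-to-row mapping, so each epoch loads one contiguous block instead of building roughly 669,000 small array objects.

import pickle

import numpy as np


def save_batch(data: dict, stem: str) -> None:
    # Stack every 200x23 value into one (n, 200, 23) array and store the
    # key -> row mapping separately.
    keys = list(data.keys())
    np.save(f"{stem}.npy", np.stack([data[k] for k in keys]))
    with open(f"{stem}.keys.pickle", "wb") as handle:
        pickle.dump({k: i for i, k in enumerate(keys)}, handle,
                    pickle.HIGHEST_PROTOCOL)


def load_batch(stem: str):
    # One large contiguous read plus a small pickle of plain strings and ints.
    arrays = np.load(f"{stem}.npy")
    with open(f"{stem}.keys.pickle", "rb") as handle:
        index = pickle.load(handle)
    return index, arrays  # the original data[k] corresponds to arrays[index[k]]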

1 Answer

I would love to know the cause of this slowdown, and I have hit it in a similar task. I "solved" it by using h5py instead of pickle. Everything below was tested on Windows 11; it will run on Linux next week.

My task is reading millions of numpy images and extracting regions from them on the fly. The application dictates that the images are stored in files in batches of roughly 3000 to 6000.

pickle

import os
import pickle
import time

import numpy as np
import psutil

filePath = "images.pickle"  # example path; not specified in the original snippet

# Write one batch of 4000 random 300x300 uint8 images as a single pickle.
imagesDict = {i: np.random.randint(0, 255, (300, 300), dtype=np.uint8) for i in range(4000)}
with open(filePath, 'wb') as file:
    pickle.dump(imagesDict, file, pickle.HIGHEST_PROTOCOL)


thumbs = []
num_image_sets = 0
durations_s_sum = 0.
for i in range(500):
    start_s = time.perf_counter()
    with open(filePath, 'rb') as file:
        imagesDict: dict[int, np.ndarray] = pickle.load(file)
        for key in imagesDict.keys():
            image = imagesDict[key]
            thumb = image[:50, :50].copy()
            thumbs.append(thumb)

    durations_s_sum += (time.perf_counter() - start_s)
    num_image_sets += 1
    if 50 <= num_image_sets:
        memory_info = psutil.Process(os.getpid()).memory_info()
        print(f"{durations_s_sum:4.1f}s for 50 image sets of 4000 images, rss={memory_info.rss/1024/1024:6,.0f}MB, vms={memory_info.vms/1024/1024:6,.0f}MB")
        durations_s_sum = 0.
        num_image_sets = 0

pickle.load() gets slower with every iteration and quickly reaches an unacceptable level:

10.6s for 50 image sets of 4000 images, rss= 1,575MB, vms= 1,579MB
10.0s for 50 image sets of 4000 images, rss= 2,117MB, vms= 2,134MB
11.5s for 50 image sets of 4000 images, rss= 2,632MB, vms= 2,662MB
14.2s for 50 image sets of 4000 images, rss= 3,150MB, vms= 3,193MB
16.3s for 50 image sets of 4000 images, rss= 3,670MB, vms= 3,726MB
19.1s for 50 image sets of 4000 images, rss= 4,212MB, vms= 4,280MB
22.6s for 50 image sets of 4000 images, rss= 4,746MB, vms= 4,824MB
25.4s for 50 image sets of 4000 images, rss= 5,276MB, vms= 5,367MB
29.2s for 50 image sets of 4000 images, rss= 5,817MB, vms= 5,919MB
35.3s for 50 image sets of 4000 images, rss= 6,360MB, vms= 6,472MB

h5py

import h5py

# Write the same 4000 images, one HDF5 dataset per image.
with h5py.File(filePath, 'w') as h5:
    for i in range(4000):
        image = np.random.randint(0, 255, (300, 300), dtype=np.uint8)
        h5.create_dataset(str(i), data=image)

thumbs = []
num_image_sets = 0
durations_s_sum = 0.
for i in range(500):
    start_s = time.perf_counter()
    with h5py.File(filePath, "r") as h5:
        for key in h5.keys():
            image = h5[key]
            thumb = image[:50, :50]
            thumbs.append(thumb)

    durations_s_sum += (time.perf_counter() - start_s)
    num_image_sets += 1
    if 50 <= num_image_sets:
        memory_info = psutil.Process(os.getpid()).memory_info()
        print(f"{durations_s_sum:4.1f}s for 50 image sets of 4000 images, rss={memory_info.rss/1024/1024:6,.0f}MB, vms={memory_info.vms/1024/1024:6,.0f}MB")
        durations_s_sum = 0.
        num_image_sets = 0

h5py is slower per pass, but its duration stays almost constant at around 19 seconds, so over time it wins:

20.3s for 50 image sets of 4000 images, rss=   646MB, vms=   637MB
20.3s for 50 image sets of 4000 images, rss= 1,166MB, vms= 1,167MB
19.7s for 50 image sets of 4000 images, rss= 1,685MB, vms= 1,697MB
19.4s for 50 image sets of 4000 images, rss= 2,208MB, vms= 2,229MB
19.7s for 50 image sets of 4000 images, rss= 2,731MB, vms= 2,764MB
19.8s for 50 image sets of 4000 images, rss= 3,255MB, vms= 3,298MB
19.4s for 50 image sets of 4000 images, rss= 3,778MB, vms= 3,832MB
19.9s for 50 image sets of 4000 images, rss= 4,303MB, vms= 4,366MB
19.6s for 50 image sets of 4000 images, rss= 4,826MB, vms= 4,899MB
19.9s for 50 image sets of 4000 images, rss= 5,349MB, vms= 5,434MB

Also, if memory fragmentation were the problem, why does h5py not show the same behavior?
