HDF5 dataset read performance

Question

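I have millions of images that I want to read as quickly as possible, and I want to be able to read them in random order. I store them in an HDF5 file, but I found that read time increases significantly when I access the datasets in random order rather than in the order the keys are returned, as the code and the profiler timeline show: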

    import h5py
    import numpy as np
    from random import shuffle
    from time import sleep

    with h5py.File("/slowdata/caid_2024/GenImage_compressed.h5") as hf:
        keys = list(hf.keys())

        sleep(0.1)  # short pause between phases

        # Read the first 16 datasets in the order the keys are returned
        for i, key in enumerate(keys):
            if i == 16:
                break
            np.array(hf[key])

        # Random order now
        shuffle(keys)
        sleep(0.1)
        for i, key in enumerate(keys):
            if i == 16:
                break
            np.array(hf[key])
        sleep(0.1)

Answer 1

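I can't explain why you see such a large difference in the profiler. I would not expect I/O performance to vary significantly with the order in which datasets are read. To confirm this, I built a simple test that mimics your code and found very little variation. My code is at the end. HDF5 datasets are not accessed sequentially, so there is no advantage to reading them in creation order. In addition, dataset names (also known as keys) are returned in alphabetical order, not creation order. In other words, dataset Alpha is returned before Zulu, even if Zulu was created first. So, with default settings, you can't really control reading in creation order.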
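The key-ordering point can be checked with a small sketch. The scratch file name and the dataset contents here are hypothetical, chosen only for illustration; the file is opened with h5py's default settings.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "key_order_demo.h5")

with h5py.File(path, "w") as hf:
    # Create Zulu first, then Alpha.
    hf.create_dataset("Zulu", data=[1, 2, 3])
    hf.create_dataset("Alpha", data=[4, 5, 6])

with h5py.File(path, "r") as hf:
    # With default settings, keys are iterated by name, not creation order.
    keys = list(hf.keys())

print(keys)
```

Alpha should be listed before Zulu even though Zulu was written first.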

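That said, h5py/HDF5 I/O performance is governed by many factors:

The most important factor when working with very large datasets is chunked I/O (if it is enabled and the chunk size is set appropriately).

Another factor is the size and number of the data blocks read. Reading many small blocks is slower than reading a few large ones.
Compression also reduces I/O performance.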
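As a rough sketch of how chunking and compression are set in h5py, the example below stores a small stack of images with one chunk per image. The file name, the (image, height, width) layout, and the sizes are assumptions made for illustration; they are not taken from the question's file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")

# Small stand-in for an image stack: 8 images of 64 x 64 pixels.
images = np.random.randint(0, 256, size=(8, 64, 64), dtype=np.uint8)

with h5py.File(path, "w") as hf:
    hf.create_dataset(
        "images",
        data=images,
        chunks=(1, 64, 64),   # one chunk per image (assumed layout)
        compression="gzip",   # optional: smaller file, extra CPU work on read
    )

with h5py.File(path, "r") as hf:
    ds = hf["images"]
    chunk_shape = ds.chunks
    compression = ds.compression
    one_image = ds[5]         # slicing reads only the requested image

print(chunk_shape, compression, one_image.shape)
```

With one image per chunk, a random read of a single image only has to fetch and decompress that image's chunk. Whether this helps in practice depends on the real image sizes and the storage behind the file.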

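np.array(hf[key]) creates a NumPy array from the dataset. The preferred method is hf[key][()]. I benchmarked both methods. The preferred method is slightly faster (6-10%), but it shows the same random vs. alphabetical behavior. The performance statistics from my test code are at the end. Note that reading in random order is actually somewhat faster than reading alphabetically. This is probably due to I/O caching. I would need more details about your file and datasets to diagnose why you see a difference.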
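A minimal sketch of the two read styles, using a tiny hypothetical dataset so that it runs quickly. It only shows that both return the same NumPy array; it does not attempt to reproduce the timing difference.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file and dataset name, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "read_style_demo.h5")
data = np.arange(12, dtype=np.int32).reshape(3, 4)

with h5py.File(path, "w") as hf:
    hf.create_dataset("image_00", data=data)

with h5py.File(path, "r") as hf:
    key = "image_00"
    via_array = np.array(hf[key])  # converts the Dataset object to an array
    via_index = hf[key][()]        # reads the whole dataset into an array

same = np.array_equal(via_array, via_index)
print(type(via_index).__name__, same)
```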

    dataset shape = 256, 256, 10_000
    H5 size: 39 GB
    WITHOUT CHUNKED I/O
    Time to create file: 373.49 sec
    read performance:
    As written using np.array(hfr[key]):
    Time to read datasets alphabetically: 407.66 sec
    Time to read datasets randomly: 355.97 sec
    Modified to use hfr[key][()]:
    Time to read datasets alphabetically: 360.67 sec
    Time to read datasets randomly: 333.69 sec

Code below:

    import time
    from random import shuffle

    import h5py
    import numpy as np

    img_w, img_h, img_cnt = 256, 256, 10_000
    ds_cnt = 16

    # Create the test file: 16 datasets, each filled with a constant value
    start = time.time()
    with h5py.File("SO_78500597.h5", "w") as hfw:
        for i in range(ds_cnt):
            rgb_val = int((255/ds_cnt)*i)
            hfw.create_dataset(f'image_{i:02}', data=np.full((img_w, img_h, img_cnt), rgb_val, dtype=int))
    print(f"Time to create file: {(time.time()-start):.2f} sec")

    with h5py.File("SO_78500597.h5") as hfr:
        keys = list(hfr.keys())
        print(keys)

        # Alphabetical order (the order in which keys are returned)
        start = time.time()
        for i, key in enumerate(keys):
            if i == 16:
                break
            #np.array(hfr[key])
            arr = hfr[key][()]
        print(f"Time to read datasets alphabetically: {(time.time()-start):.2f} sec")

        # Random order now
        shuffle(keys)
        start = time.time()
        for i, key in enumerate(keys):
            if i == 16:
                break
            #np.array(hfr[key])
            arr = hfr[key][()]
        print(f"Time to read datasets randomly: {(time.time()-start):.2f} sec")
