I am using h5py to read in preprocessed data to feed into a convolutional neural network. All of the input images have the same dimensions. I am using the following read/write syntax:
    # Read
    with h5py.File(data_path, 'r') as x:
        numpy_array = x['key'][:]

    # Write
    x = h5py.File(data_path, 'a')
    x.create_dataset('key', data=numpy_array)
    x.close()
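For completeness, the write can also go through a context manager, which guarantees the file is closed even if an exception is raised. A minimal equivalent of the write above (data_path and 'key' are the same placeholders as in the snippet):

    # Same write as above, but the context manager closes the
    # file automatically when the block exits.
    with h5py.File(data_path, 'a') as x:
        x.create_dataset('key', data=numpy_array)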
My dataset has roughly 500 samples. For some strange reason, during the first N training iterations (N seems to vary), every per-iteration read of the hdf5 files shows an execution time like:

    load data time: 0.10571813583374023

But then, suddenly, at iteration N+1, loading the data starts taking considerably longer:

    load data time: 1.5208463668823242

Any idea what could be causing this? Once the performance change happens, it never goes back. Given that all the files are the same size, this makes no sense to me. Even when I read through all the samples and start over, the files that initially loaded quickly now take a long time to load.
Edit: here is the exact code using the "with h5py.File() as x" syntax, along with the observed output behavior.
    def train(points_h5f, img_h5f, labels_h5f):
        '''
        Populating dictionaries used by external libraries later on in code
        '''
        for i in range(num_samples):
            a = time.time()
            # Load points
            points = {}
            points['dict_key'] = {'points': points_h5f['points/point_{}'.format(i)][:]}
            # Load images
            images = {}
            for cam in camera_sensors:
                prop_d = {}
                for prop in camera_prop:
                    prop_d[prop] = img_h5f['{}/{}/{}_{}'.format(cam, prop, prop, i)][:]
                images[cam] = prop_d
            # Load labels
            labels = []
            for j in range(num_labels):
                labels.append(labels_h5f['label_groups/label_{}_{}'.format(i, j)][:])
            b = time.time()
            print('Iteration: {} \nload data time: {}\n'.format(i, b - a))

    with h5py.File('path/all_points.hdf5', 'r') as points_h5f:
        with h5py.File('path/all_images.hdf5', 'r') as img_h5f:
            with h5py.File('path/all_labels.hdf5', 'r') as labels_h5f:
                train(points_h5f, img_h5f, labels_h5f)
Output:

    Iteration: 0
    load data time: 0.09873628616333008
    Iteration: 1
    load data time: 0.09973263740539551
    Iteration: 2
    load data time: 0.09973430633544922
    Iteration: 3
    load data time: 0.1057431697845459
    ...
    Iteration: 125
    load data time: 0.09771347045898438
    Iteration: 126
    load data time: 0.24407505989074707
    Iteration: 127
    load data time: 1.0163114070892334
    Iteration: 128
    load data time: 1.0114076137542725
    Iteration: 129
    load data time: 1.0284936428070068
    Iteration: 130
    load data time: 1.1249558925628662
    Iteration: 131
    load data time: 1.025432825088501
    ...
    Iteration: 500
    load data time: 1.114523423498758
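One quick way to narrow this down (a minimal diagnostic sketch, not part of the training code; it reuses the file path and key pattern from the code above) is to time a re-read of one fixed dataset on every pass. If reading the same dataset also jumps at iteration N+1, the slowdown depends on elapsed time or accumulated state, not on the sample index:

    import time
    import h5py

    # Re-read the same dataset on every pass and time it.
    with h5py.File('path/all_points.hdf5', 'r') as points_h5f:
        for i in range(500):  # ~500 samples, as above
            t0 = time.time()
            _ = points_h5f['points/point_0'][:]  # always the same dataset
            print('Iteration: {} fixed re-read time: {}'.format(i, time.time() - t0))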
Here are two simple tests that create an HDF5 file holding 1000 float arrays of shape (200,200,3).

- Using Method 1, I consistently get 0.17-0.20 seconds per 100 datasets.
- Using Method 2, I consistently get 0.23-0.25 seconds per 100 datasets.

These times are for writes to a HDD; expect faster results on a SSD. Method 2 is slightly slower, but nothing like the difference you are seeing.
Method 1: open the HDF5 file once, using with -- as:

    import h5py
    import numpy as np
    import time

    num = 1000

    with h5py.File('SO_59555208_1.h5', 'w') as h5f:
        # time.perf_counter() is used because time.clock() was removed in Python 3.8
        start = time.perf_counter()
        for cnt in range(num):
            if cnt % (num // 10) == 0 and cnt > 1:
                print('dataset count: {}/{}'.format(cnt, num))
                print('Elapsed time =', (time.perf_counter() - start))
                start = time.perf_counter()
            ds_name = 'key_' + str(cnt)
            # Create sample image data and add it as a dataset
            img_data = np.random.rand(200 * 200 * 3, 1).reshape(200, 200, 3)
            dset = h5f.create_dataset(ds_name, data=img_data)
        print('dataset count: {}/{}'.format(cnt, num))
        print('Elapsed time =', (time.perf_counter() - start))
    print('DONE')
Method 2: open/close the HDF5 file to add each dataset:
    import h5py
    import numpy as np
    import time

    num = 1000

    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    for cnt in range(num):
        if cnt % (num // 10) == 0 and cnt > 1:
            print('dataset count: {}/{}'.format(cnt, num))
            print('Elapsed time =', (time.perf_counter() - start))
            start = time.perf_counter()
        h5f = h5py.File('SO_59555208_m.h5', 'a')
        ds_name = 'key_' + str(cnt)
        # Create sample image data and add it as a dataset
        img_data = np.random.rand(200 * 200 * 3, 1).reshape(200, 200, 3)
        dset = h5f.create_dataset(ds_name, data=img_data)
        h5f.close()
    print('dataset count: {}/{}'.format(cnt, num))
    print('Elapsed time =', (time.perf_counter() - start))
    print('DONE')
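Since the original problem is about read times rather than write times, a matching read-back test is easy to add. This is a minimal sketch, assuming the 'SO_59555208_1.h5' file from Method 1 already exists; it times reads of the 1000 datasets in batches of 100, mirroring the write tests:

    import h5py
    import time

    num = 1000
    # Read back the datasets written by Method 1, timing each batch of 100
    with h5py.File('SO_59555208_1.h5', 'r') as h5f:
        start = time.perf_counter()
        for cnt in range(num):
            if cnt % (num // 10) == 0 and cnt > 1:
                print('dataset count: {}/{}'.format(cnt, num))
                print('Elapsed time =', (time.perf_counter() - start))
                start = time.perf_counter()
            img_data = h5f['key_' + str(cnt)][:]
        print('dataset count: {}/{}'.format(cnt, num))
        print('Elapsed time =', (time.perf_counter() - start))
    print('DONE')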