I am using h5py to read in preprocessed data to feed into a convolutional neural network. All of the input images have the same dimensions. I am using the following read/write syntax:
    # Read
    with h5py.File(data_path, 'r') as x:
        numpy_array = x['key'][:]

    # Write
    x = h5py.File(data_path, 'a')
    x.create_dataset('key', data=numpy_array)
    x.close()
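For completeness, the write can also go through a context manager, which guarantees the file is closed even if an exception is raised. A minimal equivalent of the write above (data_path and 'key' are the same placeholders as in the snippet):

    # Same write as above, but the context manager closes the
    # file automatically when the block exits.
    with h5py.File(data_path, 'a') as x:
        x.create_dataset('key', data=numpy_array)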
My dataset has roughly 500 samples. For some strange reason, during the first N training iterations (N seems to vary), every per-iteration read of the hdf5 files shows an execution time like:

    load data time: 0.10571813583374023

But then, suddenly, at iteration N+1, loading the data starts taking considerably longer:

    load data time: 1.5208463668823242

Any idea what could be causing this? Once the performance change happens, it never goes back. Given that all the files are the same size, this makes no sense to me. Even when I read through all the samples and start over, the files that initially loaded quickly now take a long time to load.
Edit: here is the exact code using the "with h5py.File() as x" syntax, along with the observed output behavior.
    def train(points_h5f, img_h5f, labels_h5f):
        '''
        Populating dictionaries used by external libraries later on in code
        '''
        for i in range(num_samples):
            a = time.time()
            # Load points
            points = {}
            points['dict_key'] = {'points': points_h5f['points/point_{}'.format(i)][:]}
            # Load images
            images = {}
            for cam in camera_sensors:
                prop_d = {}
                for prop in camera_prop:
                    prop_d[prop] = img_h5f['{}/{}/{}_{}'.format(cam, prop, prop, i)][:]
                images[cam] = prop_d
            # Load labels
            labels = []
            for j in range(num_labels):
                labels.append(labels_h5f['label_groups/label_{}_{}'.format(i, j)][:])
            b = time.time()
            print('Iteration: {} \nload data time: {}\n'.format(i, b - a))

    with h5py.File('path/all_points.hdf5', 'r') as points_h5f:
        with h5py.File('path/all_images.hdf5', 'r') as img_h5f:
            with h5py.File('path/all_labels.hdf5', 'r') as labels_h5f:
                train(points_h5f, img_h5f, labels_h5f)
Output:

    Iteration: 0
    load data time: 0.09873628616333008
    Iteration: 1
    load data time: 0.09973263740539551
    Iteration: 2
    load data time: 0.09973430633544922
    Iteration: 3
    load data time: 0.1057431697845459
    ...
    Iteration: 125
    load data time: 0.09771347045898438
    Iteration: 126
    load data time: 0.24407505989074707
    Iteration: 127
    load data time: 1.0163114070892334
    Iteration: 128
    load data time: 1.0114076137542725
    Iteration: 129
    load data time: 1.0284936428070068
    Iteration: 130
    load data time: 1.1249558925628662
    Iteration: 131
    load data time: 1.025432825088501
    ...
    Iteration: 500
    load data time: 1.114523423498758
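One quick way to narrow this down (a minimal diagnostic sketch, not part of the training code; it reuses the file path and key pattern from the code above) is to time a re-read of one fixed dataset on every pass. If reading the same dataset also jumps at iteration N+1, the slowdown depends on elapsed time or accumulated state, not on the sample index:

    import time
    import h5py

    # Re-read the same dataset on every pass and time it.
    with h5py.File('path/all_points.hdf5', 'r') as points_h5f:
        for i in range(500):  # ~500 samples, as above
            t0 = time.time()
            _ = points_h5f['points/point_0'][:]  # always the same dataset
            print('Iteration: {} fixed re-read time: {}'.format(i, time.time() - t0))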
Here are two simple tests that create an HDF5 file holding 1000 float arrays of shape (200,200,3).

- Using Method 1, I consistently get 0.17-0.20 seconds per 100 datasets.
- Using Method 2, I consistently get 0.23-0.25 seconds per 100 datasets.

These times are for writes to a HDD; expect faster results on a SSD. Method 2 is slightly slower, but nothing like the difference you are seeing.
Method 1: open the HDF5 file once, using with -- as:

    import h5py
    import numpy as np
    import time

    num = 1000

    with h5py.File('SO_59555208_1.h5', 'w') as h5f:
        # time.perf_counter() is used because time.clock() was removed in Python 3.8
        start = time.perf_counter()
        for cnt in range(num):
            if cnt % (num // 10) == 0 and cnt > 1:
                print('dataset count: {}/{}'.format(cnt, num))
                print('Elapsed time =', (time.perf_counter() - start))
                start = time.perf_counter()
            ds_name = 'key_' + str(cnt)
            # Create sample image data and add it as a dataset
            img_data = np.random.rand(200 * 200 * 3, 1).reshape(200, 200, 3)
            dset = h5f.create_dataset(ds_name, data=img_data)
        print('dataset count: {}/{}'.format(cnt, num))
        print('Elapsed time =', (time.perf_counter() - start))
    print('DONE')
Method 2: open/close the HDF5 file to add each dataset:
    import h5py
    import numpy as np
    import time

    num = 1000

    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    for cnt in range(num):
        if cnt % (num // 10) == 0 and cnt > 1:
            print('dataset count: {}/{}'.format(cnt, num))
            print('Elapsed time =', (time.perf_counter() - start))
            start = time.perf_counter()
        h5f = h5py.File('SO_59555208_m.h5', 'a')
        ds_name = 'key_' + str(cnt)
        # Create sample image data and add it as a dataset
        img_data = np.random.rand(200 * 200 * 3, 1).reshape(200, 200, 3)
        dset = h5f.create_dataset(ds_name, data=img_data)
        h5f.close()
    print('dataset count: {}/{}'.format(cnt, num))
    print('Elapsed time =', (time.perf_counter() - start))
    print('DONE')
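Since the original problem is about read times rather than write times, a matching read-back test is easy to add. This is a minimal sketch, assuming the 'SO_59555208_1.h5' file from Method 1 already exists; it times reads of the 1000 datasets in batches of 100, mirroring the write tests:

    import h5py
    import time

    num = 1000
    # Read back the datasets written by Method 1, timing each batch of 100
    with h5py.File('SO_59555208_1.h5', 'r') as h5f:
        start = time.perf_counter()
        for cnt in range(num):
            if cnt % (num // 10) == 0 and cnt > 1:
                print('dataset count: {}/{}'.format(cnt, num))
                print('Elapsed time =', (time.perf_counter() - start))
                start = time.perf_counter()
            img_data = h5f['key_' + str(cnt)][:]
        print('dataset count: {}/{}'.format(cnt, num))
        print('Elapsed time =', (time.perf_counter() - start))
    print('DONE')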