Hi everyone, I'm trying to append data to a dataset with h5py, but it doesn't seem to work and I'm trying to figure out why. `numpy_arr` is a generator that yields structured numpy arrays and looks like this:
dt_vlstr = h5py.string_dtype(encoding='utf-8')
dt_vlstr_arr = h5py.vlen_dtype(dt_vlstr)
dt_int = np.dtype('i4')
constructor = {
    'names': ['ark', 'dates', 'events', 'iso639', 'views', 'dublincore', 'catch_word', 'colors', 'bw'],
    'formats': [
        dt_vlstr,      # ark
        dt_vlstr_arr,  # dates
        dt_vlstr_arr,  # events
        dt_vlstr_arr,  # iso639
        dt_int,        # lecture
        dt_vlstr,      # dublincore
        dt_vlstr_arr,  # catch_word
        dt_int,        # colors
        dt_int,        # bw
    ]}
compound = np.dtype(constructor)
def mapping2numpy(generators):
    for i in generators:
        numpy_arr = np.array([(
            i['ark'],
            i['dates'].astype(dt_vlstr),
            i['events'].astype(dt_vlstr),
            i['iso639'].astype(dt_vlstr),
            i['views'],
            i['dublincore'],
            i['catch_word'].astype(dt_vlstr),
            i['colors'],
            i['bw'])], dtype=compound)
        yield numpy_arr
numpy_arr = mapping2numpy(data)
with h5py.File('file.h5', 'w') as h5f:
    group = h5f.create_group('metadata')
    dataset = group.create_dataset('records', (1,1), maxshape=(None,1),
                                   compression="lzf",
                                   dtype=compound,
                                   fletcher32=True,
                                   chunks=(1,1))

with h5py.File('file.h5', 'a') as h5f:
    dset = h5f['metadata/records']
    for data in numpy_arr:
        dset.resize((dset.shape[0]+1, 1))
        dset[-1,:] = data
It's hard to fully diagnose your process without the data. That said, I see a few small things that could be causing problems. Here is what I noticed:

You created the dataset with `shape=(1,1)` and `maxshape=(None,1)`. There are two problems with this:

a. For a compound dataset, `shape` (and `maxshape`) should be 1-tuples, e.g. `shape=(1,)` and `maxshape=(None,)`. Better still, create it with `shape=(0,)` so you don't start with an empty first row.

b. `chunks` needs to match `shape`, but don't set `chunks=(1,)`. I left it out and let h5py set the default chunk size. Chunk size controls I/O performance, and this is the smallest possible chunk size you could request; it can create a serious I/O performance bottleneck.

Here is an example that uses your generator. I simplified your example to just 3 fields (one each with dtype `dt_vlstr`, `dt_vlstr_arr`, and int).
def mapping2numpy(generators):
    for i in generators:
        numpy_arr = np.array([(
            i['ark'],
            i['dates'].astype(dt_vlstr),
            i['views'])],
            dtype=compound)
        yield numpy_arr

dt_vlstr = h5py.string_dtype(encoding='utf-8')
dt_vlstr_arr = h5py.vlen_dtype(dt_vlstr)
dt_int = np.dtype('i4')
constructor = {'names': ['ark', 'dates', 'views'],
               'formats': [dt_vlstr,      # ark
                           dt_vlstr_arr,  # dates
                           dt_int,        # lecture / views
                          ]}
compound = np.dtype(constructor)

# using generator to convert data:
data_arr1 = np.empty((3,), dtype=compound)
data_arr1[:]['ark'] = ['row_0', 'row_1_long', 'row_2_longest']
data_arr1[:]['dates'] = [np.array([['a', 'bbb'], ['cc', 'ddd']]),
                         np.array([['i', 'jjj'], ['kk', 'lll']]),
                         np.array([['w', 'xxx'], ['yy', 'zzz']])]
data_arr1[:]['views'] = [i for i in range(1, 4)]

numpy_arr = mapping2numpy(data_arr1)
with h5py.File('file.h5', 'w') as h5f:
    group = h5f.create_group('metadata')
    group.create_dataset('records1', (0,), maxshape=(None,), dtype=compound)

with h5py.File('file.h5', 'a') as h5f:
    dset = h5f['metadata/records1']
    for i, data in enumerate(numpy_arr):
        dset.resize((dset.shape[0]+1,))
        dset[i] = data
    print(dset[:])
    print(dset.chunks)
I don't know why you wrote the generator. I suspect you need it to cast some of your data objects to the variable-length string and array types. As I mentioned in a comment, loading a dataset row by row is the slowest method. If you only load 1,000 rows it doesn't matter, but if you need to load many rows (10e6), the process will be very slow. See this Q&A for details: pytables writes much faster than h5py. Why? Ignore the PyTables part and focus on the performance penalty of frequently writing small numbers of rows.
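To make that cost concrete, here is a minimal, self-contained sketch (hypothetical file and dataset names `timing_demo.h5`, `slow`, `fast`, and a cut-down 2-field dtype, none of which are from the question) that times row-by-row resize-and-append against a single bulk write of the same structured array:

```python
import time
import numpy as np
import h5py

# cut-down 2-field compound dtype for the timing demo (hypothetical)
dt = np.dtype([('ark', h5py.string_dtype(encoding='utf-8')),
               ('views', 'i4')])

n = 2000
data = np.empty((n,), dtype=dt)
data['ark'] = ['row_%d' % i for i in range(n)]
data['views'] = np.arange(n)

with h5py.File('timing_demo.h5', 'w') as h5f:
    # row-by-row: one resize + one assignment per record
    dset = h5f.create_dataset('slow', (0,), maxshape=(None,), dtype=dt)
    t0 = time.perf_counter()
    for i, row in enumerate(data):
        dset.resize((dset.shape[0] + 1,))
        dset[i] = row
    t_slow = time.perf_counter() - t0

    # all at once: a single write of the whole structured array
    t0 = time.perf_counter()
    h5f.create_dataset('fast', data=data, maxshape=(None,))
    t_fast = time.perf_counter() - t0

print('row-by-row: %.3fs, bulk: %.3fs' % (t_slow, t_fast))
```

On my machine the gap grows quickly with row count; the exact numbers will vary, but the bulk write avoids one resize and one chunked-I/O round trip per record.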
Here is my modified example showing how to write directly from a recarray, both row by row (to `records2`) and all at once (to `records3`). I strongly recommend the last method (but with more than 3 rows at a time). It continues from the code above.
# loading data directly from numpy recarray:
data_arr2 = np.empty((3,), dtype=compound)
data_arr2[:]['ark'] = np.array(['row_0', 'row_1_long', 'row_2_longest'], dtype=dt_vlstr)
data_arr2[:]['dates'] = [np.array([['a', 'bbb'], ['cc', 'ddd']]).astype(dt_vlstr),
                         np.array([['i', 'jjj'], ['kk', 'lll']]).astype(dt_vlstr),
                         np.array([['w', 'xxx'], ['yy', 'zzz']]).astype(dt_vlstr)]
data_arr2[:]['views'] = [i for i in range(1, 4)]

# loading row-by-row -- NOT recommended
with h5py.File('file.h5', 'a') as h5f:
    h5f['metadata'].create_dataset('records2', (0,), maxshape=(None,), dtype=compound)
    dset = h5f['metadata/records2']
    for i, data in enumerate(data_arr2):
        dset.resize((dset.shape[0]+1,))
        dset[i] = data
    print(dset[:])
    print(dset.chunks)

# loading all at once -- preferred method
with h5py.File('file.h5', 'a') as h5f:
    h5f['metadata'].create_dataset('records3', data=data_arr2, maxshape=(None,))
    print(h5f['metadata/records3'][:])
    print(h5f['metadata/records3'].chunks)
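One follow-up point worth knowing when you read these records back: h5py returns variable-length UTF-8 strings as Python `bytes`, so decode them if you need `str`. A minimal sketch with a cut-down 2-field dtype (the file name `readback_demo.h5` is hypothetical):

```python
import numpy as np
import h5py

dt_vlstr = h5py.string_dtype(encoding='utf-8')
compound = np.dtype([('ark', dt_vlstr), ('views', 'i4')])

arr = np.empty((2,), dtype=compound)
arr['ark'] = ['row_0', 'row_1']
arr['views'] = [1, 2]

with h5py.File('readback_demo.h5', 'w') as h5f:
    h5f.create_dataset('records', data=arr)

with h5py.File('readback_demo.h5', 'r') as h5f:
    recs = h5f['records'][:]          # numpy structured array
    # vlen utf-8 string fields come back as bytes; decode for use as str
    arks = [b.decode('utf-8') for b in recs['ark']]

print(arks)   # -> ['row_0', 'row_1']
```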