h5py: appending arrays from a generator to an h5 file

Problem description

Hi everyone. I'm trying to append data to a dataset using h5py, but it doesn't seem to work, and I'm trying to figure out why.

numpy_arr is a generator that yields structured numpy arrays and looks like this:


import h5py
import numpy as np

dt_vlstr = h5py.string_dtype(encoding='utf-8')
dt_vlstr_arr = h5py.vlen_dtype(dt_vlstr)
dt_int = np.dtype('i4')

constructor = {
    'names': ['ark', 'dates', 'events', 'iso639', 'views', 'dublincore', 'catch_word', 'colors', 'bw'],
    'formats': [
        dt_vlstr,      # ark
        dt_vlstr_arr,  # dates
        dt_vlstr_arr,  # events
        dt_vlstr_arr,  # iso639
        dt_int,        # views
        dt_vlstr,      # dublincore
        dt_vlstr_arr,  # catch_word
        dt_int,        # colors
        dt_int,        # bw
    ]}

compound = np.dtype(constructor)

def mapping2numpy(generators):
    # Convert each mapping from the source iterable into a 1-row
    # structured array with the compound dtype, casting the string
    # fields to the variable-length string dtype.
    for i in generators:
        numpy_arr = np.array([(
            i['ark'],
            i['dates'].astype(dt_vlstr),
            i['events'].astype(dt_vlstr),
            i['iso639'].astype(dt_vlstr),
            i['views'],
            i['dublincore'],
            i['catch_word'].astype(dt_vlstr),
            i['colors'],
            i['bw'])], dtype=compound)

        yield numpy_arr

numpy_arr = mapping2numpy(data)

with h5py.File('file.h5', 'w') as h5f:
    group = h5f.create_group('metadata')
    dataset = group.create_dataset('records', (1,1), maxshape=(None,1),
                                   compression="lzf",
                                   dtype=compound,
                                   fletcher32=True,
                                   chunks=(1,1))

with h5py.File('file.h5', 'a') as h5f:
    dset = h5f['metadata/records']
    for data in numpy_arr:
        dset.resize((dset.shape[0]+1, 1))
        dset[-1,:] = data


1 Answer

It's hard to fully diagnose your process without data. That said, I see a few small things that could be causing problems. Here is what I noticed:

  1. When you create the dataset, you set shape=(1,1) and maxshape=(None,1). There are two problems with this:
     a. For a compound dataset, shape (and maxshape) should be 1-tuples, e.g.: shape=(1,) and maxshape=(None,).
     b. You created the dataset with one empty row, then added a new row each time you added data. So the first row is always empty. Not an error, but it's better to set shape=(0,) when creating the dataset.
  2. chunks needs to match shape, but don't set chunks=(1,). I left it out and let h5py pick the default chunk size. Chunk size controls I/O performance, and this is the smallest possible chunk size you could request. It can create a serious I/O performance bottleneck.

Here is an example that uses your generator. I simplified your example to just 3 fields (one each with dtype=dt_vlstr, dt_vlstr_arr, and int).

def mapping2numpy(generators):
    for i in generators:
        numpy_arr = np.array([(
            i['ark'],
            i['dates'].astype(dt_vlstr),
            i['views'])],
            dtype=compound)

        yield numpy_arr


dt_vlstr = h5py.string_dtype(encoding='utf-8')
dt_vlstr_arr = h5py.vlen_dtype(dt_vlstr) 
dt_int = np.dtype('i4')

constructor = {'names': ['ark', 'dates', 'views'],
               'formats': [dt_vlstr,      # ark
                           dt_vlstr_arr,  # dates
                           dt_int,        # views
                           ]}
compound = np.dtype(constructor)

# using generator to convert data:
data_arr1 = np.empty((3,), dtype=compound)
data_arr1[:]['ark'] = ['row_0', 'row_1_long', 'row_2_longest']
data_arr1[:]['dates'] = [np.array([['a', 'bbb'], ['cc', 'ddd']]),
                         np.array([['i', 'jjj'], ['kk', 'lll']]),
                         np.array([['w', 'xxx'], ['yy', 'zzz']])]
data_arr1[:]['views'] = [i for i in range(1, 4)]

numpy_arr = mapping2numpy(data_arr1)

with h5py.File('file.h5', 'w') as h5f:
    group = h5f.create_group('metadata')
    group.create_dataset('records1', (0,), maxshape=(None,), dtype=compound)
                                    
with h5py.File('file.h5', 'a') as h5f: 
    dset = h5f['metadata/records1']
    for i, data in enumerate(numpy_arr):
        dset.resize((dset.shape[0]+1,))
        dset[i] = data
    print(dset[:])
    print(dset.chunks)

I don't know why you are writing a generator. I suspect you need to cast some data object types to the variable-length string and array types. As I mentioned in the comments, loading a dataset row by row is the slowest way to do it. If you only load 1,000 rows, it doesn't matter. But if you need to load a lot of rows (10e6), the process will be very slow. See this Q&A for details: "pytables writes much faster than h5py. Why?" Ignore the PyTables part and focus on the performance issues when frequently writing small numbers of rows.
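If you do keep the generator, a middle ground is to buffer rows and write them in blocks instead of one at a time. Here is a minimal sketch (my addition, not the original code) that collects 1-row arrays with itertools.islice and does one resize/write per block; the append_in_blocks name and the block size of 1000 are arbitrary choices:

import itertools

def append_in_blocks(dset, row_gen, block_size=1000):
    # Collect up to block_size 1-row arrays from the generator,
    # then do a single resize + slice assignment per block.
    while True:
        block = list(itertools.islice(row_gen, block_size))
        if not block:
            break
        rows = np.concatenate(block)   # (n,) structured array
        start = dset.shape[0]
        dset.resize((start + len(rows),))
        dset[start:] = rows

For example, inside an open h5py.File block you would call append_in_blocks(h5f['metadata/records1'], mapping2numpy(data_arr1)).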

Here is my modified example showing how to write directly from a recarray, both row by row (to records2) and all at once (to records3). I strongly recommend the last method (but with more than 3 rows at a time). It continues from the code above.

# loading data directly from numpy recarray:
data_arr2 = np.empty((3,), dtype=compound)
data_arr2[:]['ark'] = np.array(['row_0', 'row_1_long', 'row_2_longest'], dtype=dt_vlstr)
data_arr2[:]['dates'] = [np.array([['a', 'bbb'], ['cc', 'ddd']]).astype(dt_vlstr),
                         np.array([['i', 'jjj'], ['kk', 'lll']]).astype(dt_vlstr),
                         np.array([['w', 'xxx'], ['yy', 'zzz']]).astype(dt_vlstr)]
data_arr2[:]['views'] = [i for i in range(1, 4)]
    
# loading row-by-row -- NOT recommended
with h5py.File('file.h5', 'a') as h5f: 
    h5f['metadata'].create_dataset('records2', (0,), maxshape=(None,), dtype=compound)
    dset = h5f['metadata/records2']
    for i, data in enumerate(data_arr2):
        dset.resize((dset.shape[0]+1,))
        dset[i] = data
    print(dset[:])
    print(dset.chunks)   
    
# loading all at once -- preferred method
with h5py.File('file.h5', 'a') as h5f:     
    h5f['metadata'].create_dataset('records3', data=data_arr2, maxshape=(None,))
    print(h5f['metadata/records3'][:])   
    print(h5f['metadata/records3'].chunks)   
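One detail to watch when you read the data back: with h5py 3.x, variable-length string fields are returned as bytes objects by default, so you may want to decode them. A short sketch, assuming the file written above:

with h5py.File('file.h5', 'r') as h5f:
    arr = h5f['metadata/records3'][:]
    # vlen string fields are read back as bytes objects in h5py 3.x
    print([s.decode('utf-8') for s in arr['ark']])
    # each 'dates' element is a ragged array of bytes; decode element-wise
    print([[s.decode('utf-8') for s in d] for d in arr['dates']])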