H5py: merging matched rows from a huge HDF5 file into a smaller dataset

Question

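I have two huge HDF5 files. Each is indexed by ID, and each contains different information about every ID.

I have already read in a small masked dataset (data) that uses only a select few IDs. I now want to add to this dataset using the information for those selected IDs from one column (a) of the second HDF5 file (s_data).

At the moment I have to read the entire second HDF5 file and select the matching IDs like this: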

for i in range(len(data['ids'])):
    print(i)
    data['a'][i] = s_data['a'][s_data['ids'] == data['ids'][i]]

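With 190 million IDs, this takes a very long time. Is there a simpler way to do the matching? I was thinking of a pandas-style join, but I can't find a way to make that work with h5py datasets.

Many thanks!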

Answer 1

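Have you considered PyTables? It is another Python package for reading HDF5 files. It has fast search algorithms based on OPSI (Optimized Partially Sorted Indexes). Using the .read_where() method with a search condition will simplify the search process, and should be faster than h5py.

Your question is similar to one I answered last week about finding duplicates. You can read my answer here: Pytables duplicates 2.5 giga rows

Before searching, I would get an array of the unique values from the 'ids' field of 'data', and use them in the .read_where() condition to search 'sdata'. If I understand your process and data, the code would look like this: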

import numpy as np
import tables as tb

# Open both HDF5 files read-only
h5f1 = tb.open_file('yourfile1.h5', 'r')
h5f2 = tb.open_file('yourfile2.h5', 'r')
# Define the data and sdata datasets:
data  = h5f1.root.data
sdata = h5f2.root.sdata

# Step 1: Get a NumPy array of the 'ids' field/column from the data dataset:
ids_arr = data.read(field='ids')
# Step 2: Get a new array with unique values only:
uids_arr = np.unique(ids_arr)

# Or, combine steps 1 and 2 into one line:
uids_arr = np.unique(data.read(field='ids'))

# Step 3a: Loop over the unique id values
for id_test in uids_arr:
    # Step 3b: Get an array with all rows that match this id value;
    #          only the values in field 'a' are returned
    match_row_arr = sdata.read_where('ids==id_test', field='a')

# Close both files when done
h5f1.close()
h5f2.close()
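
If you would rather stay with h5py and NumPy, one possible alternative (not part of the approach above) is a sorted lookup, which is roughly what a pandas-style join does internally. This is only a minimal sketch. It assumes the 'ids' and 'a' columns of s_data fit in memory and that each ID occurs once in s_data. The function name lookup_by_id and the small arrays are hypothetical; they stand in for arrays you would read from your files.

```python
import numpy as np

def lookup_by_id(query_ids, ref_ids, ref_values, fill_value=np.nan):
    """For each id in query_ids, return the value paired with that id in ref_values.

    Hypothetical helper: assumes ref_ids is non-empty and holds unique ids.
    Ids that are not found get fill_value.
    """
    # Sort the reference ids once so every query becomes a binary search
    order = np.argsort(ref_ids, kind="stable")
    sorted_ids = ref_ids[order]

    # Position where each query id would be inserted in the sorted ids
    pos = np.searchsorted(sorted_ids, query_ids)
    # Clip so ids larger than every reference id do not index out of range
    pos = np.minimum(pos, len(sorted_ids) - 1)

    # A query id is a real match only if the id at that position is equal
    found = sorted_ids[pos] == query_ids

    out = np.full(len(query_ids), fill_value, dtype=float)
    out[found] = ref_values[order[pos[found]]]
    return out

# Small in-memory stand-ins for the columns read from the two files
s_ids = np.array([40, 10, 30, 20, 50])      # stand-in for the 'ids' column of s_data
s_a = np.array([4.0, 1.0, 3.0, 2.0, 5.0])   # stand-in for the 'a' column of s_data
data_ids = np.array([30, 10, 99])           # stand-in for the 'ids' column of data

result = lookup_by_id(data_ids, s_ids, s_a)
print(result)
```

This replaces one full scan of s_data per ID with a single sort followed by one binary search per ID. If the columns do not fit in memory, the PyTables approach above is probably the better fit.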
