How do I extract data from an HDF file?

Question

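I have an HDF file (.h5) with the following hierarchy:

I have never worked with this kind of data. I think "barcodes" should be my columns, "features/name" should be my rows, and "data" should be the values. What is the best way to extract this in Python as a barcodes × names data frame?

I tried this: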

    import h5py
    import numpy as np

    filename = "filtered_feature_bc_matrix.h5"

    with h5py.File(filename, "r") as f:
        # Read inside the `with` block, while the file is still open.
        data = f.get('matrix/data')
        dataset = np.array(data)
        print(dataset)

But it only gives me a data array like [1 1 1 ... 2 1 2].

Answer 1

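It depends on how the data is structured in the file, but the idea is similar to a dictionary: you need to use the right key to access a particular array (there may be several). So most likely: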

import h5py
import numpy as np

filename = "filtered_feature_bc_matrix.h5"

with h5py.File(filename, "r") as f:
    data = f['matrix']['data']
    dataset = np.array(data)
    print(dataset)

Answer 2

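import h5py

f = h5py.File('filtered_feature_bc_matrix.h5', 'r')
print(list(f.keys()))            # top-level groups, e.g. ['matrix']
print(list(f['matrix'].keys()))  # datasets and groups inside /matrix
data = f['matrix']['data'][:]    # 'data' lives under /matrix, not at the root
f.close()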

Answer 3

Given datasets named indptr and indices, I think matrix should be a scipy.sparse matrix.

To load this file, you will need the h5py documentation and the scipy.sparse.csr_matrix documentation:

csr_matrix((data, indices, indptr), [shape=(M, N)])
    is the standard CSR representation where the column indices for
    row i are stored in ``indices[indptr[i]:indptr[i+1]]`` and their
    corresponding values are stored in ``data[indptr[i]:indptr[i+1]]``.
    If the shape parameter is not supplied, the matrix dimensions
    are inferred from the index arrays.
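
To make the quoted description concrete, here is a minimal pure-Python sketch that expands the three CSR arrays into a dense list of lists. The helper name `csr_to_dense` and the toy arrays are made up for illustration; in practice scipy.sparse does this work for you.

```python
def csr_to_dense(data, indices, indptr, shape):
    """Expand CSR arrays into a dense list of lists (illustration only)."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Entries of row i live in the slice indptr[i]:indptr[i+1].
        start, end = indptr[i], indptr[i + 1]
        for col, value in zip(indices[start:end], data[start:end]):
            dense[i][col] = value
    return dense


# Toy 3x4 matrix with four non-zero entries (hypothetical values).
data = [1, 2, 1, 3]
indices = [0, 3, 1, 2]
indptr = [0, 2, 2, 4]   # row 0 has 2 entries, row 1 has none, row 2 has 2

dense = csr_to_dense(data, indices, indptr, (3, 4))
for row in dense:
    print(row)
```

This is also why the attempt in the question printed something like [1 1 1 ... 2 1 2]: `data` on its own holds only the non-zero values, without their positions.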

Roughly:

import h5py
from scipy import sparse

f = h5py.File(...)
gp = f['matrix']
data = gp['data'][:]
indptr = gp['indptr'][:]
indices = gp['indices'][:]
shape = gp['shape'][:]

M = sparse.csr_matrix((data, indices, indptr), shape=shape)

M.toarray()   # dense NumPy array equivalent (don't do this if `shape` is large)

data, indptr, and so on should all be NumPy arrays. shape will also be an array, probably of 2 numbers, e.g. np.array([1000, 2000]).
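
To get from there to the barcodes × names data frame the question asks for, something along these lines may work. It is a sketch under assumptions, not a tested loader for your file: the function name `load_barcode_by_name_frame` is hypothetical, the dataset paths are taken from the layout described in this thread, and the orientation (CSR or CSC, and which axis holds the barcodes) is guessed from the array lengths rather than known in advance.

```python
import h5py
import pandas as pd
from scipy import sparse


def _decode(values):
    """Turn an array of bytes (as h5py often returns) into Python strings."""
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def load_barcode_by_name_frame(path):
    """Return a sparse-backed DataFrame with barcodes as rows, names as columns."""
    with h5py.File(path, "r") as f:
        gp = f["matrix"]
        data = gp["data"][:]
        indices = gp["indices"][:]
        indptr = gp["indptr"][:]
        shape = tuple(int(n) for n in gp["shape"][:])
        barcodes = _decode(gp["barcodes"][:])
        names = _decode(gp["features"]["name"][:])

    # indptr has one more entry than the compressed axis, so its length
    # hints at whether the arrays are row-compressed (CSR) or
    # column-compressed (CSC). For a square matrix this check is ambiguous.
    if len(indptr) == shape[0] + 1:
        mat = sparse.csr_matrix((data, indices, indptr), shape=shape)
    elif len(indptr) == shape[1] + 1:
        mat = sparse.csc_matrix((data, indices, indptr), shape=shape)
    else:
        raise ValueError("indptr length matches neither dimension of shape")

    # Assumption: one axis is barcodes and the other is feature names.
    if shape == (len(names), len(barcodes)):
        mat = mat.T  # flip so that rows are barcodes
    elif shape != (len(barcodes), len(names)):
        raise ValueError("shape does not match barcodes/features lengths")

    return pd.DataFrame.sparse.from_spmatrix(mat, index=barcodes, columns=names)
```

Calling `load_barcode_by_name_frame("filtered_feature_bc_matrix.h5")` keeps the values sparse, which matters when `shape` is large; transpose the result with `.T` if you prefer barcodes as columns.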

Answer 4

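I'll start with an introduction to the HDF5 data schema/objects. You need to understand this to work with HDF5 data effectively. There are 2 basic entities: 1) groups and 2) datasets. They are analogous to folders and files on your computer: groups are like folders, and datasets are like files that hold the data. (Note: neither of these is a column or row name; you get that information by querying each dataset.)

Groups use a folder icon in the image of your file layout. You have 2 groups. They are: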

/matrix
/matrix/features

Datasets have a different icon. You have 10 datasets, 5 under each group. They are:

/matrix/barcodes
/matrix/data
/matrix/indices
/matrix/indptr
/matrix/shape

/matrix/features/_all_tag_keys
/matrix/features/feature_name
/matrix/features/genome
/matrix/features/id
/matrix/features/name
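
Because groups behave like folders, a dataset can be reached either step by step or with a single slash-separated path. The sketch below builds a tiny in-memory file that mimics part of the layout above (the file name and contents are made up) and reads one dataset both ways.

```python
import h5py
import numpy as np

# In-memory stand-in for the real file; nothing is written to disk.
with h5py.File("toy.h5", "w", driver="core", backing_store=False) as f:
    features = f.create_group("matrix/features")   # also creates /matrix
    features.create_dataset("name", data=np.array([b"g1", b"g2"]))
    f["matrix"].create_dataset("shape", data=np.array([2, 3]))

    # Group members are listed like the contents of a folder.
    members = sorted(f["matrix"].keys())

    # Two equivalent ways to reach the same dataset.
    by_steps = f["matrix"]["features"]["name"][:]
    by_path = f["/matrix/features/name"][:]

print(members)
print(by_steps, by_path)
```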

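I created a simple example that interrogates your file and outputs the groups and datasets. It also outputs the dtype and shape of each dataset. This should help you "see" in practice what you have and compare it with your image.

dtype is the data type (int, float, string) plus the field/column names, if defined.
shape depends on how the dataset is defined. For a simple NumPy array, it is the size in each dimension. For mixed data types (record arrays), it is the number of rows.

Code snippet below: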

import h5py

def visitor_func(name, node):
    if isinstance(node, h5py.Group):
        print(node.name, 'is a Group')
    elif isinstance(node, h5py.Dataset):
        if node.dtype == 'object':
            print(node.name, 'is an Object Dataset')
        else:
            print(node.name, 'is a Dataset')
            print('Dataset dtype=', node.dtype)
            print('Dataset shape=', node.shape)
    else:
        print('Node is unknown type: ', node.name)

print('testing hdf5 file')
with h5py.File('filtered_feature_bc_matrix.h5', 'r') as h5f:
    h5f.visititems(visitor_func)
