从 HDF5 访问数据 - 切片/提取数据

Question

我正在尝试定位 HDF5 文件中包含的大矩阵中的特定行。我有一个 .txt 文件，其中包含 HDF5 文件中存在的感兴趣的 id，并希望输出这些行的相应数据 - 所有相应的数据都是数字。

我编写了以下代码，但输出仅包含 ids（单列）。我需要附加到这些行的剩余数据（后续列中的数据）。任何建议将不胜感激！

import os 
import h5py


mydir = os.path.expanduser("~/Desktop/alexs-stuff/")

in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"

wanted = set()

with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

hdf5_file = h5py.File(in_file, "r")

count = 0

with open(out_file, "w") as h:
    for keys in hdf5_file["observation"]["ids"]:
        if keys in wanted:
            count = count + 1
            h.write(keys + "\n")

print "Converted %i records" % count 

hdf5_file.close()

如果有帮助，这里是 hdf5 文件的结构：

<HDF5 file "full_emp_table_hdf5.h5" (mode r)> (File) /
 sample     /sample (Group) /sample
     metadata     /sample/metadata (Group) /sample/metadata
     group-metadata     /sample/group-metadata (Group) /sample/group-metadata
     ids     /sample/ids (Dataset) /sample/ids     len = (15481,) object
     matrix     /sample/matrix (Group) /sample/matrix
         indices     /sample/matrix/indices (Dataset) /sample/matrix/indices     len = (107439386,) int32
         indptr     /sample/matrix/indptr (Dataset) /sample/matrix/indptr     len = (15482,) int32
         data     /sample/matrix/data (Dataset) /sample/matrix/data     len = (107439386,) float64
 observation     /observation (Group) /observation
     metadata     /observation/metadata (Group) /observation/metadata
         taxonomy     /observation/metadata/taxonomy (Dataset) /observation/metadata/taxonomy     len = (5594412, 7) object
     group-metadata     /observation/group-metadata (Group) /observation/group-metadata
     ids     /observation/ids (Dataset) /observation/ids     len = (5594412,) object
     matrix     /observation/matrix (Group) /observation/matrix
         indices     /observation/matrix/indices (Dataset) /observation/matrix/indices     len = (107439386,) int32
         indptr     /observation/matrix/indptr (Dataset) /observation/matrix/indptr     len = (5594413,) int32
         data     /observation/matrix/data (Dataset) /observation/matrix/data     len = (107439386,) float64

附加信息：

type(hdf5_file['observation']['ids'])
>>> <class 'h5py._hl.dataset.Dataset'>


dir(hdf5_file['observation']['ids'])
>>> ['__array__', '__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_d', '_dcpl', '_e', '_filters', '_id', '_lapl', '_lcpl', '_local', 'astype', 'attrs', 'chunks', 'compression', 'compression_opts', 'dims', 'dtype', 'file', 'fillvalue', 'fletcher32', 'id', 'len', 'maxshape', 'name', 'parent', 'read_direct', 'ref', 'regionref', 'resize', 'scaleoffset', 'shape', 'shuffle', 'size', 'value', 'write_direct']

Answer 1

我认为这个

hdf5

文件是由

biom-format

软件

http://biom-format.org

生成的，来自其

pypi

包：https://pypi.python.org/packages/source/b/biom-format/biom-格式-2.1.2.tar.gz

我可以提取样本

hdf5

文件：

examples/min_sparse_otu_table_hdf5.biom

，并使用以下方法处理它：

import numpy as np
from scipy import sparse as sp
import h5py

f = h5py.File('min_sparse_otu_table_hdf5.biom','r')
f.visit(print)  # to see its structure
"""
observation
observation/group-metadata
observation/ids
observation/matrix
observation/matrix/data
observation/matrix/indices
observation/matrix/indptr
observation/metadata
sample
sample/group-metadata
sample/ids
sample/matrix
sample/matrix/data
sample/matrix/indices
sample/matrix/indptr
sample/metadata
"""

或更详细：

f.visititems(lambda name,obj:print(name, obj))
"""
...
sample <HDF5 group "/sample" (4 members)>
sample/group-metadata <HDF5 group "/sample/group-metadata" (0 members)>
sample/ids <HDF5 dataset "ids": shape (6,), type "|O4">
sample/matrix <HDF5 group "/sample/matrix" (3 members)>
sample/matrix/data <HDF5 dataset "data": shape (15,), type "<f8">
sample/matrix/indices <HDF5 dataset "indices": shape (15,), type "<i4">
sample/matrix/indptr <HDF5 dataset "indptr": shape (7,), type "<i4">
sample/metadata <HDF5 group "/sample/metadata" (0 members)>
"""

这与您的

full_emp_table_hdf5.h5

具有相似的结构

使用

.value

（或

[:]

）加载数组

sample_ids = f['sample/ids'].value
"""
array([b'Sample1', b'Sample2', b'Sample3', b'Sample4', b'Sample5',
       b'Sample6'], dtype=object)
"""

# create a sparse array from the sample/matrix group
data = f['sample/matrix/data'].value
indices = f['sample/matrix/indices'].value
indptr = f['sample/matrix/indptr'].value
sample = sp.csr_matrix((data,indices,indptr))
# display as a dense array
print(sample.A)

""" 
array([[ 0.,  5.,  0.,  2.,  0.],
       [ 0.,  1.,  0.,  1.,  1.],
       [ 1.,  0.,  1.,  1.,  1.],
       [ 0.,  2.,  4.,  0.,  0.],
       [ 0.,  3.,  0.,  0.,  0.],
       [ 0.,  1.,  2.,  1.,  0.]])
"""

# select a specific row
print(sample[2,:])
"""
  (0, 0)    1.0
  (0, 2)    1.0
  (0, 3)    1.0
  (0, 4)    1.0
"""
print(sample[2,:].A)  # that row in dense format
# array([[ 1.,  0.,  1.,  1.,  1.]])

我可以将

sample

写入文本文件：

np.savetxt('min_sample.txt',sample.A, '%.2f')

"""
0.00 5.00 0.00 2.00 0.00
0.00 1.00 0.00 1.00 1.00
1.00 0.00 1.00 1.00 1.00
0.00 2.00 4.00 0.00 0.00
0.00 3.00 0.00 0.00 0.00
0.00 1.00 2.00 1.00 0.00
"""

biom

模块可能可以完成所有这些以及更多功能，但基础知识只需要

numpy

和

sparse

模块。

Answer 2

更改线路：

h.write(keys + "\n")

对此：

h.write(hdf5_file['observation']['ids'][keys] + "\n")

Answer 3

我没有使用 python 找到简单的答案，但如果其他人遇到类似的问题，请使用：

$biom subset-table -i input_file.biom -a observation -s ids_wanted.txt -o biom

然后您可以转换它：

$biom convert -i outputfilefromabove.biom -o outputfilefromabove.txt --to_tsv

此方法更简单，并且利用了 biom 的丰富资源！

Answer 4

这个错误发生在我身上，因为我试图索引到

h5py.Dataset

。

发生这种情况通常有以下两个原因之一：

您认为您的物体是
```
h5py.Group
```
，而实际上它是
```
h5py.Dataset
```
。您可以通过在索引之前打印您尝试索引的内容的
```
type
```
来检查这一点，以确保它是
```
h5py.Group
```
而不是
```
h5py.Dataset
```
。（即在执行
```
print(type(thing))
```
之前先执行
```
thing["key"]
```
。您希望它说
```
h5py.Group
```
，而不是
```
h5py.Dataset
```
。）
您认为您的对象是一个 numpy 数组，但它是一个数据集。在这种情况下，您需要在
```
dataset[:]
```
上使用
```
h5py.Dataset
```
来获取其 numpy 数组，或使用
```
dataset[()]
```
来获取其标量值。

从 HDF5 访问数据 - 切片/提取数据

问题描述投票：0回答：4

4个回答

最新问题

从 HDF5 访问数据 - 切片/提取数据

问题描述 投票：0回答：4

4个回答

最新问题

问题描述投票：0回答：4