How to define individual data types for each HDF5 column using h5py

Question

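I have looked at several solutions but don't understand how to apply them to multi-dimensional arrays. To be precise, my code produces an array that is bigger than it should be, as shown below: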

import h5py
import pandas as pd
import numpy as np

data = [[1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861], [1583663558450195, -7.063664436340332, -6.2776079177856445, -4.206898212432861, -4.206898212432861]]

df = pd.DataFrame(data)

hf = h5py.File('dtype.h5', 'w')

dataTypes = np.dtype([('ts', 'u8'), ('x', 'f4'), ('y', 'f4'), ('z', 'f4'), ('temp', 'f4')])
ds = hf.create_dataset('Acceleration', data=df.astype(dataTypes))

I want it like this, with one uint64 column and four float32 columns:

                 ts         x         y         z      temp
0  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
1  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
2  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
3  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
4  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
5  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
6  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
7  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
8  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
9  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898

Answer 1

Your df:

In [370]: df                                                                                   
Out[370]: 
                  0         1         2         3         4
0  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
1  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
2  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
3  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
...

df.astype(dataTypes) gives me a TypeError (my pandas is not the latest version).

In [373]: df.to_records()                                                                      
Out[373]: 
rec.array([(0, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (1, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (2, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (3, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (4, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (5, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (6, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (7, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (8, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821),
           (9, 1583663558450195, -7.06366444, -6.27760792, -4.20689821, -4.20689821)],
          dtype=[('index', '<i8'), ('0', '<i8'), ('1', '<f8'), ('2', '<f8'), ('3', '<f8'), ('4', '<f8')])

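This array should save with h5py.

to_records takes parameters that may get you closer to your dataTypes. I'll let you explore those.

But with the recent reworking of recfunctions, we can make a structured array with: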
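
As a starting point for that exploration, here is a minimal sketch using the index and column_dtypes parameters of DataFrame.to_records. It assumes a pandas version recent enough to have column_dtypes, and it assumes the columns have first been named after the fields in dataTypes; the ten-row frame is rebuilt from one repeated row of the question's data.

```python
import numpy as np
import pandas as pd

# One row of the question's data, repeated ten times.
row = [1583663558450195, -7.063664436340332, -6.2776079177856445,
       -4.206898212432861, -4.206898212432861]
# Assumption: name the columns after the fields wanted in the HDF5 dataset.
df = pd.DataFrame([row] * 10, columns=['ts', 'x', 'y', 'z', 'temp'])

# index=False leaves out the 'index' field; column_dtypes casts each
# column to the requested type while the record array is built.
rec = df.to_records(
    index=False,
    column_dtypes={'ts': 'u8', 'x': 'f4', 'y': 'f4', 'z': 'f4', 'temp': 'f4'},
)

print(rec.dtype)
print(rec.shape)
```

The result is a one-dimensional record array with one record per DataFrame row, which is the layout h5py stores as a compound dataset.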

In [385]: import numpy.lib.recfunctions as rf                                                  
In [386]: rf.unstructured_to_structured(np.array(data), dataTypes)                             
Out[386]: 
array([(1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898),
       (1583663558450195, -7.0636644, -6.277608, -4.206898, -4.206898)],
      dtype=[('ts', '<u8'), ('x', '<f4'), ('y', '<f4'), ('z', '<f4'), ('temp', '<f4')])

np.array(data) is a (10, 5) float array.

In [388]: pd.DataFrame(_386)                                                                   
Out[388]: 
                 ts         x         y         z      temp
0  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
1  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
2  1583663558450195 -7.063664 -6.277608 -4.206898 -4.206898
 ...
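
To close the loop, the structured array can be written with h5py and read back to confirm the per-column types. This is only a sketch: the temporary file location and the variable names (struct_arr, stored_dtype and so on) are choices made for this example, and the data is one repeated row from the question.

```python
import os
import tempfile

import h5py
import numpy as np
import numpy.lib.recfunctions as rf

row = [1583663558450195, -7.063664436340332, -6.2776079177856445,
       -4.206898212432861, -4.206898212432861]
data = [row] * 10
dataTypes = np.dtype([('ts', 'u8'), ('x', 'f4'), ('y', 'f4'),
                      ('z', 'f4'), ('temp', 'f4')])

# Convert the (10, 5) float array into a 1-D structured array;
# each column is cast by value to the matching field type.
struct_arr = rf.unstructured_to_structured(np.array(data), dataTypes)

# Throwaway file location chosen for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'dtype.h5')
with h5py.File(path, 'w') as hf:
    hf.create_dataset('Acceleration', data=struct_arr)

# Read the dataset back and inspect what was stored.
with h5py.File(path, 'r') as hf:
    ds = hf['Acceleration']
    stored_shape = ds.shape
    stored_dtype = ds.dtype
    first_ts = int(ds[0]['ts'])

print(stored_shape)
print(stored_dtype)
print(first_ts)
```

The timestamp survives the trip through float64 because it is smaller than 2**53, the largest range in which float64 represents every integer exactly.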

Answer 2

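This problem is trickier than it first appears. Initially, I thought I could apply the same approach as in my answer to your previous question, SO 60562311: define individual datatypes for each column. However, there are some subtle differences:

This data is a list of lists, versus a list of 5x5 NumPy arrays
This data is of mixed type (integers and floats), versus all floats
This data has much larger numbers than the previous example

How does this change the procedure?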

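The list of lists can be converted to a NumPy array with np.array(data). However, that does not completely solve the problem: you will still get duplicated columns.
You also need to change the field types in the dtype declaration. f4 becomes f8, which keeps the full precision of the float64 values, and ts needs a 64-bit unsigned integer (u8, i.e. uint64); uint16 tops out at 65535 and cannot hold these timestamps.

Make these changes and the rest works much as in my previous answer. See the update to your original code below.

dataTypes = np.dtype([('ts', 'u8'), ('x', 'f8'),
            ('y', 'f8'), ('z', 'f8'), ('temp', 'f8')])
# create a (10, 5) float64 array from the list of lists
d_arr = np.array(data)
# create the record array one field per column, so each value is converted;
# np.rec.array(d_arr, dtype=dataTypes) would reinterpret the float64 bytes instead
rec_arr = np.rec.fromarrays(d_arr.T, dtype=dataTypes)
with h5py.File('dtype.h5', 'w') as hf:
    ds = hf.create_dataset('Acceleration', data=rec_arr)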
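
One thing worth checking when a record array is built from a plain float array is whether the values are converted or the existing bytes are merely relabeled. The NumPy-only sketch below contrasts the two. It assumes 8-byte fields throughout so that a byte-level view is possible at all, and the names wide, viewed and converted are illustrative.

```python
import numpy as np

row = [1583663558450195, -7.063664436340332, -6.2776079177856445,
       -4.206898212432861, -4.206898212432861]
d_arr = np.array([row] * 10)  # shape (10, 5), dtype float64

# Every field is 8 bytes wide, matching the float64 items in d_arr.
wide = np.dtype([('ts', 'u8'), ('x', 'f8'), ('y', 'f8'),
                 ('z', 'f8'), ('temp', 'f8')])

# Reinterpreting: the same bytes under new labels. The float64 bit
# pattern of the timestamp is read as if it were an unsigned integer.
viewed = d_arr.view(wide)

# Converting: a new array filled one field per column, each value cast.
converted = np.rec.fromarrays(d_arr.T, dtype=wide)

print(viewed.shape, converted.shape)
print(int(viewed['ts'][0, 0]) == 1583663558450195)
print(int(converted['ts'][0]) == 1583663558450195)
```

The float fields look the same either way, so a wrong ts value is the symptom to look for after reading the dataset back.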
