How do I create and append to an HDF5 table using pandas when cells contain numpy arrays?

Question

I want to create an h5 database with 3 columns: x: np.array, y: int32, z: np.array. But I keep getting strange errors.

import numpy as np
import pandas as pd

store = pd.HDFStore('store.h5')
df = pd.DataFrame(columns=['x', 'y', 'z'])
df['y'] = df['y'].astype(np.int32)

for i in range(10):
    arr = np.random.randn(3, 3)
    df = df.append({'x': arr, 'y': 1, 'z': arr}, ignore_index=True)

store.append('df', df)

This gives the error:

TypeError: object of type 'int' has no len()

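It has nothing to do with the integer y column, because I tried this with 3 array columns and got the same error. I have gone through the documentation many times but still can't figure out what I'm doing wrong. I hope one of you kind people can help.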

Answer 1

Writing your simple column (twice, for fun) in table format:

In [514]: df[['y','y']].to_hdf('pdtest.h5','df', format='table')
In [515]: df1 = pd.read_hdf('pdtest.h5')
In [516]: df1
Out[516]: 
   y  y
0  1  1
1  1  1
2  1  1
3  1  1
4  1  1
5  1  1
6  1  1
7  1  1
8  1  1
9  1  1

Looking at it with h5py:
In [517]: f=h5py.File('pdtest.h5','r')
In [518]: f.keys()                                                                       
Out[518]: <KeysViewHDF5 ['df']>
In [519]: f['df'].keys()                                                                 
Out[519]: <KeysViewHDF5 ['_i_table', 'table']>
In [521]: f['df/table']                                                                  
Out[521]: <HDF5 dataset "table": shape (10,), type "|V24">
In [522]: f['df/table'][:]                                                               
Out[522]: 
array([(0, [1, 1]), (1, [1, 1]), (2, [1, 1]), (3, [1, 1]), (4, [1, 1]),
       (5, [1, 1]), (6, [1, 1]), (7, [1, 1]), (8, [1, 1]), (9, [1, 1])],
      dtype=[('index', '<i8'), ('values_block_0', '<i8', (2,))])
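
To make the structured layout concrete, here is a minimal numpy-only sketch that rebuilds an array with the same dtype h5py reported above. It does not touch the HDF5 file; the variable names `table_dtype`, `table`, and `block` are illustrative and not part of pandas or h5py.

```python
import numpy as np

# Same field layout that h5py reported for 'df/table' above:
# an int64 index plus a block of two int64 values per row.
table_dtype = np.dtype([("index", "<i8"), ("values_block_0", "<i8", (2,))])

# Ten rows, mirroring the ten rows of the dataframe.
table = np.zeros(10, dtype=table_dtype)
table["index"] = np.arange(10)
table["values_block_0"] = 1      # both 'y' columns hold 1 in every row

# Selecting a field returns an ordinary ndarray view.
block = table["values_block_0"]  # shape (10, 2)

# 8 bytes for the index + 2 * 8 bytes for the block = 24 bytes per row,
# which matches the "|V24" type shown by h5py.
print(table.dtype.itemsize)
print(table[:3])
```

Every row has the same fixed size, which is why numeric columns fit this layout while arbitrary Python objects do not.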

It has saved the 'table' as a numpy structured array. In fixed format:
In [525]: df[['y']].to_hdf('pdtest.h5','df', format='fixed')                             
In [526]: df1 = pd.read_hdf('pdtest.h5')      
In [528]: f=h5py.File('pdtest.h5','r')                                                   
In [529]: f.keys()                                                                       
Out[529]: <KeysViewHDF5 ['df']>
In [530]: f['df'].keys()                                                                 
Out[530]: <KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']>

Or, going back to saving the whole dataframe:
In [539]: df.to_hdf('pdtest.h5','df')
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:2505: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['x', 'z'], dtype='object')]

  encoding=encoding,
In [540]: f.close()                                                                      
In [541]: f=h5py.File('pdtest.h5','r')                                                   
In [542]: f.keys()                                                                       
Out[542]: <KeysViewHDF5 ['df']>
In [543]: f['df'].keys()                                                                 
Out[543]: <KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values', 'block1_items', 'block1_values']>

Evidently the two blocks consist of columns with different storage requirements. The simple column:
In [546]: f['df/block0_items'][:]                                                        
Out[546]: array([b'y'], dtype='|S1')
In [547]: f['df/block0_values'][:]                                                       
Out[547]: 
array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])

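And the array columns, which it deduces hold the same kind of content, are stored as an object dtype array that contains a single uint8 array (the pickled bytes mentioned in the PerformanceWarning):
In [548]: f['df/block1_items'][:]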
Out[548]: array([b'x', b'z'], dtype='|S1')
In [549]: f['df/block1_values'][:]                                                       
Out[549]: 
array([array([128,   4, 149, ..., 148,  98,  46], dtype=uint8)],
      dtype=object)
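
The leading bytes 128, 4, 149 and the trailing 46 in that uint8 array look like a Python pickle stream. The sketch below reproduces that pattern with the standard `pickle` module, assuming protocol 4; it does not read the HDF5 file, and which protocol PyTables actually used is an assumption here.

```python
import pickle

import numpy as np

# An object-dtype array whose cells are numpy arrays, like columns x and z.
arr = np.random.randn(3, 3)
cells = np.empty(2, dtype=object)
cells[0] = arr
cells[1] = arr

# Pickle it (protocol 4 is an assumption) and view the bytes as uint8,
# the same way the stored block appears through h5py.
raw = pickle.dumps(cells, protocol=4)
as_uint8 = np.frombuffer(raw, dtype=np.uint8)

# Protocol 4 streams start with 0x80 0x04 0x95 and end with '.' (0x2e).
print(as_uint8[:3], as_uint8[-1])   # [128   4 149] 46

# Unpickling the bytes recovers the original cells.
restored = pickle.loads(as_uint8.tobytes())
print(np.array_equal(restored[0], arr))
```

An opaque byte string like this can be written once in fixed format, but it gives HDF5 no column structure to extend, so it cannot be appended to as a table.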

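Hopefully this clarifies why pandas cannot save the dataframe in table format. HDF5 has its own storage layout. h5py is a relatively low-level interface to HDF5, with numpy arrays on the Python side (evidently the match is fairly close and transparent). pandas adds another layer on top of that.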
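
One possible workaround, which the answer above does not itself propose, is to flatten each array cell into plain numeric columns so that the frame contains only types that map directly to C types. The sketch below shows only the in-memory flattening and its inverse, assuming every array has the same fixed shape (3, 3); the helper names `flatten_frame` and `restore_cell` and the column names `x_0` ... `z_8` are hypothetical.

```python
import numpy as np
import pandas as pd


def flatten_frame(records, shape=(3, 3)):
    """Build a frame of plain numeric columns from records with array cells."""
    n = int(np.prod(shape))
    rows = []
    for rec in records:
        row = {"y": rec["y"]}
        for name in ("x", "z"):
            flat = np.asarray(rec[name], dtype=np.float64).reshape(n)
            row.update({f"{name}_{k}": flat[k] for k in range(n)})
        rows.append(row)
    out = pd.DataFrame(rows)
    out["y"] = out["y"].astype(np.int32)  # keep y as int32, as in the question
    return out


def restore_cell(flat_df, name, row, shape=(3, 3)):
    """Rebuild one array cell from its flattened columns."""
    n = int(np.prod(shape))
    cols = [f"{name}_{k}" for k in range(n)]
    return flat_df.loc[row, cols].to_numpy(dtype=np.float64).reshape(shape)


# Same data as in the question, collected as a list of dicts.
records = []
for _ in range(10):
    arr = np.random.randn(3, 3)
    records.append({"x": arr, "y": 1, "z": arr})

flat = flatten_frame(records)
print(flat.shape)            # (10, 19): y plus 9 columns each for x and z
print(flat.dtypes.unique())  # no object dtype left
```

A frame like `flat` has no object columns, so it is the kind of input that the question's `store.append('df', df)` call is designed for. If the arrays vary in shape this approach does not apply, and storing them as separate datasets through h5py may be a better fit.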
