How do I create and append to an HDF5 table with pandas when the cells contain numpy arrays?


I want to create an h5 database with 3 columns: x: np.array, y: int32, z: np.array. But I keep getting strange errors.

import numpy as np
import pandas as pd

store = pd.HDFStore('store.h5')
df = pd.DataFrame(columns=['x', 'y', 'z'])
df['y'] = df['y'].astype(np.int32)

for i in range(10):
    arr = np.random.randn(3, 3)
    df = df.append({'x': arr, 'y': 1, 'z': arr}, ignore_index=True)

store.append('df', df)

This gives the error:

TypeError: object of type 'int' has no len()

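It has nothing to do with the integer y column, because I tried this with 3 array columns and got the same error. I have been through the documentation many times, but I still can't figure out what I am doing wrong. I hope one of you kind people can help.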

python pandas numpy hdf5
1 Answer
Writing your simple column (twice, for fun) in table format:

In [514]: df[['y','y']].to_hdf('pdtest.h5','df', format='table')
In [515]: df1 = pd.read_hdf('pdtest.h5')
In [516]: df1
Out[516]:
   y  y
0  1  1
1  1  1
2  1  1
3  1  1
4  1  1
5  1  1
6  1  1
7  1  1
8  1  1
9  1  1

Looking at the file with h5py:

In [517]: f=h5py.File('pdtest.h5','r')
In [518]: f.keys()
Out[518]: <KeysViewHDF5 ['df']>
In [519]: f['df'].keys()
Out[519]: <KeysViewHDF5 ['_i_table', 'table']>
In [521]: f['df/table']
Out[521]: <HDF5 dataset "table": shape (10,), type "|V24">
In [522]: f['df/table'][:]
Out[522]:
array([(0, [1, 1]), (1, [1, 1]), (2, [1, 1]), (3, [1, 1]), (4, [1, 1]),
       (5, [1, 1]), (6, [1, 1]), (7, [1, 1]), (8, [1, 1]), (9, [1, 1])],
      dtype=[('index', '<i8'), ('values_block_0', '<i8', (2,))])

It has saved the table as a numpy structured array.
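The dtype shown in Out[522] can be rebuilt in plain numpy to see what one table row looks like. This is only a minimal sketch: the variable names are illustrative, and the values are copied from the output above rather than read back from the file.

```python
import numpy as np

# Same dtype that h5py reported for 'df/table': an int64 index plus
# one block holding the two int64 'y' columns side by side.
table_dtype = np.dtype([('index', '<i8'), ('values_block_0', '<i8', (2,))])

# Ten rows, each with index i and the values [1, 1], as in Out[522].
rows = np.zeros(10, dtype=table_dtype)
rows['index'] = np.arange(10)
rows['values_block_0'] = 1

print(table_dtype.itemsize)  # bytes per row
print(rows[0])
```

Each row occupies 8 + 2 * 8 = 24 bytes, which matches the "|V24" type that h5py reports. Every row has the same fixed size, so a column whose cells are arbitrary Python objects has no obvious slot in a layout like this.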

In fixed format:

In [525]: df[['y']].to_hdf('pdtest.h5','df', format='fixed')
In [526]: df1 = pd.read_hdf('pdtest.h5')
In [528]: f=h5py.File('pdtest.h5','r')
In [529]: f.keys()
Out[529]: <KeysViewHDF5 ['df']>
In [530]: f['df'].keys()
Out[530]: <KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']>

Or, going back to saving the whole dataframe:

In [539]: df.to_hdf('pdtest.h5','df')
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:2505: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['x', 'z'], dtype='object')]

  encoding=encoding,
In [540]: f.close()
In [541]: f=h5py.File('pdtest.h5','r')
In [542]: f.keys()
Out[542]: <KeysViewHDF5 ['df']>
In [543]: f['df'].keys()
Out[543]: <KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values', 'block1_items', 'block1_values']>

Evidently the two blocks consist of columns with different storage requirements.

The simple column:

In [546]: f['df/block0_items'][:]
Out[546]: array([b'y'], dtype='|S1')
In [547]: f['df/block0_values'][:]
Out[547]:
array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]])

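And the array columns, which it has deduced hold the same kind of content, are stored together as an object-dtype array that contains a single array: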

In [548]: f['df/block1_items'][:]
Out[548]: array([b'x', b'z'], dtype='|S1')
In [549]: f['df/block1_values'][:]
Out[549]: array([array([128, 4, 149, ..., 148, 98, 46], dtype=uint8)], dtype=object)
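The leading bytes 128, 4, 149 and the trailing 46 in Out[549] look like a protocol-4 pickle stream (protocol marker, version, frame opcode, and the final stop opcode), which fits the PerformanceWarning shown earlier. The sketch below only illustrates that byte pattern with the standard pickle module. It does not claim to reproduce exactly what PyTables writes, and the names obj, buf and restored are illustrative.

```python
import pickle
import numpy as np

# An object-dtype array whose cells are 3x3 float arrays, similar in
# spirit to the 'x' and 'z' columns of the question.
obj = np.empty(2, dtype=object)
obj[0] = np.random.randn(3, 3)
obj[1] = np.random.randn(3, 3)

# Pickle it and view the resulting bytes as uint8, like block1_values.
buf = np.frombuffer(pickle.dumps(obj, protocol=4), dtype=np.uint8)
print(buf[:3], buf[-1])

# The byte buffer round-trips back to the original object array.
restored = pickle.loads(buf.tobytes())
print(np.array_equal(restored[0], obj[0]))
```

If the stored block is indeed a pickle, the arrays are opaque bytes as far as HDF5 is concerned, so they cannot be queried or appended to the way table rows can.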

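Hopefully this clarifies why pandas cannot save the dataframe in table format. HDF5 has its own storage layout. h5py is a relatively low-level interface to HDF5, with numpy arrays on the Python side (evidently the match is fairly close and transparent). pandas adds another layer on top of that.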
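Since the table format stored a numpy structured array, one possible way around the object-column problem is to give the array cells a fixed shape inside the record dtype. This is a sketch under assumptions that the question only implies: every x and z is a 3x3 float64 array, and the name record_dtype and the helper make_rows are hypothetical. It uses only numpy. Writing such an array to an HDF5 dataset, for example with h5py, is a separate step that is not shown here.

```python
import numpy as np

# Hypothetical record layout: two fixed-shape 3x3 float arrays and an int32.
record_dtype = np.dtype([('x', '<f8', (3, 3)), ('y', '<i4'), ('z', '<f8', (3, 3))])

def make_rows(n):
    """Build n records, mirroring the loop in the question."""
    rows = np.zeros(n, dtype=record_dtype)
    rows['x'] = np.random.randn(n, 3, 3)
    rows['y'] = 1
    rows['z'] = rows['x']  # the question stores the same array in x and z
    return rows

table = make_rows(10)

# Appending in memory is a concatenation of arrays with the same dtype.
table = np.concatenate([table, make_rows(5)])
print(table.shape, table.dtype.itemsize)
```

Each record then has a fixed size of 148 bytes (72 + 4 + 72), so the whole table is a single array of uniform rows with no object dtype involved. The cost is that all arrays in a column must share one shape.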