I want to create an h5 database with 3 columns: x (np.array), y (int32), and z (np.array). But I keep getting strange errors.
import numpy as np
import pandas as pd

store = pd.HDFStore('store.h5')
df = pd.DataFrame(columns=['x', 'y', 'z'])
df['y'] = df['y'].astype(np.int32)
for i in range(10):
    arr = np.random.randn(3, 3)
    # DataFrame.append needs pandas < 2.0 (it was removed in 2.0)
    df = df.append({'x': arr, 'y': 1, 'z': arr}, ignore_index=True)
store.append('df', df)
This gives the error:
TypeError: object of type 'int' has no len()
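It has nothing to do with the integer y column, because I tried this with 3 array columns and got the same error. I have been through the documentation many times, but I still can't figure out what I'm doing wrong. Hopefully one of you kind people can help.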
In table format, using only the y column:
In [514]: df[['y','y']].to_hdf('pdtest.h5','df', format='table')
In [515]: df1 = pd.read_hdf('pdtest.h5')
In [516]: df1
Out[516]:
   y  y
0  1  1
1  1  1
2  1  1
3  1  1
4  1  1
5  1  1
6  1  1
7  1  1
8  1  1
9  1  1
Looking at the file with h5py:
In [517]: f=h5py.File('pdtest.h5','r')
In [518]: f.keys()
Out[518]: <KeysViewHDF5 ['df']>
In [519]: f['df'].keys()
Out[519]: <KeysViewHDF5 ['_i_table', 'table']>
In [521]: f['df/table']
Out[521]: <HDF5 dataset "table": shape (10,), type "|V24">
In [522]: f['df/table'][:]
Out[522]:
array([(0, [1, 1]), (1, [1, 1]), (2, [1, 1]), (3, [1, 1]), (4, [1, 1]),
       (5, [1, 1]), (6, [1, 1]), (7, [1, 1]), (8, [1, 1]), (9, [1, 1])],
      dtype=[('index', '<i8'), ('values_block_0', '<i8', (2,))])
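To make the Out[522] result a bit more concrete, here is a minimal numpy-only sketch that builds a structured array with the same dtype. The variable names are just for illustration; nothing here touches the HDF5 file itself.

```python
import numpy as np

# Same dtype as in Out[522]: an int64 index plus a length-2 int64 block per row.
dt = np.dtype([("index", "<i8"), ("values_block_0", "<i8", (2,))])

table = np.zeros(10, dtype=dt)
table["index"] = np.arange(10)
table["values_block_0"] = 1  # scalar broadcast over the (10, 2) field

# Each record is 8 bytes (index) + 2 * 8 bytes (block) = 24 bytes,
# which is what the "|V24" type in Out[521] refers to.
print(table.dtype.itemsize)
print(table["values_block_0"].shape)
print(table[0])
```

Each numeric column of the frame becomes one slot of the values_block_0 field, so every row needs a fixed size in bytes.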
It has saved the 'table' as a numpy structured array. In fixed format:
In [525]: df[['y']].to_hdf('pdtest.h5','df', format='fixed')
In [526]: df1 = pd.read_hdf('pdtest.h5')
In [528]: f=h5py.File('pdtest.h5','r')
In [529]: f.keys()
Out[529]: <KeysViewHDF5 ['df']>
In [530]: f['df'].keys()
Out[530]: <KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']>
Or going back to saving the whole dataframe:
In [539]: df.to_hdf('pdtest.h5','df')
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py:2505: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->Index(['x', 'z'], dtype='object')]

  encoding=encoding,
In [540]: f.close()
In [541]: f=h5py.File('pdtest.h5','r')
In [542]: f.keys()
Out[542]: <KeysViewHDF5 ['df']>
In [543]: f['df'].keys()
Out[543]: <KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values', 'block1_items', 'block1_values']>
Evidently the two blocks consist of columns with different storage requirements. The simple column:
In [546]: f['df/block0_items'][:]
Out[546]: array([b'y'], dtype='|S1')
In [547]: f['df/block0_values'][:]
Out[547]: array([[1], [1], [1], [1], [1], [1], [1], [1], [1], [1]])
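And the array columns, which it has deduced hold the same kind of content, are stored as an object dtype array containing a single uint8 array (the pickled bytes that the warning mentions):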
In [548]: f['df/block1_items'][:]
Out[548]: array([b'x', b'z'], dtype='|S1')
In [549]: f['df/block1_values'][:]
Out[549]: array([array([128, 4, 149, ..., 148, 98, 46], dtype=uint8)], dtype=object)
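The leading bytes 128, 4, 149 and the trailing 46 in Out[549] look like an ordinary pickle stream (protocol 4). The sketch below does not read the HDF5 file. It only mimics the idea with numpy and the standard pickle module, assuming the stored bytes are a plain pickle of an object array; the variable names are made up for illustration.

```python
import pickle

import numpy as np

arr = np.random.randn(3, 3)

# An object-dtype array holding arrays, similar in spirit to the x and z columns.
block = np.empty(2, dtype=object)
block[0] = arr
block[1] = arr

# Pickle it and view the byte stream as uint8, like the dataset in Out[549].
raw = pickle.dumps(block, protocol=4)
as_uint8 = np.frombuffer(raw, dtype=np.uint8)

# 128 = PROTO opcode, 4 = protocol number, 149 = FRAME opcode, 46 = STOP opcode ('.')
print(as_uint8[:3].tolist(), int(as_uint8[-1]))

# The bytes only become arrays again after unpickling.
restored = pickle.loads(as_uint8.tobytes())
print(np.array_equal(restored[0], arr))
```

The point is that HDF5 sees only an opaque run of bytes here, not a 3x3 float array, so it cannot lay the data out as typed table columns.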
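Hopefully this clarifies why pandas can't save the dataframe in table format. HDF5 has its own storage layout. h5py is a relatively low-level interface to HDF5, with numpy arrays on the Python side (evidently the match is fairly close and transparent). pandas adds another layer on top of that.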
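If the arrays all have the same shape, one possible workaround is to skip the pandas layer and write plain numpy arrays with h5py. This is only a sketch under that assumption: save_arrays and load_arrays are hypothetical helper names, and the dataset names x, y, z simply mirror the columns in the question.

```python
import os
import tempfile

import h5py
import numpy as np


def save_arrays(path, xs, ys, zs):
    """Write three equally long columns as three HDF5 datasets (hypothetical helper)."""
    with h5py.File(path, "w") as f:
        f.create_dataset("x", data=np.stack(xs))                    # shape (n, 3, 3)
        f.create_dataset("y", data=np.asarray(ys, dtype=np.int32))  # shape (n,)
        f.create_dataset("z", data=np.stack(zs))                    # shape (n, 3, 3)


def load_arrays(path):
    """Read the three datasets back as numpy arrays (hypothetical helper)."""
    with h5py.File(path, "r") as f:
        return f["x"][:], f["y"][:], f["z"][:]


rng = np.random.default_rng(0)
xs = [rng.standard_normal((3, 3)) for _ in range(10)]
ys = [1] * 10

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.h5")
    save_arrays(path, xs, ys, xs)
    x, y, z = load_arrays(path)

print(x.shape, y.dtype, z.shape)
```

Stacking gives each dataset a regular shape and a numeric dtype, which HDF5 can store directly without pickling. It gives up the DataFrame index and query features, so whether it fits depends on how the data will be read later.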