How to read a binary file into a Pandas DataFrame using Numpy dtypes?

Question

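I'm trying to drop rows from a DataFrame that was generated by reading a binary file with a Numpy.dtype template. I have tried several ways of dropping a row and keep getting blocked by an error, usually:

TypeError: void() takes at least 1 positional argument (0 given)

Opening the variable explorer in the IDE shows the same error when I try to inspect the column names, which suggests that the way I'm extracting the data is somehow corrupting the column names.

I load the data as follows (the number of variables is shortened here for brevity):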

```
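import numpy as np
import pandas as pd

# Record layout: 22 raw header bytes, then two big-endian unsigned integers.
data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# source_file is the path to the binary file (defined elsewhere).
with open(source_file, 'rb') as f:
    byte_data = f.read()

np_data = np.frombuffer(byte_data, data_template)
df = pd.DataFrame(np_data)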
```
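
For anyone who wants to reproduce the load without the original file, here is a minimal sketch that builds two records in memory and parses them with the same template. The `record_format` string and the sample values are made up for illustration; only `data_template` comes from the code above.

```python
import struct
import numpy as np

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# Hypothetical stand-in for the file contents: two 28-byte records
# (22 raw header bytes, big-endian uint16, big-endian uint32).
record_format = '>22sHI'
byte_data = (struct.pack(record_format, b'\x01' * 22, 7, 500)
             + struct.pack(record_format, b'\x02' * 22, 8, 1500))

# np.frombuffer needs the buffer length to be a whole number of records.
assert len(byte_data) % data_template.itemsize == 0

np_data = np.frombuffer(byte_data, dtype=data_template)
print(np_data['gpssec'])
```

Checking the buffer length against `data_template.itemsize` first is a cheap way to catch a template that does not match the file layout.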

When I try to reduce the DataFrame with something like:

`df = df[df['gpssec'] > 1000]`

I get...

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\frame.py:3798 in __getitem__
      return self._getitem_bool_array(key)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\frame.py:3853 in _getitem_bool_array
      return self._take_with_is_copy(indexer, axis=0)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\generic.py:3902 in _take_with_is_copy
      result = self._take(indices=indices, axis=axis)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\generic.py:3886 in _take
      new_data = self._mgr.take(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:978 in take
      return self.reindex_indexer(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:751 in  reindex_indexer
      new_blocks = [

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:752 in <listcomp>
      blk.take_nd(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\blocks.py:880 in take_nd
      new_values = algos.take_nd(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:117 in take_nd
      return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:134 in _take_nd_ndarray
      dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:582 in _take_preprocess_indexer_and_fill_value
      dtype, fill_value = arr.dtype, arr.dtype.type()

    TypeError: void() takes at least 1 positional argument (0 given)


I've been able to work around the problem by copying each column of relevant data into a blank DataFrame that doesn't have the corrupt headers, but it's a kludgy solution. Not sure this qualifies as a bug as it's very likely it's a user error, but I can't find anything obvious I'm doing wrong.

Answer 1

In [230]: data_template = np.dtype([
     ...:     ('header_a','V22'),
     ...:     ('variable_A','>u2'),
     ...:     ('gpssec','>u4')
     ...:     ])

Make a dummy array from this dtype:

In [231]: arr = np.zeros(4, data_template)
In [232]: arr
Out[232]: 
array([(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
       (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
       (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
       (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0)],
      dtype=[('header_a', 'V22'), ('variable_A', '>u2'), ('gpssec', '>u4')])

We can make a dataframe from it:

In [233]: df = pd.DataFrame(arr)

In [234]: df.describe()
Out[234]: 
       variable_A  gpssec
count         4.0     4.0
mean          0.0     0.0
std           0.0     0.0
min           0.0     0.0
25%           0.0     0.0
50%           0.0     0.0
75%           0.0     0.0
max           0.0     0.0
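
Note that `describe()` reports only the two integer columns and leaves out `header_a`. A quick check on the NumPy side, sketched below, suggests why that field is the odd one out: its dtype kind is `'V'` (void), and calling the void scalar type with no arguments, as the last frame of the traceback in the question does via `arr.dtype.type()`, raises a `TypeError`. This is only a diagnostic sketch, not a statement about how every pandas version behaves.

```python
import numpy as np

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])
arr = np.zeros(4, data_template)

# Collect the fields whose dtype kind is 'V' (void, i.e. raw bytes).
void_fields = [name for name in arr.dtype.names if arr.dtype[name].kind == 'V']
print(void_fields)

# The scalar type of a void field is np.void, which cannot be called
# without arguments; this mirrors the failing call in the traceback.
try:
    arr['header_a'].dtype.type()
except TypeError as exc:
    print('TypeError:', exc)
```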

But display or info raises an error:

In [235]: df.info()
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
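
One possible way around this, offered as a sketch rather than a definitive fix, is to build the DataFrame column by column so that no void-typed column reaches pandas. The helper name `structured_to_frame` is hypothetical. It stores void fields as Python `bytes` objects and copies the numeric fields into native byte order. The sample `gpssec` values are invented for the demonstration.

```python
import numpy as np
import pandas as pd

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

def structured_to_frame(records):
    """Build a DataFrame from a structured array, one plain column per field."""
    columns = {}
    for name in records.dtype.names:
        field = records[name]
        if field.dtype.kind == 'V':
            # Raw void bytes: keep them as Python bytes objects (object column).
            columns[name] = [item.tobytes() for item in field]
        else:
            # Copy into native byte order so pandas gets an ordinary numeric column.
            columns[name] = field.astype(field.dtype.newbyteorder('='))
    return pd.DataFrame(columns)

# Stand-in for np.frombuffer(byte_data, data_template): four hand-made records.
records = np.zeros(4, dtype=data_template)
records['gpssec'] = [500, 1500, 2500, 900]

df = structured_to_frame(records)
filtered = df[df['gpssec'] > 1000]
print(filtered[['variable_A', 'gpssec']])
```

If the header bytes are not needed at all, the `'V'` branch could simply skip the field instead of converting it.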
