How to keep the major order when copying or groupby-aggregating a pandas DataFrame?


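How can I use or manipulate (monkey-patch) pandas so that the resulting objects of copy and groupby aggregations always keep the same major order?

I use pandas.DataFrame as the data structure within a business application (a risk model) and need fast aggregation of multidimensional data. Aggregation speed with pandas depends crucially on the major ordering scheme of the underlying numpy array.

Unfortunately, pandas (version 0.23.4) changes the major order of the underlying numpy array when I create a copy or when I aggregate with groupby and sum.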

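The impact is:

Case 1: 17.2 seconds

Case 2: 5 minutes 46 seconds

on a DataFrame and its copy with 45023 rows and 100000 columns. The aggregation was performed on the index, which is a pd.MultiIndex with 15 levels. The aggregation keeps three levels and leads to about 239 groups.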

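I typically work on DataFrames with 45000 rows and 100000 columns. On the rows I have a pandas.MultiIndex with about 15 levels. To compute statistics on the various hierarchy nodes, I need to aggregate (sum) along the index dimension.

Aggregation is fast if the underlying numpy array is c_contiguous, i.e. held in row-major order (C order). It is very slow if the array is f_contiguous, i.e. held in column-major order (F order).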
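For reference, a minimal numpy sketch of the two layouts. The array values and variable names are illustrative and not taken from the application described here. In numpy's terminology, C order is row-major and F (Fortran) order is column-major; both arrays below hold the same values and differ only in their strides.

```python
import numpy as np

# The same logical 2x3 matrix stored in the two layouts.
c_arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64, order='C')  # row-major
f_arr = np.asfortranarray(c_arr)                                     # column-major copy

print(c_arr.flags.c_contiguous, c_arr.flags.f_contiguous)  # True False
print(f_arr.flags.c_contiguous, f_arr.flags.f_contiguous)  # False True

# Strides are given in bytes: in C order the last axis is contiguous,
# in F order the first axis is contiguous.
print(c_arr.strides)  # (24, 8)
print(f_arr.strides)  # (8, 16)

# Transposing returns a view: the flags swap, no data is moved.
print(c_arr.T.flags.f_contiguous)    # True
print(np.array_equal(c_arr, f_arr))  # True
```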

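Unfortunately, pandas changes the major order from C to F when

  • creating a copy of a DataFrame, and even when

  • performing an aggregation via a groupby and taking the sum on the grouper. Hence, the resulting DataFrame has a different major order (!)

Sure, I could stick to another "data model" and keep the MultiIndex on the columns. Then the current pandas version would always work in my favor. But this is a no-go. I think one can expect that the two operations under consideration (groupby-sum and copy) should not change the major order.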

import numpy as np
import pandas as pd

print("pandas version: ", pd.__version__)

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array.flags
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)

dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)

dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)

aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)


## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  False
# Aggregated DataFrame is C-contiguous:  False

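The major order of the data should be preserved. If pandas wants to switch to an implicit preference, it should allow overriding that preference. Numpy allows specifying the order when creating a copy.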
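As a small illustration of that numpy feature (the input array is made up for this example), `ndarray.copy` takes an `order` argument, and `np.copy` differs from it only in its default:

```python
import numpy as np

source = np.asfortranarray(np.arange(6).reshape(2, 3))  # F-contiguous input

default_copy = source.copy()        # ndarray.copy defaults to order='C'
keep_copy = source.copy(order='K')  # 'K' matches the input layout as closely as possible
f_copy = source.copy(order='F')     # force column-major
np_copy = np.copy(source)           # np.copy defaults to order='K'

print(default_copy.flags.c_contiguous)  # True
print(keep_copy.flags.f_contiguous)     # True
print(f_copy.flags.f_contiguous)        # True
print(np_copy.flags.f_contiguous)       # True
```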

A patched version of pandas should result in

## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  True
# Aggregated DataFrame is C-contiguous:  True

for the example code above.

python pandas performance pandas-groupby column-major-order
1 Answer

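Monkey patch for pandas (0.23.4, and maybe other versions too)

I created a patch that I would like to share. It yields the performance increase mentioned in the question above.

It works for pandas version 0.23.4. For other versions, you need to check whether it still works.

The following two modules are needed. You may have to adapt the imports depending on where you put them.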

memory_layout.py   
memory.py

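To patch your code, import the following at the very beginning of your program or notebook and set the memory layout parameter. This monkey-patches pandas and makes sure that copies of DataFrames have the requested layout.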

from memory_layout import memory_layout
# memory_layout.order = 'F'  # assert F-order on copy
# memory_layout.order = 'K'  # Keep given layout on copy 
memory_layout.order = 'C'  # assert C-order on copy

memory_layout.py

Create the file memory_layout.py with the following content.

import numpy as np
from pandas.core.internals import Block
from memory import memory_layout

# memory_layout.order = 'F'  # set memory layout order to 'F' for np.ndarrays in DataFrame copies (Fortran/column-major order)
# memory_layout.order = 'K'  # keep memory layout order for np.ndarrays in DataFrame copies (order out is order in)
memory_layout.order = 'C'  # set memory layout order to 'C' for np.ndarrays in DataFrame copies (C/row-major order)


def copy(self, deep=True, mgr=None):
    """
    Copy patch on Blocks to set or keep the memory layout
    on copies.

    :param self: `pandas.core.internals.Block`
    :param deep: `bool`
    :param mgr: `BlockManager`
    :return: copy of `pandas.core.internals.Block`
    """
    values = self.values
    if deep:
        if isinstance(values, np.ndarray):
            # Copy such that the transpose of the block values
            # has the requested memory layout.
            values = memory_layout.copy_transposed(values)
        else:
            values = values.copy()
    return self.make_block_same_class(values)


Block.copy = copy  # Block for pandas 0.23.4: in pandas.core.internals.Block
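
Why the patch requests the opposite order can be illustrated with numpy alone. The sketch below rests on an assumption about pandas internals that this post does not confirm: that a block holds its 2-D values transposed relative to what `DataFrame.values` returns. The variable names are illustrative and no pandas code is involved.

```python
import numpy as np

frame_values = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # C-contiguous input
block_values = frame_values.T  # assumed block storage: transposed view, F-contiguous

plain_copy = block_values.copy()             # default order='C'
patched_copy = block_values.copy(order='F')  # the order copy_transposed uses for 'C'

# Transposing back mimics what DataFrame.values would expose.
print(plain_copy.T.flags.c_contiguous)    # False
print(patched_copy.T.flags.c_contiguous)  # True
```

Under that assumption, an unpatched copy flips the layout seen through the transpose, which matches the output shown in the question.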

memory.py

Create the file memory.py with the following content.

"""
Implements MemoryLayout copy factory to change memory layout
of `numpy.ndarrays`.
Depending on the use case, operations on DataFrames can be much
faster if the appropriate memory layout is set and preserved.

The implementation allows for changing the desired layout. Changes apply when
copies or new objects are created, as for example, when slicing or aggregating
via groupby ...

This implementation tries to solve the issue raised on GitHub
https://github.com/pandas-dev/pandas/issues/26502

"""
import numpy as np

_DEFAULT_MEMORY_LAYOUT = 'K'


class MemoryLayout(object):
    """
    Memory layout management for numpy.ndarrays.

    Singleton implementation.

    Example:
    >>> from memory import memory_layout
    >>> memory_layout.order = 'K'
    >>> # K ... keep array layout from input
    >>> # C ... set to C-contiguous / row-major order
    >>> # F ... set to F-contiguous / column-major order
    >>> array = memory_layout.apply(array)
    >>> array = memory_layout.apply(array, 'C')
    >>> array = memory_layout.copy(array)
    >>> array = memory_layout.copy_transposed(array)

    """

    _order = _DEFAULT_MEMORY_LAYOUT
    _instance = None

    @property
    def order(self):
        """
        Return memory layout ordering.

        :return: `str`
        """
        if self.__class__._order is None:
            raise AssertionError("Array layout order not set.")
        return self.__class__._order

    @order.setter
    def order(self, order):
        """
        Set memory layout order.
        Allowed values are 'C', 'F', and 'K'. Raises AssertionError
        when trying to set other values.

        :param order: `str`
        :return: `None`
        """
        assert order in ['C', 'F', 'K'], "Only 'C', 'F' and 'K' supported."
        self.__class__._order = order

    def __new__(cls):
        """
        Create only one instance throughout the lifetime of this process.

        :return: `MemoryLayout` instance as singleton
        """
        if cls._instance is None:
            cls._instance = super(MemoryLayout, cls).__new__(MemoryLayout)
        return cls._instance

    @staticmethod
    def get_from(array):
        """
        Get memory layout from array

        Possible values:
           'C' ... only C-contiguous (row-major order)
           'F' ... only F-contiguous (column-major order)
           'O' ... other: either both C- and F-contiguous, or
           neither C- nor F-contiguous (as on empty arrays).

        :param array: `numpy.ndarray`
        :return: `str`
        """
        if array.flags.c_contiguous == array.flags.f_contiguous:
            return 'O'
        return {True: 'C', False: 'F'}[array.flags.c_contiguous]

    def apply(self, array, order=None):
        """
        Apply the order set or the order given as input on the array
        given as input.

        Possible values:
           'C' ... apply C-contiguous layout (row-major order)
           'F' ... apply F-contiguous layout (column-major order)
           'K' ... keep the given layout

        :param array: `numpy.ndarray`
        :param order: `str`
        :return: `np.ndarray`
        """
        order = self.__class__._order if order is None else order

        if order == 'K':
            return array

        array_order = MemoryLayout.get_from(array)
        if array_order == order:
            return array

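        # np.asarray changes only the memory layout and keeps the logical
        # contents; ravel followed by reshape would permute the values.
        return np.asarray(array, order=order)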

    def copy(self, array, order=None):
        """
        Return a copy of the input array with the memory layout set.
        Layout set:
           'C' ... return C-contiguous copy
           'F' ... return F-contiguous copy
           'K' ... return copy with same layout as
           given by the input array.

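        :param array: `np.ndarray`
        :param order: `str`, optional override of the configured order
        :return: `np.ndarray`
        """
        order = order if order is not None else self.__class__._order
        # ndarray.copy understands 'K' natively; passing the result of
        # get_from would fail for 'O', which is not a valid copy order.
        return array.copy(order=order)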

    def copy_transposed(self, array):
        """
        Return a copy of the input array in order that its transpose
        has the memory layout set.

        Note: transposing in numpy does not reshuffle the data in memory;
        it only swaps the layout between row-major and column-major order.

        Layout set:
           'C' ... return F-contiguous copy
           'F' ... return C-contiguous copy
           'K' ... return copy with the same layout as the input array,
           so that its transpose keeps its layout as well.

        :param array: `np.ndarray`
        :return: `np.ndarray`
        """
        if self.__class__._order == 'K':
            return array.copy(
                order={'C': 'C', 'F': 'F', 'O': None}[self.get_from(array)])
        else:
            return array.copy(
                order={'C': 'F', 'F': 'C'}[self.__class__._order])

    def __str__(self):
        return str(self.__class__._order)


memory_layout = MemoryLayout()  # Singleton