How can I vectorize a for loop over a pandas PeriodIndex that sorts datetimes into the appropriate period?

Question

I have a DataFrame "timeseries" with datetimes as its index, and a PeriodIndex "on":

import numpy as np
import pandas as pd


timeseries = pd.DataFrame(
        index=pd.DatetimeIndex(
            [
                "2000-01-01 12:00:00Z",
                "2000-01-01 13:00:00Z",
                "2000-01-01 14:00:00Z",
                "2000-01-02 13:00:00Z",
                "2000-01-02 18:00:00Z",
                "2000-01-03 14:00:00Z",
                "2000-01-03 20:00:00Z",
                "2000-01-04 13:00:00Z",
            ]
        ),
        data={
            "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
        },
    )
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
    )

I want to create a DataFrame that has "on" as its index and, as data, a list of the datetimes that fall within each period:

                                                       groups
2000-01-01  [2000-01-01 12:00:00+00:00, 2000-01-01 13:00:0...
2000-01-02  [2000-01-02 13:00:00+00:00, 2000-01-02 18:00:0...
2000-01-04                        [2000-01-04 13:00:00+00:00]
2000-01-05                                                 []

So far I have done this with a for loop:

    output_timeseries = pd.DataFrame(index=on, columns=["groups"], data=np.nan)
    for period in on:
        datetimes_in_period = timeseries.index[
            (timeseries.index >= period.start_time.tz_localize("UTC"))
            & (timeseries.index <= period.end_time.tz_localize("UTC"))
        ]
        output_timeseries["groups"].loc[period] = datetimes_in_period

For efficiency I would like to avoid looping in Python. How can I vectorize this code?

Answer 1

Here is my solution:

import pandas as pd


timeseries = pd.DataFrame(
        index=pd.DatetimeIndex(
            [
                "2000-01-01 12:00:00Z",
                "2000-01-01 13:00:00Z",
                "2000-01-01 14:00:00Z",
                "2000-01-02 13:00:00Z",
                "2000-01-02 18:00:00Z",
                "2000-01-03 14:00:00Z",
                "2000-01-03 20:00:00Z",
                "2000-01-04 13:00:00Z",
            ]
        ),
        data={
            "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
        },
    )
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
    )


merge = (pd.merge_asof(timeseries.index.to_frame(),
                    on.to_timestamp().to_frame(),
                    right_index=True, left_index=True)
                    .drop('0_x', axis=1)
                    .reset_index()
                    .rename({'0_y':'date', 'index':'period'}, axis=1)
        )

# extract from `on` any period that has no matching date in `timeseries`
unmatched_periods = on.to_timestamp().difference(merge.date).to_frame()
unmatched_periods[0] = pd.NaT

merge = merge.groupby('date').agg(func=lambda x: list(x))
unmatched_periods.columns = merge.columns
merge = pd.concat((merge, unmatched_periods))
merge

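I have never used PeriodIndex before, so I fell back on converting it to a DatetimeIndex with to_timestamp. From the documentation, PeriodIndex seems to be intended for generating dates/periods programmatically (for example, every X days between two dates), which does not appear to be how it is used here.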
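
As a small illustration of that conversion, the sketch below shows what to_timestamp returns for "on". The tz_localize("UTC") step is an assumption added here, not part of the answer: the index of "timeseries" is timezone-aware, so the converted index may need the same timezone before the two can be compared or merged.

```python
import pandas as pd

on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# to_timestamp() returns the start of each period as a timezone-naive DatetimeIndex
starts = on.to_timestamp()

# Assumption: localize to UTC so the dtype matches a UTC-aware datetime index
starts_utc = starts.tz_localize("UTC")

print(starts)
print(starts_utc)
```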

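In any case, the core of the solution is merge_asof, which works like merge except that it does not require equal keys and instead looks for the nearest key. By default it searches backward, which is what we want (the nearest date in "on" that precedes the date in "timeseries").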
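
To make the backward search concrete, here is a minimal sketch on a reduced, timezone-naive version of the data. The frame and column names ("left", "right", "t", "start") are illustrative and not taken from the answer.

```python
import pandas as pd

# Timestamps to classify (must be sorted for merge_asof)
left = pd.DataFrame(
    {"t": pd.to_datetime(["2000-01-01 12:00", "2000-01-03 14:00", "2000-01-04 13:00"])}
)
# Period start times to match against (also sorted)
right = pd.DataFrame(
    {"start": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"])}
)

# direction="backward" is the default: each "t" gets the last "start" <= "t"
matched = pd.merge_asof(left, right, left_on="t", right_on="start")
print(matched)
```

Note that a backward match attaches a timestamp to the closest earlier key even when that key lies on a different day: in this sketch 2000-01-03 14:00 is paired with 2000-01-02, so such rows may need to be filtered out if only same-day matches are wanted.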

Then we use groupby and agg to build the groups.

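We also need to pick up the dates in "on" that have no match in "timeseries" (in this case 2000-01-05).

Note: you say you want to avoid loops for efficiency. In theory that is a good idea, but be aware that the result you are aiming for (lists as values in a column) is itself inefficient, and is also fairly demanding on top of pandas.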
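
For comparison, a possible alternative that avoids merge_asof is sketched below. It is not the method described above: it maps each timestamp to its daily period with to_period, groups on that, and then reindexes on "on". It assumes the periods are meant in UTC (the timezone is dropped before converting), and the names "stamps", "periods", "groups" and "result" are illustrative.

```python
import pandas as pd

timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z", "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z", "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z", "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z", "2000-01-04 13:00:00Z",
        ]
    ),
    data={"value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0]},
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# Timezone-aware timestamps as values, so the lists keep the +00:00 offset
stamps = timeseries.index.to_series()

# Assumption: periods are UTC days, so drop the timezone before to_period
periods = timeseries.index.tz_localize(None).to_period("D")

# One list of timestamps per day that actually occurs in the data
groups = stamps.groupby(periods).agg(list)

# Keep only the periods in `on`; periods without data become NaN, then []
result = (
    groups.reindex(on)
    .apply(lambda v: v if isinstance(v, list) else [])
    .to_frame("groups")
)
print(result)
```

The final apply still runs once per period in Python, but the assignment of timestamps to periods is done by pandas rather than by a per-period boolean mask.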
