Counting repeated values in multiple time series

Question

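I have a fairly simple question. I have a dataset containing several products and their prices over time. I need to identify the periods in which the price stays unchanged over consecutive days (weekends excluded, so Friday and Monday count as consecutive).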

Dataset:

ID	Date	Price
1	2024-02-01	1,00
1	2024-02-04	1,00
1	2024-02-05	1,20
1	2024-02-06	1,30
1	2024-02-07	1,30
2	2024-02-01	1,30
2	2024-02-04	0,90
2	2024-02-05	0,90
2	2024-02-06	0,90
2	2024-02-07	1,30

My output should look like this:

ID	Start date	End date	Price	Repeated days
1	2024-02-01	2024-02-04	1,00	2
1	2024-02-06	2024-02-07	1,30	2
2	2024-02-04	2024-02-06	0,90	3

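As you can see, I need to avoid the following problems:

2024-02-01 to 2024-02-04: as mentioned above, these count as consecutive days, i.e. the weekend of 2024-02-02 and 2024-02-03 is ignored.
If the same price appears again later in the series, it should not be counted. Only consecutive days count.
If the same price (here 1,30) appears on the very next row but under a different ID, it should not be treated as consecutive, so a plain `shift().cumsum()` does not work here.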
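To illustrate the third point, here is a minimal sketch on a made-up four-row sample (the values are hypothetical, chosen so that the last row of ID 1 and the first row of ID 2 share a price). It compares a run key built from the price alone with one that also starts a new run whenever the ID changes. Dates are left out on purpose, so this shows only the ID boundary issue.

```python
import pandas as pd

# Hypothetical mini-sample: ID 1 ends and ID 2 starts with the same price.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Price": ["1,30", "1,30", "1,30", "0,90"],
})

# Global key: a new run starts only when the price changes,
# so the run leaks across the ID boundary.
price_changed = df["Price"].ne(df["Price"].shift())
global_run = price_changed.cumsum()

# Per-ID key: a new run also starts whenever the ID changes.
id_changed = df["ID"].ne(df["ID"].shift())
per_id_run = (price_changed | id_changed).cumsum()

print(global_run.tolist())  # [1, 1, 1, 2]
print(per_id_run.tolist())  # [1, 1, 2, 3]
```

With the global key, the first row of ID 2 is merged into the run of ID 1. The per-ID key separates them.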

Any ideas? Maybe `shift().cumsum()`, but only within each ID?

So far I have tried a few variants of `shift().cumsum()` followed by `groupby().agg({"date": ["min", "max"], "price": "size"})`, but none of them solves all of the problems above at the same time.

Answer 1

You can try:

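import pandas as pd


def get_next_day(d):
    # Move to the next calendar day, then skip the non-trading days.
    # dayofweek 4 and 5 are Friday and Saturday, which matches the sample
    # data (2024-02-02 and 2024-02-03 are skipped). For a Saturday/Sunday
    # weekend, test for 5 and 6 instead.
    d += pd.Timedelta("1 day")

    if d.dayofweek == 4:
        d += pd.Timedelta("2 days")
    elif d.dayofweek == 5:
        d += pd.Timedelta("1 day")

    return d


def yield_group(g):
    # Emit a summary row only for runs longer than a single day.
    if len(g) > 1:
        yield {
            "ID": g[0][0],
            "start_date": g[0][1],
            "end_date": g[-1][1],
            "price": g[0][2],
            "repeated_days_count": len(g),
        }


def generate_groups(ids, dates, prices):
    current_group = []
    prev_id = None
    prev_price = None
    expected_next_day = None

    for i, d, p in zip(ids, dates, prices):

        if prev_id is None:
            # First row: start the first run.
            current_group.append((i, d, p))
        elif expected_next_day != d or prev_id != i or prev_price != p:
            # Date gap, new ID, or new price: close the run and start another.
            yield from yield_group(current_group)
            current_group = [(i, d, p)]
        else:
            current_group.append((i, d, p))

        prev_id = i
        prev_price = p
        expected_next_day = get_next_day(d)

    # Flush the last run.
    yield from yield_group(current_group)


# df is the question's DataFrame with the columns ID, Date and Price,
# sorted by ID and Date.
df["Date"] = pd.to_datetime(df["Date"])

out = pd.DataFrame(generate_groups(df["ID"], df["Date"], df["Price"]))
print(out)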

Prints:

   ID start_date   end_date price  repeated_days_count
0   1 2024-02-01 2024-02-04  1,00                    2
1   1 2024-02-06 2024-02-07  1,30                    2
2   2 2024-02-04 2024-02-06  0,90                    3
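Since the question asks about `shift().cumsum()` within each ID, a vectorized variant of the same idea might look like the sketch below. It is not part of the answer above. The function name `constant_price_runs` and its `weekmask` parameter are hypothetical, and the working-day arithmetic relies on `numpy.busday_offset`. In the calendar, 2024-02-02 and 2024-02-03 fall on a Friday and a Saturday, so the call uses a Sunday-to-Thursday week mask to reproduce the sample output. The default Monday-to-Friday mask corresponds to the Saturday/Sunday weekend described in the question.

```python
import numpy as np
import pandas as pd


def constant_price_runs(df, weekmask="Mon Tue Wed Thu Fri"):
    """Return runs of 2+ consecutive working days with an unchanged price per ID."""
    df = df.sort_values(["ID", "Date"]).reset_index(drop=True)

    # Next working day after each row's date, according to `weekmask`.
    days = df["Date"].to_numpy().astype("datetime64[D]")
    next_day = pd.Series(
        pd.to_datetime(np.busday_offset(days, 1, roll="backward", weekmask=weekmask)),
        index=df.index,
    )

    # A new run starts when the ID or the price changes, or when the date is
    # not the working day expected after the previous row.
    new_run = (
        df["ID"].ne(df["ID"].shift())
        | df["Price"].ne(df["Price"].shift())
        | df["Date"].ne(next_day.shift())
    )
    run_id = new_run.cumsum().rename("run")

    out = df.groupby(run_id).agg(
        ID=("ID", "first"),
        start_date=("Date", "min"),
        end_date=("Date", "max"),
        price=("Price", "first"),
        repeated_days_count=("Date", "size"),
    )
    # Keep only runs that span more than one day.
    return out[out["repeated_days_count"] > 1].reset_index(drop=True)


# Sample data from the question (prices kept as strings).
sample_dates = ["2024-02-01", "2024-02-04", "2024-02-05", "2024-02-06", "2024-02-07"]
df = pd.DataFrame({
    "ID": [1] * 5 + [2] * 5,
    "Date": pd.to_datetime(sample_dates * 2),
    "Price": ["1,00", "1,00", "1,20", "1,30", "1,30",
              "1,30", "0,90", "0,90", "0,90", "1,30"],
})

out = constant_price_runs(df, weekmask="Sun Mon Tue Wed Thu")
print(out)
```

On the sample data this should yield the same three runs as the loop-based version. The trade-off is that the weekend rule lives in a single `weekmask` string rather than in a hand-written `get_next_day` helper.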
